A Systematic Review of Psychometric Properties of Knee-Related Outcome Measures Translated, Cross-Culturally Adapted, and Validated in Arabic Language

Over the past two decades, patient-reported outcome measures (PROMs) have been extensively tested and validated in different languages across the globe. This systematic review aimed to identify the knee disease-specific outcome tools available in Arabic and to evaluate the methodological quality of the psychometric properties of the most promising tools, based on the COSMIN checklist and PRISMA guidelines. Articles published in English, from the inception of the databases until the date of search (10 August 2022), were included. Articles that did not evaluate at least one psychometric property (reliability, validity, or responsiveness), and articles in languages other than English, were excluded from the study. The key terms [“Arabic” AND “Knee” AND (“Questionnaire” OR “Scale”)] were used in the advanced search of three databases, i.e., PubMed, Scopus, and Web of Science (WoS). Key terms were searched in the title or abstract for PubMed and in the topic (TS) for WoS. The COSMIN (COnsensus-based Standards for the selection of health Measurement Instruments) risk of bias checklist was used to evaluate the methodological quality of the psychometric properties of the Arabic knee-related outcome measures. A total of 99 articles were identified in PubMed, Scopus, and WoS. After applying the inclusion and exclusion criteria, 20 articles describing 22 scales from five countries were included in this review.
The instruments validated in the Arabic language are the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), knee injury and osteoarthritis outcome score (KOOS), knee outcome survey-activities of daily living scale (KOS-ADLS), Oxford knee score (OKS), osteoarthritis of knee and hip health-related quality of life (OAKHQoL) scale, Lysholm knee score (LKS), International Knee Documentation Committee subjective knee form (IKDC), intermittent and constant osteoarthritis pain (ICOAP) questionnaire, Kujala patellofemoral pain scoring system (PFPSS), anterior knee pain scale (AKPS), and osteoarthritis quality of life questionnaire (OAQoL). All were found to have good test-retest reliability (intraclass correlation coefficient), internal consistency (Cronbach’s alpha), and construct validity (against the Visual Analog Scale, Short Form-12, RAND-36, etc.). Of 20 instruments available to assess self-reported knee symptoms and function, 12 were validated in the Saudi Arabian population. Among them, KOS-ADLS is the best PROM to be used in various knee conditions, followed by KOOS and WOMAC. The assessed methodological quality of the evidence indicates that the Arabic knee PROMs are reliable instruments to evaluate knee symptoms/function.


Introduction
Knee pain is one of the most common musculoskeletal conditions, with every fifth individual aged 30 or over suffering from knee pain [1,2]. Age, female gender, and obesity are some risk factors for knee pain, including knee osteoarthritis (OA) [2-4]. More than 250 million people are affected globally, with increased years lived with disability [5,6].

Registration and Protocol
The protocol of this systematic review followed the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines [22] and was registered with the International Prospective Register of Systematic Reviews (PROSPERO), registration No. CRD42020203456, available from: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=203456&VersionID=1580292 (accessed on 13 August 2020). According to the registered protocol, the review was due to be completed by 31 August 2020; however, owing to unavoidable circumstances, the search and update were extended until 10 August 2022.

Eligibility Criteria
Articles published in English, from the inception of the databases until the date of search (10 August 2022), were included. Eligible articles focused on cross-cultural validation and psychometric analysis of Arabic versions of subjective knee outcome measures in various knee disease populations, such as knee OA, ACL, meniscus, patellofemoral, and postsurgical patients. Review articles, articles that did not evaluate at least one psychometric property (reliability, validity, or responsiveness), and articles in languages other than English were excluded from the study.

Information Sources
The search was conducted in three databases, namely MEDLINE/PubMed, Scopus, and Web of Science (WoS), from inception until 10 August 2022.

Search Strategy
Key terms used were [("Questionnaire" OR "Scale") AND "Knee" AND "Arabic"] in the advanced search option. Key terms were used in the 'title/abstract' section of PubMed and the 'topic (TS)' of WoS.
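As an illustration only, the Boolean strategy above can be written out as query strings for two of the databases. The sketch below assumes the PubMed `[Title/Abstract]` field tag and the WoS `TS=` topic tag as the way the stated fields would be entered; the exact strings typed by the authors are not given in the text, and the function names are hypothetical.

```python
# Illustrative query builders; the exact strings used in the review are not reported.
OR_TERMS = ["Questionnaire", "Scale"]   # terms combined with OR
AND_TERMS = ["Knee", "Arabic"]          # terms combined with AND


def pubmed_query(or_terms, and_terms):
    """Build a PubMed query restricted to title/abstract (assumed field tag)."""
    tag = "[Title/Abstract]"
    or_part = " OR ".join(f'"{t}"{tag}' for t in or_terms)
    and_part = " AND ".join(f'"{t}"{tag}' for t in and_terms)
    return f"({or_part}) AND {and_part}"


def wos_query(or_terms, and_terms):
    """Build a Web of Science topic (TS) query (assumed field tag)."""
    or_part = " OR ".join(f'"{t}"' for t in or_terms)
    and_part = " AND ".join(f'"{t}"' for t in and_terms)
    return f"TS=(({or_part}) AND {and_part})"


if __name__ == "__main__":
    print(pubmed_query(OR_TERMS, AND_TERMS))
    print(wos_query(OR_TERMS, AND_TERMS))
```

Keeping the term lists separate from the field syntax makes it easier to rerun the same strategy when the search is updated.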

Selection Process
Two reviewers (MAq and AAk) independently assessed the titles and abstracts of the studies, and then the full-text articles, against the inclusion and exclusion criteria. Cohen's kappa coefficient was calculated to quantify the strength of agreement between the reviewers [20]. Both reviewers discussed cases of conflict with the first author (MAt) until consensus was reached. MAz, SAs, Aal and Sal reviewed the articles for final approval.
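For readers unfamiliar with the statistic, a minimal sketch of Cohen's kappa for two reviewers' screening decisions follows. The decision lists are hypothetical and are not data from this review; the function name is likewise an illustrative choice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of agreement
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = count_a.keys() | count_b.keys()
    p_expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical include/exclude decisions for ten abstracts
    reviewer_1 = ["in", "in", "out", "out", "in", "out", "out", "out", "in", "out"]
    reviewer_2 = ["in", "out", "out", "out", "in", "out", "out", "in", "in", "out"]
    print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

In this made-up example the reviewers agree on 8 of 10 abstracts, but because about half of that agreement would be expected by chance, kappa is noticeably lower than the raw agreement.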

Data Collection Process
The cross-cultural adaptation process was assessed against the guidelines given by Beaton et al. [18], and measurement properties were assessed according to the COSMIN guideline [21].

Data Items
The COSMIN risk of bias checklist is a standardized and validated scoring tool with 10 boxes: two on content validity, three on internal structure, and the remaining five on other measurement properties of the PROM. Each box was assessed against several standards (items), and each standard (item) was scored on a three-point rating scale (i.e., from high to low: "+" = sufficient, "−" = insufficient, "?" = indeterminate) [23]. The overall score for a study's methodological quality was determined by taking the lowest rating of any item in the box [21]. We did not report content validity because its evaluation requires a lengthy, systematic process, and many PROMs did not mention the standards reported by Terwee et al. [24]; instead, we used the stages of cross-cultural translation described by Beaton et al. [18]. Cross-cultural translation consists of five stages: forward translation, synthesis, backward translation, expert review, and pilot study [18].
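The "lowest rating counts" rule described above can be sketched in a few lines. The ranking below simply follows the high-to-low order in which the text lists the three symbols, which is an assumption about how ties between "−" and "?" are resolved; the function name and the use of an ASCII hyphen for "−" are illustrative choices.

```python
# Rank follows the order given in the text, high to low: "+", "-", "?" (assumption).
RATING_RANK = {"+": 2, "-": 1, "?": 0}


def box_rating(item_ratings):
    """Return the overall rating of one checklist box as its lowest item rating.

    An empty list means the property was not reported (NR).
    An unknown symbol raises KeyError.
    """
    if not item_ratings:
        return "NR"
    return min(item_ratings, key=lambda rating: RATING_RANK[rating])


if __name__ == "__main__":
    print(box_rating(["+", "+", "?"]))  # one indeterminate item decides the box
    print(box_rating(["+", "-", "+"]))
    print(box_rating([]))
```

The sketch makes the consequence of the rule visible: a single low item determines the box, however many items were rated sufficient.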
Structural validity, internal consistency, and cross-cultural validity were covered under internal structure. Five boxes were presented under other measurement properties: reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness. Apart from assessing boxes by standards (items), we evaluated each box against the updated criteria for good measurement properties [22]. The overall quality of each PROM was determined by combining the COSMIN scores [21] through a modified GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approach [25], which considers: (1) risk of bias, i.e., the methodological quality of the studies; (2) inconsistency, i.e., unexplained inconsistency of results across studies; (3) imprecision, i.e., the total sample size of the available studies; and (4) indirectness, i.e., evidence from populations other than the population of interest in the review [18]. During the quality assessment process, only the measurement properties (boxes) reported for each PROM were evaluated, and the other boxes were recorded as NR (not reported). It was therefore not necessary for a study to cover all properties (boxes).

Study Risk of Bias in Individual Studies
Any missing stage in the cross-cultural adaptation process, or any missing box in the COSMIN risk of bias checklist, was considered a risk of bias for the PROM. A country of origin other than Saudi Arabia was also considered a risk of bias because the wording and meaning of many words differ from one Arab country to another. The translation and adaptation of original questionnaires into local Arabic are critical to obtaining comprehensive, subjective data from the local population to evaluate a specific disease. Two independent reviewers assessed each included study manually and came to a consensus.

Effect Measures
Validity was assessed by correlation; internal consistency by Cronbach's alpha; and reliability by either the intraclass correlation coefficient (ICC) or Pearson's/Spearman's correlation coefficient (r/ρ). Measurement error was expected to be reported as the standard error of measurement (SEM), minimal detectable change (MDC), or area under the curve (AUC), and responsiveness as effect size (ES) or pre-post analysis.
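As a worked illustration of two of these measurement-error statistics, the sketch below uses the commonly cited formulas SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM, reading SEM as the standard error of measurement. The included studies may have used other variants, and the numbers in the example are hypothetical.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), with SD the standard deviation of the scores."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change(sem, z=1.96):
    """MDC at the confidence level implied by z (1.96 for 95%): z * sqrt(2) * SEM."""
    return z * math.sqrt(2.0) * sem


if __name__ == "__main__":
    # Hypothetical values: score SD of 10 points and a test-retest ICC of 0.91
    sem = standard_error_of_measurement(sd=10.0, icc=0.91)
    mdc95 = minimal_detectable_change(sem)
    print(f"SEM = {sem:.2f}, MDC95 = {mdc95:.2f}")
```

The √2 term reflects that a change score involves two measurements, each carrying its own error.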

Synthesis Methods
Reliability/validity was considered strong if the ICC was 0.70 or more, or if r was 0.80 or more. Internal consistency was considered strong if Cronbach's alpha was 0.70 or more. Effect size (ES) was classified as weak, moderate, or strong for values of 0.2-0.49, 0.5-0.79, and 0.8 or more, respectively [24]. A convergent correlation of 0.50 or more was considered acceptable when testing hypotheses for construct validity [18].
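The cut-offs above translate directly into simple checks. The following sketch encodes them as stated; the function names are hypothetical, and treating an effect size between 0.49 and 0.5 as weak, and one below 0.2 as unclassified, are assumptions for values the text does not cover.

```python
def effect_size_label(es):
    """Classify an effect size using the cut-offs stated in the text."""
    magnitude = abs(es)
    if magnitude >= 0.8:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "weak"
    return "unclassified"  # below 0.2; the text gives no label for this range


def is_strong_reliability(icc=None, r=None):
    """Strong if ICC >= 0.70 or, when a correlation is reported, r >= 0.80."""
    if icc is not None:
        return icc >= 0.70
    if r is not None:
        return r >= 0.80
    raise ValueError("provide either icc or r")


def is_strong_internal_consistency(cronbach_alpha):
    """Strong if Cronbach's alpha >= 0.70."""
    return cronbach_alpha >= 0.70


def is_acceptable_convergent_validity(correlation):
    """Acceptable if the convergent correlation is 0.50 or more."""
    return abs(correlation) >= 0.50


if __name__ == "__main__":
    print(effect_size_label(1.12))           # effect size reported for KOS-ADLS
    print(is_strong_reliability(icc=0.841))  # lowest ICC quoted in the review
```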

Study Selection
Entering the key terms returned a total of 99 hits in PubMed, Scopus, and Web of Science (WoS). Removing duplicates and adding one article from the reference lists of the retrieved articles left 45 articles at stage II. After eliminating articles based on the inclusion and exclusion criteria, we finally selected 20 articles for this review. Study selection is represented diagrammatically in the PRISMA flowchart in Figure 1.

Risk of Bias within Studies
Among the 20 included studies, three [25,28,40] did not convene an expert review (stage IV) to examine or discuss the back translation; two studies did not mention stage II [30,32]; three studies did not mention stage IV [14,25,40]; and two studies did not mention stage V [32,40]. All except two [26,36] used at least 10 patients at stage V (pilot study). Recall may be a source of bias in the test-retest reliability of one study [14]. The mode of administration was interview-based in one study [14,32]. The clinical population was not homogeneous in three studies [30,38,42]. OAKHQoL, Kujala PFPSS, LKS, and IKDC were not validated in the Saudi Arabian population (Table 1). Four studies included only males [26,27,29,33], while one study included only females [35], and in another two studies, the majority (~90%) of the patients were females. Eight studies [14,36-42] were from countries other than Saudi Arabia, using local dialects in their PROM.
Internal consistency of individual subscales was not mentioned for KOS-ADLS [27,30,42] and the anterior knee pain scale [31]. Evaluation of internal consistency was mentioned in the methodology, but values were not given in the results, for reduced WOMAC [28]. Test-retest reliability of individual subscales was not given for the anterior knee pain scale [31] and KOS-ADLS [42]. Measurement error was not reported for OAKHQoL, LKS, IKDC, Kujala PFPSS, ICOAP, and OAQoL [25,32,36,38,40,41]. Measurement error for subscales (only SEM, not MDC or AUC) was not provided for the anterior knee pain scale [31].

Results of Individual Studies
The minimal clinically important difference (MCID) was reported for four PROMs, i.e., reduced WOMAC, KOOS, Oxford knee score (all from Saudi Arabia), and KOS-ADLS (Kuwait). Subgroup analyses for age, sex, and joint involvement were done for OAKHQoL [31]. During the cross-cultural validation process, Sfax WOMAC removed three questions from the function subscale based on the floor effect (Table 2).
All studies evaluated at least three boxes/properties out of a possible eight boxes (range 3-5; median 4). All studies checked construct validity through hypothesis testing, followed by test-retest reliability (18 studies; one study not clear), internal consistency (16 studies; one study unclear), and measurement error (seven studies). None of the studies checked criterion validity, and one study each checked cross-cultural validity and responsiveness (Tables 2-4). Table 3 shows the good measurement properties of the individual included studies. It shows sufficient results for internal consistency and test-retest reliability, and intermediate results for construct validity and measurement error.

Results of Synthesis
Structural validity of the PROM was examined through factor analysis for KOS-ADLS [29], Sfax WOMAC [14], and KOOS [39]. Cronbach's alpha for internal consistency was greater than 0.8 for all subscales of the included PROMs, except OAKHQoL's social support and social function subscales (divergent items) [36]. Similarly, ICC values for test-retest reliability were greater than 0.8 in all subscales of the reported PROMs, except OAKHQoL's social support and social function subscales. SEM, MDC, and AUC for measurement error were all reported in only one study, i.e., KOS-ADLS [42]; four studies reported SEM and MDC for all subscales in their results, i.e., reduced WOMAC [28], KOOS [43], Oxford knee score [29], and KOOS-PF [33]; and two studies reported only SEM, i.e., anterior knee pain scale [31] and KOOS [39]. Construct validity measured by convergent correlation was within an acceptable value of 0.5 or more, except for five PROM subscales, i.e., WOMAC ADL subscale [37], OAKHQoL pain subscale [36], Sfax WOMAC function subscale [14], IKDC [38], and ICOAP [25]. Responsiveness was given for only one study, i.e., KOS-ADLS [42], with a strong ES of 1.12. Detailed quality criteria based on the updated COSMIN guidelines for good measurement properties are tabulated in Table 5.
There were three studies each in WOMAC and KOOS, and two in KOS-ADLS. We applied a modified GRADE approach to these PROMs. KOOS had a 'high' grade for internal consistency property, followed by WOMAC ('low' grade) and KOS-ADLS ('very low' grade). KOS-ADLS had a 'moderate' grade for test-retest reliability property, followed by WOMAC ('low' grade) and KOOS ('very low' grade). WOMAC had a 'high' grade for measurement error property, followed by KOOS ('moderate' grade) and KOS-ADLS ('low' grade). All three PROMs had a 'moderate' grade for construct validity property. All PROMs were validated on more than 250 patients in at least two countries. KOOS was validated in two different clinical populations, apart from KOS-ADLS.

Reporting Risk of Bias across Studies
Knee OA was the primary diagnosis in all studies except two, i.e., patellofemoral pain [31] and ACL, meniscal, and combined injuries [39]. All studies included more than a hundred patients except six [26,27,29,31,37,41], with 40 patients as the smallest sample [31]. The mean age was greater than 50 years in all studies except three [31,39,42]. All studies included both genders, except three [26,27,29], in which all patients were males, and one study [35], which recruited only females.

Summary of Evidence
Twenty validated tools were compiled in this systematic review, and the methodological quality of the psychometric properties of these Arabic knee-related, disease-specific patient-reported outcome measures (PROMs) was assessed. Of them, 12 were validated in the Saudi Arabian version of Arabic, and the remaining eight in other versions of Arabic.
The internal consistency (IC) of all the included studies yielded a Cronbach's α value between 0.7 and 0.9. The total Cronbach's α of the Oxford knee score (OKS) reported by Alghadir et al. 2017 was the highest [30], indicating excellent consistency between the items; this suggests that osteoarthritis (OA) participants found the items to follow their disease symptoms more coherently than in the rest of the included studies [25,28,30-34,43].
The reproducibility of all the included studies, assessed with a considerable time gap, showed good test-retest reliability of more than 0.8, the minimum value required in psychometric analysis; the lowest was 0.841, reported by Alageel M et al. [23], and the remaining eight studies were above the required value, with good to excellent reliability [23-31]. Furthermore, among the different domains/subscales of all the included studies, the Oxford knee score (OKS) by Alghadir et al. 2017 obtained a total ICC of 0.973 with an adequate recall period of one week [27], representing a promising property.
All the studies conducted in Saudi Arabia have shown structural validity; however, most of the studies did not fully (intermediate only) report the hypothesis testing for construct validity, except knee ICOPQ by Alageel et al. 2020; KOOS-PF by Ateef 2020; and OKS by Bodor et al. 2020 [25,33,34]. The Arabic OKS pain item subscale was strongly associated with the Arabic KOOS pain subscale (rs = 0.73), as the pain threshold in patients awaiting TKR was considered high; a correlation coefficient above 0.70 was considered strong [24].
Most of the included studies did not fully report cross-cultural validity, except for two studies, KOS-ADLS by Algarni et al. 2017 [30] and KOOS-PF by Ateef 2020 [33]. In the latter study, cross-cultural validity was very well reported by adapting the items to religious activity, such as prayers, since most of the items of KOOS-PF were analogous to prayer activities; patients with extreme symptoms would be able to appreciate the symptoms of the disease during such activities, fulfilling the meaning of true cross-cultural adaptation and psychometric validation of the adapted questionnaire [33]. The study by Algarni et al. 2017 tried to justify the cross-cultural adaptation by adjusting for the cultural background with a kneeling construct, which the Muslim populace could comprehend easily [30].

Minimal Important Change (MIC) or Minimal Important Difference (MID)
The Arabic reduced WOMAC by Alghadir et al. 2016 showed a total minimal important change (MIC) or minimal important difference (MID) score of 8.13, indicating a good response to treatment outcomes [28].
Among the PROMs, the KOS-ADLS seems to be a better option as it has fewer items [42] and can be used in various conditions. In addition, it has been validated in both clinical and post-surgical conditions. All measurement properties based on COSMIN guidelines were evaluated for KOS-ADLS and received methodologically high-quality ratings [30]. Both WOMAC and KOOS could be used in knee OA patients [14,39], but the inconsistencies found in WOMAC [14] led to the selection of KOOS to evaluate knee OA [39].
Even though WOMAC was validated in three countries, its content differed between studies. For example, Sfax WOMAC [14] removed some items from the function subscale, whereas reduced WOMAC [28] had a low number of items in the subscale. The original complete WOMAC was used by Faik et al. in the Moroccan Arabic dialect. KOOS [37], an extension of WOMAC, was used in two studies without mention of such difficulties. OKS [38] and KOOS-PF [33] were not validated in females before 2022 [35]. Later, this shortcoming was overcome for OKS by an Arabic translation that recruited 45% females during the translation and validation process, and for KOOS-PF by KOOS-PF-F [35]. The different dialects among the Arabic-speaking population are another important limitation in selecting knee-related PROMs.

Discussion
This is the first study to conduct a systematic review based on COSMIN guidelines to evaluate the methodological quality of the psychometric properties of different Arabic knee-related outcome measures. Among the three PROMs (WOMAC, KOS-ADLS, and KOOS-PF-F) [14,30,35,39] with two or more studies, KOS-ADLS seems a better option because it has fewer items [38] and can be used in various conditions. In addition, it has been validated in both clinical and post-surgical conditions. All measurement properties were evaluated for KOS-ADLS, in studies of methodologically high quality [30].
Overall, our review points to a scarcity of evidence of sufficient psychometric properties for knee-related outcome measures translated, cross-culturally adapted, and validated in Arabic. However, we wish to highlight a few key points that should be borne in mind when interpreting our outcomes. The COSMIN tool comprises categories with essentially arbitrary scores chosen as cut-offs to discriminate between adequate and inadequate measurement properties. Occasionally, a statistical outcome leading to a negative rating lies close to the score required for a positive rating. The same applies to the methodological quality ratings, which follow the 'worst score counts' algorithm reported by COSMIN. Quality was measured on a three-point rating scale (i.e., from high to low: '"+" = sufficient', '"−" = insufficient', '"?" = indeterminate') [23]. Under this algorithm, the final rating of methodological quality is determined by the minimum score obtained for that measurement property; thus, a single flaw can lead to a rating of 'insufficient' for a property that would otherwise be rated 'sufficient'. We employed a similar rule when rating the adequacy of the psychometric properties of knee-related outcome measures translated, cross-culturally adapted, and validated in Arabic, where data for various subscales were provided: one sub-optimal score was sufficient to yield a negative rating of the adequacy of that particular property. Taken together, these points mean that our findings may underestimate the adequacy of the measurement properties and the methodological quality of the evidence.
However, including studies not specifically focused on examining psychometric properties would have increased the risk of bias, led to an unwieldy number of studies for review, and made it more challenging for future researchers to reproduce our review. We believe that by adopting COSMIN, we have undertaken a rigorous approach to selection and rating. Furthermore, we believe that this review collectively documents the state of evidence on psychometric properties for individual PROMs.
The COSMIN initiative focuses on developing new methodology criteria and updating the existing ones, based on broad consensus. The COSMIN criteria were introduced relatively recently, focusing on biomedical healthcare and research, and on measuring constructs such as health-related quality of life, symptom status, or functional status [21,23,24]. Later, the methodology extended its scope to systematic reviews in other healthcare contexts, such as pediatric populations [45,46] and patients with fibromyalgia [47]. In considering this review, it should be borne in mind that many studies on the psychometric properties of knee-related outcome measures translated, cross-culturally adapted, and validated in the Arabic language were completed before the publication of the COSMIN criteria, which means that the authors of previous studies were not aware of these criteria and/or did not use them in their research. Also, it is not yet understood whether these standards apply generally to all types of PROMs.

Limitations
The scoring system adopted by COSMIN is a limitation of this systematic review of the psychometric properties of knee-related outcome measures translated, cross-culturally adapted, and validated in Arabic. In particular, counting the lowest score is problematic when assessing the methodological quality of the PROMs included in this review: because we rated according to COSMIN criteria, the overall score is '"?" = indeterminate' even when a particular PROM is '"+" = sufficient' on all requirements except one.
Another concern is the heterogeneity of the measurement properties reported for the included PROMs: most of the studies do not provide the full information recommended by COSMIN. The patient conditions in the review were also heterogeneous, as a few studies recorded PROMs post-surgery, as in ACLR, while a few others did so during regular rehabilitation follow-ups, such as for knee OA. Last, we could not establish whether the psychometric properties of the knee-related outcome measures translated, cross-culturally adapted, and validated in the Arabic language that were included in this review were not assessed as recommended by COSMIN, or were assessed but not reported as recommended by COSMIN.
The studies in our review reported a range of different statistics across the measurement properties. Also, there was no valid and reliable method to check for publication bias and researcher bias towards publishing positive results. As a result, it is possible that our review overestimates the adequacy of psychometric properties across measures, since there may be unpublished data showing negative results.
Finally, we end this discussion by noting that four scales, WOMAC [14], KOOS [39], KOS-ADLS [30], and KOOS-PF-F [35], have methodologically high-quality grades based on COSMIN guidelines for evaluating PROMs, except for the property of responsiveness. However, the above limitations should be considered before drawing clinical implications.

Conclusions
Current evidence from the included studies indicates that all Arabic knee PROMs are reliable instruments to evaluate knee symptoms/function. Among them, KOS-ADLS is the best PROM to be used in various knee conditions, with high-quality evidence, followed by KOOS; WOMAC could be utilized in knee OA patients, and KOOS-PF-F among females with PFPS.