Psychometric Properties of Quality of Life Questionnaires for Patients with Breast Cancer-Related Lymphedema: A Systematic Review

Background: Assessing quality of life (QoL) using a well-developed and validated questionnaire is an essential part of breast cancer-related lymphedema (BCRL) treatment. However, it is not yet known which QoL questionnaire has the best psychometric properties. The aim of this systematic review is to evaluate the psychometric properties of questionnaires measuring the QoL of patients with BCRL. Methods: A search was performed on 8 February 2022 to identify published studies in the electronic databases Medline (via Ovid), EBSCOhost, PubMed, Scopus, and Web of Science, using the following search terms: ‘quality of life’; ‘breast cancer’; ‘upper limb’; ‘lymphedema’; ‘questionnaire’; and ‘measurement properties.’ Two reviewers independently conducted article selection, data extraction, and quality assessment. A third reviewer helped resolve any disagreements between the two reviewers. The COSMIN checklist and manual were used to assess the quality of the included studies. Results: A total of nineteen articles covering nine questionnaires were included and assessed using the COSMIN Risk of Bias checklist. Most studies assessed only content validity, structural validity, internal consistency, reliability, and construct validity. Lymph-ICF-UL received the most ‘sufficient’ ratings and ‘high’ quality of evidence ratings for its measurement properties. Conclusion: Based on our assessment, the most appropriate questionnaire for use is Lymph-ICF-UL.


Introduction
Breast cancer is the most prevalent cancer diagnosis in both developed and less developed countries. It affects over two million women each year and causes the greatest number of cancer-related deaths among women. According to the International Agency for Research on Cancer, more than six hundred thousand women globally died from breast cancer in 2018 [1]. In recent years, advances in breast cancer management have led to a higher survival rate from this disease [1], resulting in greater demand for post-cancer care [2].
However, these treatment advances also come with side effects, such as fatigue, psychological distress, arm lymphedema, or sexual dysfunction [3][4][5]. Arm lymphedema, or breast cancer-related lymphedema (BCRL), affects about one in five breast cancer survivors (21.4%) [6], with reported incidence rates ranging from 15.5% to 54% [6][7][8][9][10]. The incidence tends to increase over time, up to 24 months following a breast cancer diagnosis or surgery [6]. Lymphedema is a chronic swelling resulting from the over-accumulation of protein-rich fluid in the extracellular space due to insufficient transport capacity of the lymphatic system [11,12]. Based on its etiology, there are two types of lymphedema: primary and secondary [13]. Factors that could increase the risk of developing lymphedema after breast cancer treatment are scarring from surgical procedures [14], the number of lymph nodes removed [15,16], chemotherapy [9], radiotherapy [15,16], obesity, and being

Quality Assessment
The quality of the full-text articles identified as eligible studies was assessed using the COSMIN checklist and scoring manual. The COSMIN steering committee developed an extensive methodological guideline and checklists for systematic reviews of PROMs [29]. The COSMIN guideline was developed in line with existing guidelines for reviews, such as the Cochrane Handbook for systematic reviews of interventions [35] and for diagnostic test accuracy reviews [36], the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [37], the Institute of Medicine (IOM) standards for systematic reviews of comparative effectiveness research [38], and the Grading of Recommendations Assessment, Development and Evaluation (GRADE) principles [39].
We used the COSMIN risk of bias checklist, one of three versions of the original COSMIN checklists, to assess the quality of the included PROMs [30]. This checklist provides the preferred design requirements and statistical methods for each measurement property. The term 'risk of bias' follows the Cochrane methodology for systematic reviews of trials and diagnostic studies and refers to whether the results of a study are trustworthy given its methodological quality [29]. The COSMIN risk of bias checklist consists of ten boxes: one for PROM development standards (box 1) and nine for measurement properties, namely content validity (box 2), structural validity (box 3), internal consistency (box 4), cross-cultural validity/measurement invariance (box 5), reliability (box 6), measurement error (box 7), criterion validity (box 8), hypotheses testing for construct validity (box 9), and responsiveness (box 10) [30]. Table 1 presents the definitions of these measurement properties adapted from the COSMIN guideline [30].

Table 1. COSMIN definitions of measurement properties.

Measurement Properties
Definition *

Content validity
The degree to which the content of a PROM is an adequate reflection of the construct to be measured

Structural validity
The degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured

Internal consistency
The degree of the interrelatedness among the items

Cross-cultural validity
The degree to which the performance of the items on a translated or culturally adapted PROM is an adequate reflection of the original version of the PROM

Reliability
The proportion of the total variance in the measurements which is due to "true" differences between patients

Measurement error
The systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured

Criterion validity
The degree to which the scores of a PROM are an adequate reflection of a "gold standard"

Hypothesis testing for construct validity
The degree to which the scores of a PROM are consistent with the hypotheses (for instance with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups) based on the assumption that the PROM validly measures the construct to be measured

Responsiveness
The ability of a PROM to detect change over time in the construct to be measured

* Definitions were adapted from the COSMIN manual for systematic reviews of PROMs [30]; PROMs = patient-reported outcome measures.

Quality assessment of included PROMs was performed in three steps. Two reviewers performed the quality assessment independently (E.M. and A.Z.). A further discussion with the third reviewer (N.A.M.N.) was available if no agreement could be reached.

Step 1. COSMIN Risk of Bias Checklist
The methodological quality assessment was performed using the corresponding boxes in the COSMIN risk of bias (RoB) checklist [30]. Each box consists of 4 to 35 items, and each item is rated on a four-point scale: 'V = very good', 'A = adequate', 'D = doubtful', or 'I = inadequate'. The overall rating of each study for a given box was determined by taking the lowest rating of any item within that box. This rating was then used in grading the quality of evidence (step 3b) [29].
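The "lowest rating counts" rule described above can be illustrated with a minimal Python sketch. The `Rating` enum and the `box_rating` function are hypothetical names introduced here for illustration only; they are not part of the COSMIN materials, and the item ratings in the example are made up.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Four-point COSMIN item scale, ordered from worst to best."""
    INADEQUATE = 1
    DOUBTFUL = 2
    ADEQUATE = 3
    VERY_GOOD = 4


def box_rating(item_ratings):
    """Return the overall rating of one box: the lowest rating of any item."""
    if not item_ratings:
        raise ValueError("a box needs at least one rated item")
    return min(item_ratings)


# Hypothetical item ratings for a single box of a single study.
example_items = [Rating.VERY_GOOD, Rating.ADEQUATE, Rating.VERY_GOOD, Rating.DOUBTFUL]
print(box_rating(example_items).name)  # DOUBTFUL
```

A single "doubtful" item is enough to pull the whole box down to "doubtful", which is why a box with many items can end up with a low overall rating even when most items were rated "very good".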

Step 2. Applying Criteria for Good Measurement Properties
Step 2a: Content validity. The result of each study on PROM development and content validity was rated against the 10 criteria for good content validity. The ratings of all available studies were then qualitatively summarized to determine whether the overall ratings of each PROM were sufficient (+), insufficient (−), or indeterminate (?) in terms of relevance, comprehensiveness, comprehensibility, and overall content validity [40]. If the content validity of a PROM was rated as insufficient, the PROM should not be recommended for use and was excluded from further evaluation of the remaining measurement properties [30].
Step 2b: Remaining measurement properties. The result of each study on the other measurement properties was rated against the updated criteria for good measurement properties as either sufficient (+), insufficient (−), or indeterminate (?) [29]. The updated criteria for good measurement properties are provided in Table 2.
Step 3a. Content validity. The overall ratings of each PROM determined in step 2a were also rated for the quality of evidence as either high, moderate, low, or very low, using a modified GRADE approach. GRADE rates the quality of evidence by considering the following factors: risk of bias (quality of the studies), inconsistency (of the results of the studies), indirectness (evidence comes from different populations, interventions, or outcomes than the ones of interest in the review), imprecision (wide confidence intervals), and publication bias [39]. However, only three of these factors were relevant in evaluating content validity: risk of bias, inconsistency, and indirectness [40].
Step 3b. Remaining measurement properties. The results of all available studies were summarized and rated again against the criteria for good measurement properties (Table 2) to determine whether the measurement properties of each PROM were sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?). If the results per study were all sufficient (or all insufficient, or all indeterminate), the overall rating was also sufficient (or insufficient, or indeterminate). In principle, to rate the qualitatively summarized results as sufficient (or insufficient), at least 75% of the results should meet the criteria [29]. Next, the quality of evidence for each measurement property was graded using the modified GRADE approach [39]. When evaluating the quality of evidence for measurement properties, only four of the five factors were considered: risk of bias, inconsistency, imprecision, and indirectness. Publication bias was not considered because it is difficult to assess in studies on measurement properties [29].
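The pooling and grading logic of Step 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the COSMIN procedure itself: the function names `summarize` and `grade` are hypothetical, indeterminate results are assumed to be set aside when at least one determinate result exists, and the number of levels to downgrade for each GRADE factor is treated as an input chosen by the reviewers.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def summarize(ratings, threshold=0.75):
    """Pool per-study ratings ('+', '-', '?') into one overall rating.

    Assumption: '?' results are ignored when any '+' or '-' result exists.
    """
    if not ratings:
        raise ValueError("no study results to summarize")
    rated = [r for r in ratings if r in ("+", "-")]
    if not rated:
        return "?"  # every available result was indeterminate
    share_sufficient = rated.count("+") / len(rated)
    if share_sufficient >= threshold:
        return "+"
    if 1 - share_sufficient >= threshold:
        return "-"
    return "±"  # inconsistent: neither side reaches the 75% threshold


def grade(downgrades):
    """Start at 'high' and go down one level per downgrade point.

    `downgrades` maps a GRADE factor (e.g. 'risk_of_bias', 'imprecision')
    to the number of levels deducted for it; the result never drops
    below 'very low'.
    """
    if any(points < 0 for points in downgrades.values()):
        raise ValueError("downgrade points must be non-negative")
    total = sum(downgrades.values())
    return LEVELS[max(0, len(LEVELS) - 1 - total)]


# Five results in line with the hypotheses and one not: 5/6 is above 75%.
print(summarize(["+", "+", "+", "+", "+", "-"]))
# Hypothetical deductions: one level for risk of bias, two for imprecision.
print(grade({"risk_of_bias": 1, "imprecision": 2}))
```

With these inputs the first call yields a sufficient (+) rating and the second yields "very low". How indeterminate results should enter the 75% calculation is a judgment call that the text does not spell out, so the handling shown here should be read as one possible choice.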

Study Outcomes
The literature search identified 1013 articles. The details of the study selection process were provided in the PRISMA flow chart (Figure 1). After duplicates were removed, a total of 698 studies were then excluded based on the title and abstract screening. Subsequently, 29 articles were included in the full-text screening. In the full-text screening, 10 articles were excluded, and finally, a total of 19 articles met the inclusion criteria.

Table 3 presents the characteristics of the 19 included studies. Thirteen studies translated and validated the original questionnaire into their respective languages. One study performed a revision of a PROM and investigated its measurement properties. One study conducted an assessment on the responsiveness of a questionnaire. The remaining four studies developed a new questionnaire then validated it. The average age of the samples included in the studies ranged from 19 to 92 years old.
Not all measurement properties were assessed for each PROM in the included studies. Reliability was assessed multiple times: internal consistency and test-retest reliability were assessed 18 and 14 times, respectively, while the assessment for measurement error was performed four times. All studies assessed the content validity, while the remaining validity domains were assessed 12 times for structural validity, 17 times for construct validity via hypothesis testing, and once for criterion validity. Meanwhile, responsiveness was only assessed twice.

Characteristics of Included PROMs
The characteristics of the nine identified PROMs are presented in Table 4. All included PROMs were evaluated in various languages. The number of items ranged from 14 to 68, with the number of subscales or domains ranging from two to seven. Five PROMs did not specify a recall period; the recall period of the remaining four ranged from the moment of assessment to two weeks. All included PROMs used total scores and domain scores to determine quality of life, except LYMPH-Q Upper Extremity, which used only scale scores.

Methodological Quality and Rating against Good Measurement Properties for the Results of Each Included Study
The methodological quality of the 19 studies assessing psychometric properties of QoL PROMs was rated as "very good" (41 times), "adequate" (13 times), "doubtful" (21 times), and "inadequate" (11 times). The results of all studies were rated against the criteria for good measurement properties, yielding "sufficient" ratings 109 times, "indeterminate" ratings four times, and "insufficient" ratings nine times. The findings of the included studies, the methodological quality ratings, and the ratings against the criteria for good measurement properties are presented in Table 5.

Overall Rating and Grading of the Quality of Evidence per Measurement Properties for Each PROM
Each study's results were summarized and rated again against the COSMIN criteria for good measurement properties to examine each PROM's quality as a whole. The summarized results of each PROM were rated as "sufficient" (39 times), "indeterminate" (three times), and "insufficient" (six times). The detailed assessment of the summarized results is presented in the last column of each PROM assessment in Table 5. The quality of evidence for each measurement property of each PROM is provided in Table 6.

Table 5 (recovered fragments).
Responsiveness (Lymph-ICF-UL). Internal responsiveness: a significant change in mean total score between pre- and post-intensive treatment (p < 0.05); no significant difference in mean total scores between pre- and post-treatment in the stable group (p > 0.05); moderate responsiveness for the total score (SRM = 0.65). External responsiveness: a significant difference in mean change score between responders and non-responders after intensive treatment (p < 0.001); weak correlation between ∆-Lymph-ICF-UL and the GPE scores; MCID (total scores) = 9%. Results in line with 5 hypotheses, but not with 1 hypothesis (5+, 1−); rating (+).
Construct validity. The correlations of the symptoms, function, appearance, psychological, and arm sleeve scales with each other were higher than their correlations with the information scale (r > 0.50); all six scales were associated with increased severity of arm swelling, reporting of an arm problem caused by cancer treatments, and wearing of a compression sleeve to reduce or prevent swelling in the past 12 months.

[...] [41,42]. Meanwhile, the other study, by Borman et al., included all seven sub-questions in the analysis, resulting in a total of 28 items assessed [43]. All Turkish versions of LYMQOL-Arm were rated "sufficient" for content validity and construct validity [41][42][43].
However, LYMQOL-Arm B was rated "insufficient" for structural validity because the model fit indices of the confirmatory factor analysis (CFA) did not meet the criteria for good measurement properties (CFI and TLI <0.95; RMSEA >0.06). Due to this "insufficient" rating for structural validity, internal consistency for LYMQOL-Arm B was rated "indeterminate", even though the Cronbach's α values of both domains and overall scores were good to excellent. Moreover, LYMQOL-Arm B was also rated "insufficient" for reliability because the ICC values were less than 0.7 [43]. Both versions' quality of evidence for content validity was "low". The low rating was given due to the lack of information on the content validation process [41][42][43]. LYMQOL-Arm A was rated "low" for reliability due to a low sample size (<100) and only one study with "adequate" quality available [42,43]. LYMQOL-Arm B received a "very low" rating for structural validity because it only has one study with "inadequate" quality [43].
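The rating logic applied to LYMQOL-Arm B above can be sketched in Python. The cut-offs for CFI, TLI, RMSEA, and ICC are the ones quoted in the text; everything else is an assumption made for illustration: the function names are hypothetical, a single fit index meeting its cut-off is assumed to be enough for a sufficient rating, the Cronbach's α cut-off of 0.70 is not stated in the text, and the numeric inputs in the example are invented.

```python
def rate_structural_validity(cfi=None, tli=None, rmsea=None):
    """Rate CFA model fit: '+' if any reported index meets its cut-off."""
    checks = []
    if cfi is not None:
        checks.append(cfi > 0.95)
    if tli is not None:
        checks.append(tli > 0.95)
    if rmsea is not None:
        checks.append(rmsea < 0.06)
    if not checks:
        return "?"  # no fit information reported
    return "+" if any(checks) else "-"


def rate_internal_consistency(structural_validity, alpha):
    """Cronbach's alpha is only interpreted when structural validity is sufficient."""
    if structural_validity != "+":
        return "?"
    return "+" if alpha >= 0.70 else "-"  # 0.70 cut-off is an assumption


def rate_reliability(icc=None):
    """ICC of at least 0.70 is sufficient; an unreported ICC is indeterminate."""
    if icc is None:
        return "?"
    return "+" if icc >= 0.70 else "-"


# Hypothetical values resembling the pattern described for LYMQOL-Arm B.
sv = rate_structural_validity(cfi=0.90, tli=0.89, rmsea=0.08)
print(sv, rate_internal_consistency(sv, alpha=0.93), rate_reliability(icc=0.65))
```

The example reproduces the pattern in the text: poor model fit gives an insufficient structural validity rating, which in turn makes internal consistency indeterminate regardless of how high Cronbach's α is.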
Lymphedema Life Impact Scale version 1 (LLIS ver.1) is an 18-item self-reported questionnaire that measures the physical, psychosocial, and functional impact of lymphedema on the lives of patients with BCRL. Each item is rated on a five-point Likert scale ranging from 1 to 5. LLIS ver.1 was rated "sufficient" for content validity, internal consistency, reliability, and construct validity with "moderate" quality of evidence. The "moderate" rating was given because some of the study participants were not BCRL patients (lower limb lymphedema patients made up 8.7% of the total study population for structural validity and internal consistency, 22.65% for reliability, and 2.8% for construct validity) [44,45].
Lymphedema Life Impact Scale version 2 (LLIS ver.2) is the updated version of LLIS ver.1 that includes a question regarding knowledge of lymphedema management and uses a 0 to 4 scoring system. LLIS ver.2 also has a separate question regarding the number of infection occurrences. It was rated "sufficient" for content validity, structural validity, internal consistency, reliability, and construct validity. However, LLIS ver.2 was rated "insufficient" for criterion validity due to a weak correlation with the gold standard measurement, limb volume difference (r < 0.40, p < 0.05). The quality of evidence for LLIS ver.2 was rated "high" only for construct validity. The quality of evidence for the other measurement properties varied from "very low" for reliability, through "low" for content validity and structural validity, to "moderate" for internal consistency and criterion validity. These scores were given for the following reasons: a poor description of the content validation process; only one available study with "adequate" quality on structural validity and reliability; insufficient sample size (<50 for reliability; <100 for criterion validity); and the inclusion of non-lymphedema patients in the structural validity, internal consistency, and criterion validity analyses (44.8% of the total study population) [46,47].
Lymphedema Functioning, Disability, and Health Questionnaire for Upper Limb (Lymph-ICF-UL) is a 29-item self-reported questionnaire developed by Devoogdt et al. in 2011 that aims to quantitatively evaluate problems in functioning related to lymphedema of the upper limb [49]. Compared with the other included PROMs, Lymph-ICF-UL had the greatest number of measurement properties assessed as recommended by COSMIN. It was rated "sufficient" for all reported measurement properties. Lymph-ICF-UL received a "high" quality of evidence score for all reported measurement properties, except structural validity and responsiveness, which were rated "moderate", and measurement error, which was scored "low" due to an insufficient number of studies of at least "adequate" quality [48][49][50][51][52][53].
Lymphedema Symptom Intensity and Distress Survey-Arm (LSIDS-A) is a lymphedema-specific questionnaire that assesses upper limb lymphedema and its multidimensional symptoms. LSIDS-A was rated "sufficient" for all reported measurement properties except construct validity, which was rated "insufficient" because more than 25% of the study results were not aligned with the predetermined hypotheses. The quality of evidence for LSIDS-A was scored "very low" for reliability because of an insufficient sample size (<100) and because only one study of "doubtful" quality was available. Moreover, the quality of evidence for content validity was scored "low" due to the lack of information on the content validation process [54,55].
Upper Limb Lymphedema 27 (ULL-27) is a patient-reported questionnaire that evaluates the QoL of patients with upper limb lymphedema in three domains (physical, psychological, and social). ULL-27 was rated "sufficient" for content validity, structural validity, and internal consistency. However, it was rated "indeterminate" for reliability and "insufficient" for construct validity. The "indeterminate" rating was given because reliability was not reported using a preferred measure, such as the intraclass correlation coefficient (ICC) or weighted kappa (reported: r = 0.40, p > 0.05). The "insufficient" rating was given because less than 75% of the results were aligned with the hypotheses. The quality of evidence for ULL-27 was scored "low" for content validity and "very low" for structural validity and reliability. These scores were given due to the lack of information on the content validation process and the insufficient sample size for reliability (<50). Furthermore, there was only one study of "inadequate" quality on structural validity and reliability [56,57].
Upper Limb Lymphedema Quality of Life Questionnaire (ULL-QoL) is a self-reported tool that measures the physical and emotional well-being of patients with upper limb lymphedema. It was rated "sufficient" for content validity, structural validity, internal consistency, reliability, construct validity, and responsiveness. However, the quality of evidence for reliability and responsiveness was scored "very low" due to insufficient sample size (<50 for reliability and <100 for responsiveness) and because there was only one study of "adequate" quality on reliability and only one methodologically "doubtful" study on responsiveness [58].
LYMPH-Q Upper Extremity is a patient-reported questionnaire that measures QoL among women with BCRL. LYMPH-Q consists of six independently functioning scales (appearance, function, psychological, symptoms, information, and arm sleeve), which means that only the scales relevant to the patient's situation need to be completed. Higher scores on the LYMPH-Q scales indicate a better quality of life. It was rated "sufficient" for content validity, reliability, and construct validity. The other measurement properties received different ratings: "insufficient" for structural validity, because the study did not provide enough information on the model fit, and "indeterminate" for internal consistency, as a result of the "insufficient" rating for structural validity. LYMPH-Q received "very low" for the quality of evidence for reliability because of a low sample size (<100) and because only one study of "doubtful" quality was available. Furthermore, similar to most of the PROMs reported in this review, the quality of evidence for LYMPH-Q was rated "high" for internal consistency and construct validity [59].

Discussion
Our review aims to assess the quality of the psychometric properties of QoL questionnaires and to propose the most valid and reliable PROM for clinical and research use. To our knowledge, this is the first systematic review and critical appraisal of published studies reporting the psychometric properties of PROMs measuring BCRL patients' QoL that used an updated COSMIN guideline and checklist.
Our findings indicated that most of the PROMs had evidence for only a few measurement properties, such as content validity, structural validity, internal consistency, reliability, and hypothesis testing for construct validity. There was inadequate evidence on cross-cultural validity, measurement error, criterion validity, and responsiveness. A total of thirteen studies [41][42][43][44][45][46][47][52,53][55][56][57] evaluated a translated version of a PROM, but cross-cultural validity has not yet been assessed. Cross-cultural validity should be assessed in these translation studies because it is essential to know whether the translated versions measure the construct in the same manner as the original version. Measurement error needs to be evaluated to distinguish actual changes from systematic and random error, so that the clinician can be more confident in the instrument's reliability. Criterion validity is required because, without it, a clinician cannot be sure that the instrument adequately reflects the gold standard. Responsiveness should be investigated to establish whether the instrument can detect change following the interventions received by patients. The diverse quality of measurement properties in the included studies might be the result of the different approaches used by the authors. This review revealed that only six studies used the COSMIN recommendations as their guideline in developing and validating the PROMs [48][49][50][51][52][53][58,59]. The studies that translated and validated PROMs into other languages also used different translation guidelines [41][42][43][44][45][46][47][52,53][55][56][57].
According to Prinsen et al., recommendations on the most suitable PROM for use in both clinical and research settings can be formulated by sorting the included PROMs into three categories: (A) PROMs that have the potential to be recommended as the most suitable PROM for the construct and population of interest (i.e., PROMs with evidence for sufficient content validity (any level) and at least low-quality evidence for sufficient internal consistency); (B) PROMs that may have the potential to be recommended, but for which further validation studies are needed (i.e., PROMs categorized in neither A nor C); (C) PROMs that should not be recommended (i.e., PROMs with high-quality evidence for insufficient measurement properties) [29,30]. Based on the quality assessments, we categorized the included PROMs as follows: (A) LLIS ver.1 [44,45], Lymph-ICF-UL [48][49][50][51][52][53], and ULL-QoL [58]; (B) LYMQOL-Arm [41,42] and LLIS ver.2 [46,47]; (C) LSIDS-A [54,55] and ULL-27 [56,57]. Prinsen et al. also advise recommending only the single most suitable PROM. If more than one PROM is difficult to differentiate in terms of quality, the one with the best evidence for content validity could be chosen as the most suitable instrument. It is also recommended that feasibility and interpretability be taken into consideration in the selection process [29,30].
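A literal reading of the three recommendation categories can be sketched in Python. The function name `categorize`, the data layout, and the example ratings are hypothetical, and checking category C before category A is an assumption. The sketch only encodes the rules as worded above, so it is not expected to reproduce every assignment in this review, which also involved reviewer judgment.

```python
QUALITY_ORDER = {"very low": 0, "low": 1, "moderate": 2, "high": 3}


def categorize(properties):
    """Assign category 'A', 'B', or 'C' to one PROM.

    `properties` maps a measurement property name to a tuple of
    (overall rating, quality of evidence), e.g. ("+", "moderate").
    """
    # C: high-quality evidence for any insufficient measurement property.
    if any(rating == "-" and quality == "high"
           for rating, quality in properties.values()):
        return "C"
    content = properties.get("content validity")
    consistency = properties.get("internal consistency")
    # A: sufficient content validity (any quality level) plus at least
    # low-quality evidence for sufficient internal consistency.
    if (content is not None and content[0] == "+"
            and consistency is not None and consistency[0] == "+"
            and QUALITY_ORDER[consistency[1]] >= QUALITY_ORDER["low"]):
        return "A"
    # B: everything that is neither A nor C.
    return "B"


# Hypothetical PROM with sufficient content validity and internal consistency.
print(categorize({
    "content validity": ("+", "moderate"),
    "internal consistency": ("+", "high"),
}))
```

For the hypothetical input shown, the rules lead to category A; adding a property rated insufficient with "high" quality of evidence would move the same PROM to category C.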
Feasibility is the ease of administration of the PROM, given the time or money constraints. Feasibility aspects include: patient's comprehensibility, clinician's comprehensibility, type and ease of administration, length of the instrument, completion time, patient's required mental and physical ability level, ease of standardization, ease of score calculation, copyright, cost of instrument, required equipment, availability in different settings, and regulatory agency's requirement for approval. Interpretability is the degree to which one can assign qualitative meaning to a PROM's quantitative scores or change in scores. Interpretability can be obtained from the following information: distribution of scores in the study population, percentage of missing items and percentage of missing total scores, floor and ceiling effects, scores and change scores available for relevant subgroups, minimal important change (MIC) or minimal important difference (MID), and information on response shift [30].
Among the three PROMs that we categorized as "A", Lymph-ICF-UL [48][49][50][51][52][53] has the best evidence for content validity, with "high" quality of evidence at every level (relevance, comprehensiveness, and comprehensibility). In terms of feasibility, Lymph-ICF-UL has short, clear, and straightforward questions and an 11-point numerical scale that can be easily understood by patients and clinicians. The questionnaire also comes with an easy score calculation that is available as an Excel formula. Lymph-ICF-UL takes only 5-10 min to complete and is available in various languages [48][49][50][51][52][53]. The other two PROMs are less suitable because they have only "moderate" quality of evidence for content validity; LLIS ver.1 [44,45] was validated in a population that included patients other than those with BCRL; and ULL-QoL [58] has less detailed questions on daily activities (e.g., work activities, leisure activities) than Lymph-ICF-UL (i.e., clean, iron, work in the garden, perform computer work, drive a car, ride a bike), which makes it harder to identify the patients' difficulties in specific daily activities. However, we were unable to compare the interpretability of the three PROMs due to the lack of information provided in the included studies. Overall, we consider Lymph-ICF-UL the most suitable PROM to assess QoL in BCRL patients.
Based on the quality of evidence assessments, we found that Lymph-ICF-UL [48][49][50][51][52][53] has been assessed for seven of the nine measurement properties suggested by COSMIN: content validity, structural validity, internal consistency, reliability, measurement error, hypothesis testing for construct validity, and responsiveness. Moreover, the overall rating of these measurement properties was mostly "sufficient" with "high" evidence levels. The structural validity was supported by exploratory factor analysis with acceptable factor loadings. The internal consistency of Lymph-ICF-UL was acceptable to excellent, with Cronbach's alpha values ranging from 0.72 to 0.98. The test-retest reliability was considered good to very good, with ICCs ranging from 0.79 to 0.95. Lymph-ICF-UL was also the only PROM reporting measurement error, with overall results of SEM = 4.51-12.6 and SDC = 12.5-34.91. The results for construct validity via hypothesis testing revealed that Lymph-ICF-UL has a moderate to high correlation with other PROMs measuring a similar construct. In terms of internal and external responsiveness, Lymph-ICF-UL was shown to be responsive to change after BCRL treatments.
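The reported SEM and SDC ranges can be related through the commonly used formula SDC = 1.96 × √2 × SEM. The short Python sketch below applies that formula; the assumption that the included studies derived their SDC values this way is ours, and the function name `sdc_from_sem` is hypothetical.

```python
import math


def sdc_from_sem(sem, z=1.96):
    """Smallest detectable change for an individual, derived from the SEM.

    Uses SDC = z * sqrt(2) * SEM, where z = 1.96 corresponds to a
    95% confidence level and sqrt(2) accounts for two measurements.
    """
    if sem < 0:
        raise ValueError("SEM must be non-negative")
    return z * math.sqrt(2) * sem


# Lower and upper bounds of the SEM range reported for Lymph-ICF-UL.
for sem in (4.51, 12.6):
    print(f"SEM = {sem}: SDC is about {sdc_from_sem(sem):.2f}")
```

The lower bound reproduces the reported SDC of 12.5, and the upper bound comes out close to the reported 34.91; the small gap is plausibly due to rounding of the published SEM, although this cannot be confirmed from the text.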
Moreover, our result is in agreement with a systematic review [21] which indicated that lymphedema-specific questionnaires have strong psychometric properties and offer greater validity and reliability in measuring the QoL of BCRL patients. A lymphedema-specific questionnaire contains items that address the patients' complaints more precisely than generic and cancer-specific questionnaires. The Lymph-ICF-UL domains (physical function, mental function, household, mobility, and social activities) were developed based on the International Classification of Functioning, Disability, and Health domains recommended by WHO [60].
The recommendation of a PROM does not depend only on the evaluation of measurement properties; it also considers other aspects (i.e., feasibility and interpretability). Interpretability and feasibility are not formal measurement properties because they do not refer to the quality of a PROM. Hence, they are only described and not evaluated. Both are nevertheless important aspects to take into account when selecting the most appropriate questionnaire, because poor comprehensibility for patients and clinicians may indicate insufficient content validity, and floor and ceiling effects can result in insufficient reliability.
A strength of this review is that, compared with the review by Cornelissen et al., which assessed only the completeness of the PROMs by counting the number of domains [32], it provides a focused and comprehensive assessment of the PROMs' measurement properties as recommended by COSMIN [29]. A sensitive search strategy developed by Terwee et al. [34] was applied to identify relevant studies. In addition, this is the first study to focus solely on the breast cancer-related lymphedema population. However, our decision not to use lymphedema severity as an inclusion criterion might be a limitation of this review, as it could make the results difficult to generalize to all stages of severity. Our rationale is that most studies did not specify the severity of lymphedema in their study population, making it difficult for us to identify it. Another limitation is the possibility of publication bias, since we assumed that PROM validation studies not identified through our search had not been carried out. Furthermore, since this study focuses only on PROMs assessing QoL in the BCRL population, other PROMs measuring QoL might have been omitted if they were not explicitly assessed in the BCRL population.

Conclusions
This systematic review provides an overview of the psychometric properties of updated PROMs assessing QoL in BCRL populations. Lymph-ICF-UL was found to have assessed most of the measurement properties as suggested by COSMIN and showed a "sufficient" overall rating with a high-quality level of evidence. Thus, we consider Lymph-ICF-UL to be a suitable PROM in measuring the QoL of patients with BCRL in either clinical or research settings.