A Scoping Review on Outcomes and Outcome Measurement Instruments in Rehabilitative Interventions for Patients with Haematological Malignancies Treated with Allogeneic Stem Cell Transplantation

Rationale: Allogeneic hematopoietic stem cell transplantation (HSCT) is associated with increased treatment-related mortality, loss of physical vitality, and impaired quality of life. Future research will investigate the effects of multidisciplinary rehabilitative interventions in alleviating these problems. Nevertheless, published studies in this field show considerable heterogeneity in selected outcomes and the outcome measurement instruments used. The purpose of this scoping review is to provide an overview of the outcomes and outcome measurement instruments used in studies examining the effects of rehabilitative interventions for patients treated with allogeneic HSCT. Methods: We conducted a scoping review that included randomized controlled trials, pilot studies, and feasibility studies published up to 28 February 2022. Results: We included n = 39 studies, in which n = 84 different outcomes were used 227 times and n = 125 different instruments were used for the measurements. Conclusions: Research in the field of rehabilitation for patients with haematological malignancies treated with allogeneic HSCT is hampered by the excess outcomes used, the inconsistent outcome terminology, and the inconsistent use of measurement instruments in terms of setting and timing. Researchers in this field should reach a consensus with regard to the use of a common terminology for the outcomes of interest and a homogeneity when selecting measurement instruments and measurement timing methods.


Rationale
Allogeneic hematopoietic stem cell transplantation (HSCT) improves the survival rate of patients with haematological malignancies and offers the best chance of cure for a wide range of patients [1,2]. Graft-versus-host disease (GvHD) is the most recognized complication after allogeneic HSCT [3]. Immunosuppressive therapy (IST) is used to prevent GvHD and, once GvHD occurs, to treat it and prevent further organ damage. GvHD and IST are the two factors most commonly associated with impaired quality of life in these patients [4], distinguishing them from patients undergoing autologous HSCT. In addition to impaired quality of life, patients treated with allogeneic HSCT for haematological malignancies may have increased treatment-related mortality and loss of physical vitality [5].
Rehabilitation is a complex problem-solving process delivered by multidisciplinary teams in inpatient or outpatient settings; it aims to improve the patient's quality of life and degree of social integration [6]. Rehabilitative interventions for patients undergoing allogeneic HSCT can improve physical vitality and quality of life as well as decrease mortality [7]. Moreover, early rehabilitation reduces the duration of hospitalization for allogeneic HSCT [8].
Rehabilitative interventions for allogeneic HSCT patients can be challenging with regard to their feasibility across the many phases of treatment. Prior to transplantation, problems related to blood count may not allow the patient to participate in certain rehabilitative interventions. During hospitalization, symptom burden, infections, blood count limitations, or severe fatigue may further prevent the use of rehabilitative interventions. Post-hospitalization, GvHD symptoms, blood count fluctuations, or even psychosocial factors may affect the feasibility of rehabilitative interventions. Researchers have long been aware of the importance of the feasibility of rehabilitative interventions among allogeneic HSCT patients [9], and they argue that feasibility and safety should be assessed prior to the development of rehabilitative programs [10].
Future research in this field will investigate the effects of multidisciplinary rehabilitative interventions in a variety of settings. Research in this field has already shown considerable heterogeneity in selected outcomes and in the outcome measurement instruments used [11,12]. Synthesizing, comparing, and interpreting the results from different studies can be challenging when they refer to different outcomes and are measured by different instruments.

Objectives
The purpose of this scoping review is to provide an overview of the outcomes and outcome measurement instruments used in studies examining the effects of rehabilitative interventions for patients treated with allogeneic HSCT, thus enabling a better understanding of the sources of heterogeneity.

Methods
This scoping review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR) statement (www.prisma-statement.org, accessed on 28 February 2022).

Data Sources and Study Selection
The search strategy was developed in collaboration with a librarian to retrieve articles of interest from the MEDLINE, EMBASE, and Cochrane databases in February 2022. Searches were performed with the following terms:
1. exp Hematopoietic Stem Cell Transplantation/ or (transplant* adj5 ("stem cell*" or "hematopoietic cell*" or "haematopoietic cell*")).ti,ab.
2. exp Rehabilitation/ or exp Physical Therapy Modalities/ or exp Exercise/ or exp Exercise Therapy/ or (rehabilitation* or rehabilitative or exercise* or physiotherap* or readaption* or readaptation* or readjustment* or kinesiotherap* or kinesitherap* or training* or (physical adj3 (therap* or treatment*))).ti,ab.
3. 1 and 2
4. 3 not (animals not humans).sh.
To be included, publications had to be randomized controlled trials, pilot studies, or feasibility studies; had to be published in English or in German; and had to investigate the effects of a rehabilitative intervention shortly before, during, or after allogeneic stem cell transplantation in adult patients with haematological malignancies. After removing duplicates, the titles and abstracts were screened by two reviewers (AM and DK) against the agreed-upon inclusion and exclusion criteria. Studies with no obvious relevance to the research questions were removed. Final inclusion was decided after retrieving and screening the full texts, and disagreements between reviewers were resolved by consensus.
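The Boolean logic of search lines 3 and 4 can be illustrated with set operations. The following is a minimal Python sketch and not a reproduction of the database query engine; the record identifiers and the sets they belong to are hypothetical.

```python
# Hypothetical record IDs returned by each search line (illustrative only).
line1_transplant = {"r1", "r2", "r3", "r4"}  # line 1: stem cell transplantation terms
line2_rehab = {"r2", "r3", "r4", "r5"}       # line 2: rehabilitation and exercise terms
animals = {"r3", "r4"}                       # records indexed as animal studies
humans = {"r4"}                              # records indexed as human studies

# Line 3: records matching both concepts (1 and 2).
line3 = line1_transplant & line2_rehab

# Line 4: drop records indexed as animals but not as humans.
animal_only = animals - humans
line4 = line3 - animal_only

print(sorted(line4))
```

In this toy example, record r4 is kept because it is indexed under both animals and humans, whereas r3 is removed because it is indexed under animals only.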

Data Extraction, Data Synthesis and Analysis
The following data were extracted from each article by the lead author: population, intervention, setting, year of publication, country where the research was conducted, outcomes used, outcome measurement instruments used, and timing of measurements. The outcomes and outcome instruments were extracted and classified based on the exact way the authors used them, regardless of the conformity of their terminology with the literature. For example, "aerobic capacity", "peak aerobic capacity", and "functional aerobic capacity" were considered and classified as three different outcomes. The extracted outcomes were not classified as primary or secondary, as this information could not be consistently retrieved from the studies. Furthermore, we classified them according to their measurement core area (Life Impact or Pathophysiological Manifestation) based on the conceptual framework of Boers et al. [13]. According to this framework, outcomes, including the symptoms, signs, events, and biomarkers, that describe how health conditions manifest themselves by abnormal physiology are classified as "Pathophysiological Manifestation" outcomes. Outcomes describing how patients feel, function, or survive are classified as "Life Impact" outcomes. Boers et al. [13] label adverse events separately in their framework in recognition of the prominent role of feasibility in outcome measurements. In this scoping review, we used a third core area to classify all of the feasibility concepts separately. Based on the descriptions of El Kotob et al. [14] and Thabane et al. [15], outcomes describing the feasibility with regard to the safety, processes, resources, and management of a study were classified as "Feasibility" outcomes. Furthermore, the timing of outcome measurements was extracted to show the time-point of the measurements in relation to the day of transplantation and the number of measurements in hierarchical order. 
We classified the timing of the measurements according to hospital or non-hospital settings.
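The extraction and counting approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used by the authors: the record layout, the function name summarize, and all rows (study names, instruments) are hypothetical. It shows how keeping outcome terms verbatim leads to three aerobic capacity variants being counted as three outcomes.

```python
from collections import defaultdict

# Each record: (study, outcome term as reported, instrument, core area).
# All rows are hypothetical and only illustrate the extraction format.
records = [
    ("Study A", "aerobic capacity", "cardiopulmonary exercise test", "Pathophysiological Manifestation"),
    ("Study B", "peak aerobic capacity", "cardiopulmonary exercise test", "Pathophysiological Manifestation"),
    ("Study C", "functional aerobic capacity", "six-minute walk test", "Pathophysiological Manifestation"),
    ("Study A", "fatigue", "questionnaire X", "Life Impact"),
    ("Study B", "fatigue", "questionnaire Y", "Life Impact"),
    ("Study C", "adherence", "attendance log", "Feasibility"),
]

def summarize(rows):
    """Count distinct outcomes, measurements, and distinct instruments per core area."""
    outcomes = defaultdict(set)
    instruments = defaultdict(set)
    measurements = defaultdict(int)
    for _study, outcome, instrument, area in rows:
        outcomes[area].add(outcome)        # terms are kept verbatim, not harmonized
        instruments[area].add(instrument)
        measurements[area] += 1
    return {
        area: {
            "outcomes": len(outcomes[area]),
            "measurements": measurements[area],
            "instruments": len(instruments[area]),
        }
        for area in measurements
    }

summary = summarize(records)
for area, counts in summary.items():
    print(area, counts)
```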

Core Area Feasibility
In the core area "Feasibility", n = 8 different outcomes were measured 30 times using n = 15 different instruments (Table 2). The outcome "feasibility" was the most frequently measured outcome in this core area. It was measured two times in studies that only included allogeneic HSCT patients and seven times in studies including mixed HSCT patients.

Core Area Life Impact
In the core area "Life Impact", n = 37 different outcomes were measured 105 times using n = 49 different instruments (Table 2). Fatigue was the most frequently measured outcome (n = 15) in all of the studies, regardless of design, setting, or the included population. It was measured using n = 12 different instruments. In studies that only included allogeneic HSCT patients, fatigue was measured 8 times using n = 7 different instruments. Studies including mixed HSCT patients measured fatigue 7 times using n = 8 different instruments.
Quality of Life (n = 5) and Health-Related Quality of Life (n = 4) were measured 9 times using n = 2 different instruments in studies that only included allogeneic HSCT patients. The instrument most frequently used to measure quality of life in this population was the EORTC QLQ-C30 [55]. In studies including a mixed HSCT population, Quality of Life (n = 5) and Health-Related Quality of Life (n = 7) were measured 12 times using n = 6 different instruments.
Depression was measured 11 times in studies including allogeneic HSCT (n = 5) or mixed HSCT (n = 6) patients. The Hospital Anxiety and Depression Scale [56] was the instrument most frequently used to measure depression in studies including mixed HSCT patients. Studies including only allogeneic HSCT patients used n = 6 different instruments.
Anxiety was measured eight times. In studies including only allogeneic HSCT patients, anxiety was measured in n = 3 studies using n = 3 different instruments. In studies including mixed HSCT patients, it was measured in n = 5 studies using n = 2 different instruments. The instrument most frequently used to measure anxiety was the Hospital Anxiety and Depression Scale [56]. Anxiety was measured in seven out of eight studies during the "Hospital" phase.

Core Area Pathophysiology
In the core area "Pathophysiological Manifestations", 39 different outcomes were measured 85 times using 61 instruments (Table 2). Endurance (n = 4) and handgrip strength (n = 4) were the most frequently used outcomes. Each outcome was used twice in studies including only allogeneic HSCT patients and twice in studies including mixed HSCT patients. All four studies used a handgrip dynamometer to measure handgrip strength. Endurance was measured using five different instruments, always during the "Hospital" phase.

Timing of Measurement
In 23 of the 39 studies, measurements were performed at two time points (see Table 3). The maximum number of measurements was n = 7 time points. Regardless of the setting, the initial measurements (T1) were not always performed on admission. A total of 22 studies were conducted in a hospital setting; in n = 13 studies, measurements were performed on admission, while in n = 9 studies, measurements were not performed on admission. A total of 17 studies were conducted in a non-hospital setting; in n = 13 studies, measurements were performed on admission, and in n = 4 studies, measurements were not performed on admission.
Table 3. Timing of measurement.

Discussion
In this review, we observed a tendency toward the use of the same specific outcomes and outcome measurement instruments within the two core areas "Feasibility" and "Life Impact"; however, we saw a much more diverse use of outcomes and tools in the core area "Pathophysiological Manifestations". Despite the use of the same outcomes and outcome measurement instruments in some areas, the scientific efforts in this field do not fully exploit the potential for evidence synthesis, clinical interpretation, and constructive implications for further research. The main reason for this is measurement bias due to the heterogeneity and inconsistency of the outcomes and outcome measurement instruments used, which is in line with statements in the COMET Handbook [57] that describe problems related to outcome reporting bias and inconsistency in outcome measurement. Below, we discuss four main aspects of measurement bias that we encountered based on our results.

Outcome Excess and Inconsistent Use
The 84 different outcomes measured in the included studies, together with the wide variety of terms used for the same outcomes, indicate an excess of outcomes and an inconsistent use of terms in the body of literature that we reviewed. For example, in the "Pathophysiology" core area, thirteen terms were used to describe similar outcomes, among which we recognize only three distinct outcomes. The first relates, to different degrees, to the body's capacity to produce energy through aerobic metabolic pathways (peak aerobic capacity, peak oxygen consumption, aerobic fitness, functional aerobic capacity, cardiorespiratory fitness, and aerobic endurance performance capacity). The second relates to the body's capacity to move itself in a specific manner within a specific timeframe (exercise capacity, functional exercise capacity, maximal exercise capacity, submaximal exercise capacity, and endurance). The third is a more complex outcome that includes multiple components of fitness (physical capacity and physical performance).
This heterogeneous use of terminology hampers communication between researchers and impedes synthesis in secondary research. It also generates confusion concerning the content of each outcome, which could lead to aberrant inclusions or exclusions in reviews or even incorrect interpretations by clinicians.
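How the thirteen reported terms collapse into three distinct outcomes can be sketched as a simple lookup. This is a minimal Python illustration, not a proposed core outcome set: the terms are those listed above, whereas the three concept labels and the function name harmonize are assumptions chosen only for this example.

```python
# Verbatim terms from the included studies, grouped under three concept labels.
# The labels themselves are assumptions chosen for this illustration.
CONCEPTS = {
    "aerobic capacity": [
        "peak aerobic capacity", "peak oxygen consumption", "aerobic fitness",
        "functional aerobic capacity", "cardiorespiratory fitness",
        "aerobic endurance performance capacity",
    ],
    "exercise capacity": [
        "exercise capacity", "functional exercise capacity",
        "maximal exercise capacity", "submaximal exercise capacity", "endurance",
    ],
    "physical capacity": ["physical capacity", "physical performance"],
}

# Invert the grouping into a term -> concept lookup.
TERM_TO_CONCEPT = {term: concept for concept, terms in CONCEPTS.items() for term in terms}

def harmonize(term):
    """Return the concept label for a reported term, or the term itself if unmapped."""
    return TERM_TO_CONCEPT.get(term.strip().lower(), term)

reported = ["Peak oxygen consumption", "endurance", "physical performance"]
print(len(TERM_TO_CONCEPT), "terms ->", len(CONCEPTS), "concepts")
print([harmonize(t) for t in reported])
```

A mapping of this kind is only as good as the consensus behind it, which is why an agreed terminology is needed before such harmonization can be applied in evidence synthesis.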
Researchers in the field of rehabilitation for patients treated with allogeneic HSCT should seek to reduce the number of outcomes they measure by reaching consensus about the relevant outcomes to be collected and reported, thus defining a core outcome set (COS). Ideally, COS development should involve patients, so that their needs and insights are taken into consideration.
Strength is an important outcome in the core area "Pathophysiology" because its reduction due to corticosteroid regimens can determine functional performance in long-term survivors of allogeneic HSCT [58,59]. Handgrip strength can be used as a surrogate marker of strength among patients undergoing allogeneic HSCT, and it can detect both strength loss and strength regain after allogeneic HSCT [60]. It is a widely used outcome in HSCT research, probably because it is practical to measure. Other authors underline the importance of this outcome during hospitalization for allogeneic HSCT since detecting strength loss can improve fall prevention [61]. However, in addition to handgrip strength, eight other aspects of strength were measured in the studies that we included (i.e., isokinetic leg performance, knee extension strength, muscle strength, peripheral muscle strength, strength, strength capacity, trunk strength, and upper limb muscle strength). As a result, there is again heterogeneity in the outcomes being measured, which hampers synthesis and adds data waste to this research field. Given the importance of the outcome strength for patients treated with HSCT, researchers should reach consensus on which aspect of strength is the most relevant to be measured.
In the "Feasibility" core area, we observed the interchangeable use of terms (for example, "accrual acceptance", "acceptability", "rate of participant enrolment", "recruitment", and "recruitment rate"), since similar terms were used to describe identical phenomena. The most frequently used outcome in this core area, feasibility, is in our view a multidimensional construct that comprises dimensions such as safety, attrition, acceptability, and adherence. Some researchers in the field of allogeneic HSCT rehabilitation have already begun to approach feasibility in this manner [14,62]. In this review, we noticed that various authors classified specific terms as distinct outcomes (i.e., "acceptability", "adherence", and "attrition"), while others used these terms as instruments to measure the outcome feasibility. This difference in definitions and outcome operationalization leads to incomparable data and is a waste of resources.
Dimensions such as safety, attrition, acceptability, and adherence should not be considered outcome measurement instruments and should not be used and reported as such because they refer to what is measured, i.e., an outcome, while an instrument refers to how an outcome is measured. Ideally, the research community in this field should reach a consensus on the definition of feasibility and on how to measure it.

Outcome Measurement Instrument Excess and Inconsistent Use
The 84 outcomes that were found were measured by 134 different measurement instruments. In the "Pathophysiology" core area alone, 59 different instruments were used to measure 39 different outcomes. This diversity indicates an excess of outcome measurement instruments.
This excess of outcome measurement instruments makes synthesis across studies more difficult. A meta-analytical systematic review studying the effects of physical activity on fatigue confirms our statement [63]. In that study, the authors had to describe intervention effects using standardized mean differences, which are more difficult to interpret, rather than weighted mean differences, because the studies that they reviewed used different outcome instruments to measure fatigue.
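The difference between the two effect measures can be illustrated with a short calculation. The following Python sketch uses Cohen's d as one common form of the standardized mean difference; the cited review may have used a different variant, and all study values below are hypothetical.

```python
import math

def mean_difference(m1, m2):
    """Raw mean difference; interpretable only in the instrument's own units."""
    return m1 - m2

def standardized_mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Hypothetical fatigue scores (intervention vs. control) on two different scales.
# Study 1 uses a 0-100 scale; study 2 uses a 0-10 scale.
study1 = dict(m1=40.0, sd1=20.0, n1=30, m2=50.0, sd2=20.0, n2=30)
study2 = dict(m1=4.0, sd1=2.0, n1=25, m2=5.0, sd2=2.0, n2=25)

for name, s in (("study 1", study1), ("study 2", study2)):
    md = mean_difference(s["m1"], s["m2"])
    smd = standardized_mean_difference(**s)
    print(f"{name}: mean difference = {md:.1f}, SMD = {smd:.2f}")
```

In this example, the raw mean differences (-10.0 and -1.0) cannot be pooled because they are expressed in different units, whereas both standardized values equal -0.50. The price of this comparability is that the result is expressed in standard deviation units rather than in points on a scale that clinicians know.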
Patients undergoing allogeneic HSCT commonly experience fatigue both during hospitalization and in the long term [64]. Different items could be relevant to measure fatigue in one situation but not in the other, since fatigue during hospitalization (i.e., cancer-treatment-related fatigue) may have different characteristics than long-term fatigue (i.e., cancer-related fatigue). However, the variety of instruments used to measure fatigue remains wide, making it difficult to compare fatigue measurements. An item response theory (IRT)-based item bank, such as the Patient-Reported Outcomes Measurement Information System (PROMIS) [65] Fatigue item bank, could address problems related to measuring different levels of fatigue, as tailored short forms for different patient populations can be developed or computer adaptive testing could be used.
The variety of outcome instruments has a positive impact when it serves the practicability of measurement in different settings and phases. For example, in our review, we found that n = 5 different instruments were used to measure the outcome "endurance". Patients treated with allogeneic HSCT are unable to perform the six-minute walk test or cardiopulmonary exercise testing during hospitalization, as they are generally restricted to their rooms to reduce the risk of infection and because they are connected to medication-administering devices. In this case, an endurance test that can be performed in a small space, such as the six-minute step test, has better practicability than the six-minute walk test. The appropriate use of a wide variety of outcome measurement instruments requires context- and phase-specific guidelines, which would help to avoid inconsistent scientific output. Ideally, such guidelines should be informed by clinimetric studies that confirm the reliability and validity of the indicated instruments in defined settings and phases.
We noticed that some instruments, such as the EORTC QLQ-C30 and the FACT, were often used to measure distinct outcomes such as Health-Related Quality of Life and Quality of Life [66]. We made the same observation for the six-minute walk test, which was used to measure different outcomes. Using a single measurement instrument to measure different outcomes is often not correct practice because the measurement properties of an instrument may be sufficient to measure one outcome but insufficient to measure another. Therefore, before use, researchers should ensure that the clinimetric properties of each outcome measurement instrument are appropriate for measurement in the population of interest.

Timing and Setting of Measurement Inconsistency
In this review, we found notable heterogeneity in the timing of measurements across studies. Our findings confirm those of van Haren et al. [67] that time-point heterogeneity does not allow for follow-up measurement synthesis in systematic reviews. The general condition of patients treated with allogeneic HSCT fluctuates depending on the phase of their treatment. At the beginning of hospitalization, they may be sturdy, but, later on and depending on chemotherapy intensity, they may suffer from severe fatigue, infection symptoms, and nutritional deficits due to mucositis or other reasons. When patients begin to recover, they gradually show an improved general condition. However, those who suffer from severe symptoms during hospitalization are usually weaker at discharge than at admission. Therefore, heterogeneity in the timing of measurements is an important source of bias since timing is associated with the general condition of the patient. For example, if the "baseline" measurements of one study are performed on admission and the final measurements are performed at discharge, then the results of these measurements or their differences are incomparable to those of another study in which the measurements were performed at day four or ten after admission and at three months after discharge.
Due to the fluctuating condition of patients treated with allogeneic HSCT, not all measurements are always feasible or even meaningful across settings. Measurements might have less value for patients, increase their workload during a period in which filling in questionnaires is not their highest priority, add to data waste, and increase heterogeneity in measurement timing. In order to avoid unnecessary patient effort and the production of data waste, and in an effort to improve our understanding of phenomena with established clinical significance, researchers should agree on some basic assumptions: (a) the phases they recognize in the process of allogeneic HSCT (i.e., before allogeneic HSCT, during hospitalization, 100 days after allogeneic HSCT, and one year after HSCT; Van der Lans et al. have already made efforts to recognize different phases based on patient insights during recovery [68]); (b) the outcomes to be measured in each phase; and (c) the timing at which the measurements for each outcome are taken and the method used to measure them in each phase.

Allogeneic HSCT vs. HSCT Population
In this review, we found that 64% of the reported research projects recruited both allogeneic HSCT and autologous HSCT patients. There are some arguments for combining these populations in a study, and there are no formal restrictions against doing so: the EBMT Handbook [69] does not have a dedicated article on rehabilitation from which arguments for distinguishing these two populations could arise. Both populations suffer from haematological malignancies, and both populations undergo transplantation. Therefore, researchers in the field of rehabilitation include samples from both populations to reach the targeted sample size more quickly.
However, major differences exist between these two populations, which could lead to problems during the interpretation of study results. First, although both undergo "transplantation", the two populations do not undergo the same medical treatment. Chemotherapeutic and, more importantly, immunosuppressive treatments differ with regard to duration and side effects. Second, allogeneic HSCT patients normally undergo a longer and stricter isolation period in addition to a longer planned hospital stay. Third, allogeneic HSCT patients often suffer from GvHD and require additional medical treatment, resulting in significant physical and psychological deterioration.
Consequently, these two different populations cannot be combined in research due to differences in measurement timing and the relevance of the outcomes.
There are many published studies indicating that patients from both populations have been recruited. However, the scientific community should consider whether recruiting patients from both populations is appropriate practice and should reach consensus concerning future practice.

Limitations
To our knowledge, this review is the first attempt to describe the outcomes and measurement instruments used in the study of rehabilitative interventions for patients undergoing allogeneic HSCT. Although we managed to elucidate major issues concerning heterogeneity in the outcomes and measurement instruments used, our findings must be interpreted in light of the limitations of this review. First, we only included interventional studies and we only included research published in German and English. This strategy may have prevented the retrieval and inclusion of publications in other languages and from a wider range of disciplines. As a result, this scoping review focuses on the main body of work on psychological and physical rehabilitative interventions. Second, we classified the outcomes we retrieved based on two different frameworks, as the Boers et al. framework was designed for another purpose and thus does not offer a distinct classification for feasibility outcomes. Finally, we extracted and classified outcomes and instruments according to the terms used by the authors, without modification or interpretation, and therefore, the extracted terms were not always appropriate.

Conclusions
Research in the field of rehabilitation for patients with haematological malignancies treated with allogeneic HSCT covers measurements in all relevant core areas. However, this field of study is hampered by excess outcomes and inconsistent outcome terminology. Furthermore, we detected the inconsistent use of measurement instruments in terms of setting and timing. The combined recruitment of allogeneic and autologous HSCT patients may exacerbate these problems, thus reducing the successful exploitation of the study results by hampering synthesis and clinical interpretation. We recommend that researchers reach a consensus with regard to the use of common terminology for the outcomes of interest and homogeneity in measurement instrument selection and measurement timing.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.