Measuring Equine-Assisted Therapy: Validation and Confirmatory Factor Analysis of an ICF-Based Standardized Assessment-Tool

The International Classification of Functioning, Disability, and Health (ICF) of the World Health Organization (WHO) was established as an international framework for monitoring rehabilitation outcomes and the impacts of health interventions since, as the term “functioning” implies, it emphasizes a person’s “lived health” in addition to their biological health status. Equine-assisted therapy (EAT) represents a holistic intervention approach that aims to improve both biomedical functioning and the patient’s lived health in relation to performing activities and participating in social situations. In this study, the psychometric properties of an ICF-based digital assessment tool for the measurement of the rehabilitation impacts of EAT were analyzed via simultaneous confirmatory factor analyses (CFA) and reliability and sensitivity tests. In total, 265 patients from equine-assisted therapy centers in Germany were included for CFA. Change sensitivity was assessed via multi-level analyses based on 876 repeated assessments by 30 therapists. Results show satisfactory model-fit statistics; McDonald’s omega (ML) showed excellent scores for the total scale (ω = 0.96) and three subscales (ω = 0.95, ω = 0.95, ω = 0.93). The tool proved to be change sensitive and reliable (change sensitivity p ≤ 0.001; retest r = 0.745, p ≤ 0.001). Overall, the developed assessment tool satisfactorily fulfills psychometric requirements and can be applied in therapeutic practice.


Introduction
In international health systems, the International Classification of Functioning, Disability, and Health (ICF) of the World Health Organization (WHO) plays a key role in patient recovery. The ICF has widely been established as an international standard for describing a person's health status and for monitoring the impacts of health interventions to ensure favorable health and rehabilitation outcomes, since it unites biomedical and "lived health" perspectives on functioning and health status [1,2]. Operationalized via the biopsychosocial approach, a person's functioning can be evaluated by the dynamic interaction between biological aspects, activities, and participation, as well as individual environmental and personal factors. Therefore, a holistic view of an individual's functioning status in rehabilitation contributes to monitoring the responses of health systems more precisely in relation to the suitability of said responses to their individual health needs [2]. Furthermore, in the WHO global health system, functioning has been considered a third health indicator, alongside mortality and morbidity, and has further been designated the key indicator for rehabilitation [2]. A global aim of successful health strategies has been the reduction in morbidity and mortality and the promotion and assurance of optimal functioning [2]. Monitoring the performance and outcomes of rehabilitation interventions and health services via the functioning indicator will be linkable to the ICF classification, as well as the upcoming ICD-11; therefore, the application and development of ICF-based assessment tools represent a promising method to reduce the complexity of the ICF and create precisely tailored therapeutic interventions in rehabilitation systems [2].
The development and use of standardized and valid assessment in equine-assisted therapy (EAT) has been a major challenge in scientific discourse in recent years. Scientific studies have identified EAT outcomes and effects, yet the absence of both common and consistent terminology and clarity in intervention targets and intended therapy outcomes remains a major obstacle to the professionalization of the field [3]. EAT represents a holistic intervention approach involving horses, which aims to improve both biomedical functioning and the patient's lived health in relation to their physical and mental abilities to perform activities and participate in social situations. The umbrella term EAT includes various subdisciplines of equine-assisted interventions, of which the most salient are EAT as curative education in individual and group settings and hippotherapy, a horse-assisted form of physical therapy [3]. Additionally, equine-assisted psychotherapy, trauma pedagogy, ergotherapy, and sports-related interventions represent other areas of equine-assisted therapy and support [3].
A comparison and collection of EAT findings has been complicated in the past years, not only across languages and countries, but also with respect to the reported conditions or intended outcomes of the target groups of EAT and its subdisciplines. As Wood et al. describe in their terminology consensus report, 78 scientific studies could be found that used the term "hippotherapy" in over 60 different ways to describe varying therapy contents and outcomes [4]. Besides restraining progress in the collection of scientific evidence about factors that influence the effects of EAT, this conceptual uncertainty poses a further practical difficulty in the form of reimbursement obstacles with stakeholders, for whom therapy orientations and outcomes might appear opaque [4].
In this regard, more accurately assessing EAT and its subdisciplines in intervention practice could be a relevant step within the field of therapeutic subdisciplines, but the challenge of making EAT rehabilitation outcomes comparable with the other health services and interventions within the global health care system must also be considered. To reliably monitor EAT intervention outcomes and validly assess whether rehabilitation goals have been attained in a manner that combines both biological and lived health, the linking of EAT to the ICF classification represents a promising approach to building the basis for systematic and standardized assessment in EAT. Since EAT goals are closely related to rehabilitation targets in terms of functioning, as represented in the ICF via the biopsychosocial view of the patient's health, the incorporation of EAT outcomes and factors affecting these outcomes could provide an important step in collecting and comparing evidence related to EAT interventions. Furthermore, linking EAT to the ICF classification could provide increased transparency for funding agencies and stakeholders by verifying and validating the effectiveness of therapy outcomes relative to other health care interventions.
In the past, a few studies have clinically tested the applicability of EAT interventions to the ICF and shown promising tendencies [5,6,7]. In a study by Hsieh et al., the authors concluded that their ICF-CY (ICF children and youth version) assessment approach provided a suitable framework to identify the physical benefits of hippotherapy for children with cerebral palsy (N = 14) [6]. The findings of Borino et al. confirmed the suitability of a self-developed ICF-based assessment tool for measuring behavioral changes and treatment effects in persons with intellectual disabilities participating in EAT and onotherapy (therapy with donkeys) (N = 23) [5]. The authors highlighted the suitability of the ICF-based assessment in terms of quantification of therapy effects, individual treatment planning, and the direct availability of health-related intervention outcomes to the international scientific community [5]. Lanning et al. used two standardized ICF-based questionnaires (the WHODAS 2.0 and the SF36v2) together with an ICF-linked qualitative interview to measure the effects of EAT on veterans with post-traumatic stress disorder (PTSD) (N = 51) [7]. The results indicated that the use of the ICF provided a comprehensive view of the overall functioning of PTSD-diagnosed individuals, examining changes on both mental and physical levels, including dynamic intervention outcomes and influencing factors such as environmental aspects and patients' current health status [7]. The authors emphasize the connection of these health states and physical and social environments in regard to their impacts on an individual's activities and participation domains, and stress the importance of not overlooking these factors when diagnosing and treating persons with reactive disorders such as PTSD [7].
Considering the promising tendencies of past studies, the aim of this study, realized by the Research Institute for Inclusion through Physical Activity and Sport, was the validation and confirmation of the multidimensional factor structure of an ICF-based standardized assessment tool for the measurement of functioning in EAT and its subdisciplines using the global language of the ICF. For this purpose, an ICF-based assessment tool was developed through an extensive scientific process, field tested in therapeutic practice, and analyzed with regard to its psychometric properties (including both EAT in general and the main subdisciplines of individual EAT, group EAT, and hippotherapy).

Study Design and Setting
The study employed a longitudinal design. The digital ICF-based assessment tool was tested in therapeutic practice by 30 therapists, who assessed 265 patients with indications for EAT nationwide in 26 EAT centers in Germany. The data collection took place from August 2020 to August 2021 (12 months). The collection period was extended from the originally planned eight-month period to twelve months due to restrictions caused by the COVID-19 pandemic. Therapists also assessed the therapy progress of 127 of these patients with the digital assessment tool over 15 weeks (876 repeated assessments in total). Furthermore, within two additional datasets, therapists assessed retest reliability with a one-week interval and interrater reliability through three repeated measures of the same patients, assessed by three raters.

Participants
Patients were included in the study if they had an indication for EAT and gave written informed consent (for children, this included the consent of parents or legal guardians). A medical declaration of no objection was obtained before therapy was carried out. Therapies at all involved centers were conducted by professionals according to the standardized nationwide procedure regulations of the German Curatorship for Therapeutic Riding, which additionally ensured that no contraindications were present [8]. The study was approved by the ethics committee of the German Sport University Cologne and conducted in accordance with the 1964 Helsinki Declaration and its later amendments (ethical approval code 076-2019).

Measures
The ICF-based assessment tool was administered using Questback UniPark (Questback GmbH, Cologne, Germany). It comprises a general module for the assessment of functioning in EAT overall and three specialized submodules for the assessment of the main EAT subdisciplines (EAT in the individual setting, EAT in the group setting, and hippotherapy). The assessment tool was developed in a preliminary study, in which a pilot tool was built from qualitative focus group findings and then linked to the ICF via Cieza's redefined linking rules [9]. Afterwards, it was field tested and modified on the basis of exploratory factor analyses and bivariate correlations. Thereafter, the assessment tool was reintroduced to therapeutic practice and evaluated with regard to its psychometric properties in this study. The complete tool can be found in Appendix B with the associated ICF codes and descriptive statistics (Table A5, general EAT module and all submodules). It is divided into a general module, which is used as the superordinate module for EAT, and three specific submodules: EAT in the individual setting, EAT in the group setting, and hippotherapy. Items are rated on a unipolar ten-step Likert scale from "does not apply at all" to "applies fully". Only endpoint categories were verbalized. The response scaling was designed in this way in order to assess functioning in EAT with a high degree of change sensitivity.
The general module contains 25 items, differentiated in three subscales (motor functioning, mental functioning, and psychosocial functioning). One example item, related to the mental functioning scale of the general module is G17. Can memorize processes and tasks in the therapy and reproduce them later, which is linked to the ICF code b1442 Retrieval of Memory. The submodule for EAT in the individual setting contains 11 items in total, differentiated in two subscales (specific motor functioning and specific mental functioning). An example item is IS05. Is able to adapt their movements to the movements of the horse in a targeted manner, linked to ICF code b1471 Quality of psychomotor functions of the specific motor functioning scale. The submodule for EAT in the group setting also contains 11 items, differentiated in two subscales (interpersonal functioning and intrapersonal functioning). One example item is GS06. Can handle conflict constructively, linked to ICF code d7103 Criticism in relationships, assigned to the interpersonal functioning scale. The hippotherapy submodule contains 16 items, which are differentiated in two subscales (movement functioning and motor control functioning). An example item for submodule H is H04. Can perceive proprioceptive stimuli (this includes, for example, the perception of movement and position), which is coded with ICF b260 Proprioception function. It is part of the motor control functioning scale.
All therapists evaluated both the general module and one specified submodule for their patients. In addition, demographic data (gender, age, disability, or chronic disease) were obtained.

Statistical Analyses
Statistical analyses were performed using the statistical software programs IBM SPSS 27 (IBM Corp, Armonk, NY, USA) and IBM SPSS AMOS 26 (IBM Corp, Armonk, NY, USA). Descriptive statistics of the general module and submodules were calculated (frequencies, means ± standard deviation). To determine the dimensionality and model fit of the conceptual model, confirmatory factor analyses (CFA) were carried out using the sample covariance matrix. Factorial validity was analyzed using maximum likelihood (ML) estimation. Global fit indices (χ² goodness-of-fit test, number of degrees of freedom (df), chi-square fit statistic/degrees of freedom (PCMIN/DF), comparative fit index (CFI), root mean square error of approximation (RMSEA), Akaike information criterion (AIC), consistent Akaike information criterion (CAIC), and modification indices (MI)) were examined. As stated by Sherer et al., CFI levels greater than 0.90 were considered to be acceptable, while levels greater than 0.95 were considered to represent a very good fit [10]. For RMSEA, levels of less than 0.08 indicated satisfactory model fit, whereas levels of <0.05 were considered to be a very good fit [10]. AIC and CAIC were used as comparative criteria in the fitting process: lower AIC and CAIC values indicated a better fit among the models compared [10]. Construct validity was examined via Cronbach's alpha (α) and McDonald's omega (ω). Scales were considered reliable with values of α/ω ≥ 0.70 and α/ω ≥ 0.80 [11,12]. Sensitivity was determined based on an aggregated dataset via repeated measurements over 15 weeks, which were analyzed using hierarchical linear mixed models (GLMM, multi-level analyses). Test stability was assessed via retest reliability and inter-rater reliability on the basis of two additional datasets. Retest reliability was assessed with a one-week interval and analyzed via Pearson correlations. Values greater than 0.7 were considered acceptable, and values greater than 0.8 were considered good [13]. Inter-rater reliability was assessed via intraclass correlations (ICC) of three repeated measures of the same patients with equal intervals, each of which was evaluated by three therapists. Intraclass correlation values over 0.6 were considered good and values over 0.75 were considered very good [14]. Because of the small subsample size, normality could not be assured and the assumptions of repeated-measures ANOVA were not met; the ICC results were therefore examined with the nonparametric Friedman test with Bonferroni adjustment.
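As an illustration of the alpha coefficient named above, Cronbach's alpha can be computed directly from an item-score matrix as α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). The following is a minimal sketch in plain Python, not the SPSS procedure used in the study; the function name `cronbach_alpha` and the toy ratings are hypothetical.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the summed scale score
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings on a ten-step scale (4 patients x 3 items), not study data
toy = [[2, 3, 3], [5, 5, 6], [7, 8, 7], [9, 9, 10]]
alpha = cronbach_alpha(toy)
print(round(alpha, 3), "meets 0.70 cutoff:", alpha >= 0.70)
```

Because the toy items rise and fall together across patients, the coefficient is close to 1; weakly related items would pull it toward 0.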

Results
The sample included 265 patients in total (men = 119, women = 145, other = 1). Of these patients, 55 were adults (>18 years) and 209 were children (not specified: 1). Disabilities were heterogeneous and mainly located in areas of motor development and mental-perceptual impairments, such as autism, attention deficit hyperactivity disorders, trisomy 21, or cerebral movement disorders and chronic degenerative diseases such as multiple sclerosis. In addition, psychological diagnoses such as dissociative disorders and posttraumatic stress disorder were included but were uncommon among participants. A total of 115 patients were analyzed for the submodule EAT in the individual setting, 87 patients for the submodule EAT in the group setting, and 60 patients for the submodule hippotherapy (descriptive data and the correspondence of all items to the ICF classification can be found in Table A5 in Appendix B).

General Module (G)
For the general module, a conceptual three-dimensional model was developed from the results of a preliminary exploratory factor analysis (EFA) conducted on a different sample. Confirmatory factor analysis (CFA) was performed to determine whether the proposed three-factor structure of the EFA (Scale 1: Motor functioning, Scale 2: Mental functioning, Scale 3: Psychosocial functioning) fits the data. As Tables 1 and A1 (Appendix A) indicate, the three-factor structure of the hypothesized model represents an adequate fit for the data, which could be optimized by removing items and by allowing cross-loadings and error correlations. Factor loadings of the hypothesized model were acceptable according to the usual criteria, but global fit statistics needed modification [15] (see Table A1, Appendix A). Modification indices showed a residual correlation between Items G12 and G13 (e13 <-> e14, MI = 56) because both items thematize the concept of "trust" as a psychosocial aspect. As such, an error correlation was added and Model 3 was run. The global model fit of Model 3 improved, especially chi-squared (χ² = 1675.2). Modification indices indicated that the error covariance between Items G23 and G24 (e4 <-> e5, MI = 73) remained a strongly misspecified parameter.
Both items thematize different aspects of physical movement on horseback while in motion, so both items were retained, an error correlation was added, and Model 4 was run. The fit of Model 4 improved slightly, while modification indices showed high cross-loadings of Item G14 with the mental functioning (MI = 52) and psychosocial functioning scales (MI = 50). Item G14 was therefore removed to clearly distinguish the scales. Model 5 indicated an improved model fit, while modification indices showed a high cross-loading of Item G11 with the mental functioning scale (Item G11 <- Mental functioning, MI = 65); therefore, Item G11 was removed and the model was run again. Model 6 modification indices showed a residual correlation for Items G21 and G22 (e2 <-> e3, MI = 43), both of which thematize motor control functioning; therefore, an error correlation was added. Model 7 made further progress in the global model fit (CFI = 0.875), but still needed improvement for a suitable data fit. Modification indices showed a high residual correlation for Items G16 and G17 (e26 <-> e27, MI = 36), since both items thematize aspects of memory functions. An error correlation was added. Model 8 showed a noticeable improvement in model fit (CFI = 0.880), and modification indices showed a residual correlation of Item G2 with the motor functioning scale (e22 <- Motor functioning scale, MI = 28). Item G2 did not discriminate precisely between the scales, so Item G2 was removed.
Model 9 showed a cross-loading of Item G18 with the psychosocial functioning scale (e28 <- Psychosocial functioning scale), so Item G18 was removed to ensure a consistent fitting process. Model 10 showed a lower AIC value (1073.32), and modification indices showed a residual correlation for Items G21 and G22 (correlated error e3 <-> e4, MI = 26.8), so an error correlation was added again. Model 11 indicated an improved global model fit. Modification indices showed a residual correlation for Items G24 and G25 (correlated error e5 <-> e6, MI = 22.2), both of which operationalize aspects of physical rhythmizing, so an error correlation was added again. In Model 12, fit indices improved slightly, but modification indices showed a residual correlation for Items G23 and G25 (correlated error e4 <-> e6), both of which thematize aspects of balance and physical rhythmizing, so another error correlation was added. In Model 13, global model fit increased further (CFI = 0.910, which is acceptable). Modification indices showed a residual correlation of Items G15 and G16 (correlated error e25 <-> e26, MI = 21), which thematize mental functions of consideration and concentration, and thus an error correlation was added. The overall model fit of Model 14 is acceptable and represents the best-fitting model, including all parameters that are meaningful and relevant (χ² = 823.9, df = 264, CFI ≥ 0.90, RMSEA = 0.090, AIC = 945.93, CAIC = 1225.3, Table 2). Modification indices do not show remaining misspecified parameters. In total, five items were removed in the fitting process to ensure scale economy of the general EAT module. Scale 1 (Motor functions) contains 10 items, Scale 2 (Mental functions) contains 7 items, and Scale 3 (Psychosocial functions) contains 8 items. The reliabilities of the individual scales are in the very good range (α = 0.95; α = 0.95; α = 0.93), while the reliability of the total scale of the general module is in the excellent range (α = 0.96). Because reliability was high, the instrument could be shortened to make it quicker to administer. McDonald's omega (ML) showed similarly high scores for the total scale (ω = 0.96), the motor functioning scale (ω = 0.95), the mental functioning scale (ω = 0.95), and the psychosocial functioning scale (ω = 0.93), and incorporated the loadings and error correlations of the model [15].
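The RMSEA reported for the final general-module model can be related to its chi-square, degrees of freedom, and sample size. The sketch below uses the common point-estimate formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))); that this matches the software's internal computation is an assumption, and the function name `rmsea` is hypothetical. The input values are the ones reported above (χ² = 823.9, df = 264, N = 265).

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a model chi-square, its df, and sample size n."""
    # Misfit in excess of df, floored at zero, scaled by df and (n - 1)
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Reported values for the final general-module model
value = rmsea(chi2=823.9, df=264, n=265)
ratio = 823.9 / 264  # chi-square / df ratio of the same model

print(round(value, 3))
print(round(ratio, 2))
```

Under these assumptions the sketch reproduces the reported RMSEA of 0.090 to three decimals.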
For the measurement of inter-rater reliability, ten patients were assessed three times each by three independent raters. Intraclass correlation (ICC) is reported throughout the results section as average values, not individual measures. ICC for the total scale showed significant values over time, which remained robust for all raters (measurement time 1: ICC = 0.788, α = 0.791, p = 0.002; measurement time 2: ICC = 0.775, α = 0.786, p = 0.003; measurement time 3: ICC = 0.811, α = 0.826, p = 0.001). The non-parametric Friedman test with Bonferroni adjustment confirmed significant differences across the three measurement times; accordingly, patients improved significantly over the measurement times for all raters (Friedman test: χ²(2) = 7.800, p = 0.020, n = 10). Pairwise comparison shows significant changes between the first and the third measurement (p = 0.007). Between the first and the second measurement, as well as between the second and third measurement, the values show smaller changes (p = 0.044, p = 0.502). The ranks show steady progress from the first to the third measurement (first measurement: mean rank = 1.30; second measurement: mean rank = 2.20; third measurement: mean rank = 2.50). Descriptively, the largest difference lies between the ranks of the first and second measurements. With regard to inter-rater agreement, intraclass correlations did not show significant values, i.e., all raters measured a change, but this change was not measured in a consistent way (ICC = 0.161, p = 0.352).
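The Friedman statistic above can be recovered from the reported mean ranks alone. The sketch below uses the standard formula without tie correction, χ² = 12 / (n·k·(k + 1)) · Σ R_j² − 3·n·(k + 1), where R_j is the rank sum of measurement time j; the absence of a tie correction is an assumption, and the function name is hypothetical. For df = 2 the chi-square tail probability has the closed form exp(−χ²/2), so no statistics library is needed.

```python
import math

def friedman_from_mean_ranks(mean_ranks, n):
    """Friedman chi-square (no tie correction) and df from per-condition mean ranks."""
    k = len(mean_ranks)
    rank_sums = [n * r for r in mean_ranks]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    return chi2, k - 1

# Mean ranks reported for the three measurement times (n = 10 patients)
chi2, df = friedman_from_mean_ranks([1.30, 2.20, 2.50], n=10)

# Closed-form upper-tail probability, valid only for df = 2
p_value = math.exp(-chi2 / 2)
print(round(chi2, 3), df, round(p_value, 3))
```

With the reported mean ranks this gives χ²(2) = 7.8 and p ≈ 0.020, in line with the values stated in the text.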
With respect to change sensitivity, Tables 3 and 4 show a significant positive change over 15 weeks of therapy for the general module (total scale p < 0.001; subscales p < 0.001, p < 0.001, p = 0.007). An aggregated dataset based on repeated measurements over 15 weeks was analyzed using hierarchical linear mixed models (GLMM) to generate meaningful results. The results confirm that the general module sensitively depicts change in patient functioning over the course of therapy. The normal distribution test of the residuals confirms this effect for all scales.
To locate the therapeutic effects more precisely over the course of 15 weeks, an additional mixed linear model with "time" as a categorical variable was calculated. In this model, each measurement time point was compared to the remaining points, so as to show where the most significant changes were located. This model cannot indicate significant treatment effects over time as precisely as the multilevel analysis presented in Tables 3-5, so it is used only for additional information. The model located the main therapy effects in the first three therapy weeks (week 1: p = 0.001, week 2: p = 0.019, and week 3: p = 0.036). During the following therapy weeks, the p-values remained fairly small; in week eight, the p-value rose to a non-significant level (p = 0.180). Four-week time intervals confirm the result (Table 6).

EAT in the Individual (IS) and in the Group Setting (GS) Submodules
For the submodules IS and GS, confirmatory factor analyses (CFA) were performed to determine whether each of the proposed two-dimensional factor structures of the EFA (IS: Scale 1: Specific mental functioning, Scale 2: Specific motor functioning; GS: Scale 1: Interpersonal functioning, Scale 2: Intrapersonal functioning) fit the data. Since sample sizes were smaller than in the general module, CFA model fitting indices provided less precise guidance for fitting the data than in the general module. Table 7 indicates that the two-factor structure of the hypothesized model of EAT IS still needs improvement to fit the data. The factor loadings of the hypothesized model were acceptable according to the usual criteria of Brown, but the global fit statistics needed modification [15]. In the first step, an error correlation was added between Items HFPE5 and HFPE8 (HFPE5 <-> HFPE8, MI = 78), since both items represent different aspects of mental capacity. The factor loadings remained robust, while the fit indices improved slightly (see Model 2, Table A2, Appendix A). Modification indices gave no further indications for model fitting. Estimates showed a low standardized regression weight for Item HFPE4 (MI = 44), so it was removed. Model 3 shows improvement in the global fit statistics (CFI = 0.908), but due to the small sample size, modification indices did not give any further indications on how to modify the model, and therefore Model 3 represents the best-fit model for the data (χ² = 132.3, df = 42, CFI ≥ 0.90, RMSEA = 0.137, AIC = 180.33, CAIC = 270.21, see Table 8). The reliabilities of the scales of IS are in the good-to-excellent range (total scale α = 0.94; specific mental functioning scale α = 0.93; specific motor functioning scale α = 0.81), while the reliability of the total scale of the submodule is in the excellent range with α = 0.83. Due to the high reliability, a further shortening of the instrument is possible to make it quicker to administer. McDonald's omega confirmed the scores for the total scale (ω = 0.94) and for the specific mental functioning subscale (ω = 0.94). Omega for the specific motor functioning subscale could not be computed because of its small number of items (2 items).
The fixed effects of the linear mixed model for the total scale of the IS submodule and for both subscales show no significant change over time with respect to the measurement timepoint (total scale r = −0.001, p = 0.825; specific mental functioning scale r = −0.001, p = 0.586; specific motor functioning scale r = 0.010, p = 0.443). Tables 9 and 10 show descriptive changes over time for the mixed linear model of submodule IS, based on the aggregated dataset. Calculations were based on a small dataset; therefore, the descriptive changes may not have reached statistical significance (see Table 9).
As Table 11 shows, the hypothesized two-factor structure of the EAT GS submodule also needed improvement to fit the data. Factor loadings were also acceptable according to the usual criteria, but global fit statistics required modification [15]. In total, eight models were calculated to generate a model with an optimal fit for the data (see Table 12, final model). As a first step in the model fitting process, an error correlation between Items HFPG10 and HFPG11 was added, since both items thematize ICF aspects of interacting with others (item correlation e11 <-> e12, MI = 21). Fit indices improved slightly; factor loadings remained robust (see Model 2, Table A3, Appendix A). Model 3 showed an item correlation for Items HFPG9 and HFPG10, since both also include ICF aspects of elementary interpersonal activities such as appreciation and understanding (e10 <-> e11). Therefore, an error correlation was added, which was dropped later in the modeling process when Item HFPG9 was removed. Model 4 shows improvement in the global fit statistics (CFI > 0.900), but still needs improvement. An error correlation was added between Items HFPG7 and HFPG10 (item correlation e8 <-> e11, MI = 19), because both items include exercises involving the horse. In Model 5, MI showed item correlations for Items HFPG3 and HFPG5 (item correlation e3 <-> e6, MI = 12).
Both items include aspects of the understanding of social situations, so another error correlation was added. Model 6 showed further progress in the global model fit (CFI = 0.928), but MI indicated an item correlation for Items HFPG1 and HFPG12 (item correlation e1 <-> e5, MI = 9). The correlation indicates that the item content is related. Both items thematize different affective components of consciousness processes, so another error correlation was added (HFPG1 = expressing wishes and needs, HFPG12 = making decisions and finding solutions). Model 7 showed item correlations between Items HFPG7 and HFPG8, since both items include different aspects of interaction in relationships, so another error correlation was added (item correlation e8 <-> e9, MI = 9). Model 8 showed a satisfactory global model fit (CFI = 0.949, see Table A3, Appendix A), and factor loadings also remained robust. Modification indices showed a cross-loading of Item HFPG9 with the intrapersonal functioning scale (HFPG9 <- interpersonal functioning scale, MI = 5; HFPG9 <- intrapersonal functioning scale, MI = 5), which indicates that the item does not clearly discriminate between the scales. Therefore, Item HFPG9 was removed. The model fit of Model 8 improved, but Item HFPG13 did not discriminate clearly between the scales (cross-loading HFPG9 <- interpersonal functioning scale, MI = 6; intrapersonal functioning scale, MI = 6), so it was removed to ensure that the final model was the best-fitting model. CFI increased slightly above the ideal value in the final model, but other global fit statistics improved in line with the target, so the reduction was considered appropriate. The final model, Model 8, is shown in Table 12, and the change-sensitivity results are shown in Tables 13 and 14. These indicate that the submodule sensitively depicts change in functional ability over the course of therapy. A normal distribution test of the residuals confirms this for the total scale and both subscales.
Tables 15 and 16 show changes during therapy of the submodule GS over time, based on an aggregated dataset.

Hippotherapy Submodule (H)
For the submodule H, confirmatory factor analysis confirmed a two-dimensional factor structure based on previously calculated bivariate correlations on a different sample (Scale 1: Movement functioning, Scale 2: Motor control functioning). The initial sample was too small to fulfill the requirements for EFA; therefore, bivariate correlations were used to derive the conceptual model structure. Table 17 indicates that the proposed conceptual model needs improvement to fit the data. According to Brown, the factor loadings were acceptable, while the global fit statistics needed modification [15]. In the first step, modification indices showed a residual correlation for Items H2 and H5 (e7 <-> e8, MI = 32.8), since both items represent mobility of body structures. Because the two items represent different aspects of movement-related functions, an error correlation was added and both items were retained (Model 2). Factor loadings remained robust, and fit indices improved slightly (see Model 2, Table A4, Appendix A). Furthermore, modification indices showed a residual correlation between Items H19 and H21 (e1 <-> e2, MI = 20.8) because both items thematize different aspects of the specific function "walking" as a movement-related aspect. As a result, an error correlation was added and Model 4 was run. Modification indices of Model 4 indicate high cross-loadings of Item H22 with the movement functioning (MI = 6.7) and motor control functioning scales (MI = 6.8). Item H22 was therefore removed to clearly distinguish the scales. In Model 5, global model fit improved, especially chi-squared (χ² = 240.772). Despite the improved fit of Model 5, modification indices showed a residual correlation between Items H11 and H16 (e12 <-> e17, MI = 11.1), since both items thematize aspects of muscle activation; therefore, an error correlation was added, and the model was run again.
Model 6 modification indices showed a residual correlation between Items H8 and H18 (e6 <-> e14, MI = 9.6), both of which thematize the range of movement functioning; therefore, an error correlation was added. Model 7 showed a noticeable improvement in model fit (CFI = 0.906), but still needed improvement for a suitable data fit. Modification indices showed a high residual correlation for Items H7 and H19 (e2 <-> e11, MI = 8.5), since both items thematize functions of motion sequences responsible for movement patterns, and as such an error correlation was added. Model 8 made further progress in the global model fit (CFI = 0.913); modification indices showed a residual correlation between Items H10 and H23 (e4 <-> e16, MI = 7.9), and therefore an error correlation was added, and the model was run again. The overall model fit of Model 9 was acceptable and represents the best-fitting model for the data, including all parameters that are meaningful and relevant (χ² = 196.049, df = 96, CFI > 0.90, RMSEA = 0.133). The RMSEA could still be improved; however, the small sample size (N = 60) was the limiting factor for further increasing the global model fit: it yielded only a few remaining modification indices, and these would not have improved the global fit of Model 9 significantly. Therefore, to prevent overfitting of the model to this dataset, Model 9 was retained as the best-fitting model (see Table 18). The reliabilities of the submodule H scales are in the excellent range (total scale α = 0.97; movement functioning scale α = 0.95; motor control functioning scale α = 0.94). Given this high reliability, a further reduction of the instrument is possible to make its application more time-efficient. McDonald's omega (ML), which also takes the factor loadings and error correlations of the model into account, confirmed these scores for the total scale (ω = 0.97) and the subscales (movement functioning scale ω = 0.96; motor control functioning scale ω = 0.95).
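As a plausibility check on the fit statistics reported for Model 9, the RMSEA can be recomputed from the chi-squared value, the degrees of freedom, and the sample size. The sketch below assumes the common point-estimate formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))); the function name `rmsea` is illustrative, and the software used in the study may apply a slightly different convention (for example, N instead of N − 1).

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of the RMSEA, assuming the (N - 1) convention."""
    # The numerator is truncated at zero so that models with chi2 < df yield 0.
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Values reported in the text for Model 9 of submodule H.
model9_rmsea = rmsea(chi2=196.049, df=96, n=60)
print(round(model9_rmsea, 3))
```

Under this assumption, the recomputed value agrees with the reported RMSEA to three decimal places, and the formula makes visible why a small N inflates the RMSEA for a given chi-squared excess.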
The dataset for retest reliability was too small to compute meaningful results for submodule H (N = 2). ICC showed significant values for time effects from the first to the third measurement (Tables 19 and 20). A normal distribution test of the residuals confirms the results for the total scale and subscales. Random effects for the total scale and the motor control functioning scale could not generate meaningful results due to the small sample size. Tables 21 and 22 show changes during therapy of submodule H over time, calculated using linear mixed models based on aggregated data. The means in Table 21 show that changes over time result in negative values (total scale start-end: M = −0.02; motor control functioning scale start-end: M = −0.13).
The developed assessment tool for the measurement of functioning in EAT contains 63 items in total (the complete tool can be found in Appendix B). The general module contains 25 items. Five items were removed to arrive at the most economical model. The IS and GS submodules each contain 11 items. In the submodule IS, one item was dropped in the model fitting process. In the submodule GS, two items were dropped. Submodule H contains 16 items. Seven items of the submodule H were removed because of content-related misfits, in close consultation with therapist reviewers, before the model fit analyses were run. The submodule H contained more items than the other submodules because no EFA-based indications for modification were available in the preliminary stage of this study. One further item was removed in the model fitting process to improve the global model fit statistics of submodule H.

Discussion
In this study, the psychometric properties of the ICF-based digital assessment tool were analyzed via simultaneous confirmatory factor analyses (CFA) and reliability and sensitivity tests. The results of the general EAT module show that the three-factor model structure of the final model (Model 14) represents a suitable measure to assess the rehabilitation impacts of EAT. In further statistical research, short scales of the general module could be developed to reduce the number of items with error correlations. Alternatively, a more complex equivalent model (bifactor model) could be developed for the error-correlated items to ensure a more differentiated item structure. Regarding internal validity and plausibility, the simultaneous CFA shows that the item bank of the general EAT module discriminates according to the proposed three-factor structure. Convergent validity is given because of high factor loadings on the proposed factors. Divergent validity is given by the correlations between the factors, which are significantly different from 1.0; therefore, the factors measure clearly distinguishable aspects of patient functioning in EAT interventions.
Multi-level modelling confirmed the change sensitivity of the module. Retest reliability proved test stability over time. The rater agreement of the therapists was generally in need of improvement for the main module and all submodules. The rater variance of the therapists in the functional measurement can be explained on the one hand by the general error-proneness of rating scales due to typical judgment errors (e.g., strict-mild errors or halo effect), and on the other hand by the absence of a consistent recurring test situation in therapeutic practice to constantly survey specific performance characteristics [16]. Furthermore, therapists did not know the assessed patients and their medical history. A study by Cronley, Marchant, and Caldarella showed for teachers that their assessments were highly reliable in estimating the frequency of occurrence of behavior problems in pupils when the teachers knew how these behaviors presented themselves [16,17]. Accordingly, the assessing therapists should be intensively involved in the individual therapy to conduct assessments that are more reliable. However, this increases the risk of detection bias, in which a therapist's own intervention may be assessed as more effective because the therapist is not blinded and is convinced of the effectiveness of their intervention.
Additionally, for the assessment of hippotherapy, seasonal influences were a cause of variance, since participants wore jackets in winter, which made it particularly difficult to assess their physical functioning status accurately. The 10-step Likert scale represents another cause of variance, given that too many response categories can negatively affect the measurement properties of the items, since the high degree of differentiation can overburden assessors [13]. Nevertheless, besides making change sensitivity measurable, the 10-step scale was the most appropriate format for the developed assessment tool, since a more differentiated response format provides more opportunities to make distinctions between patients [13]. A promising approach for future research would be to test inter-rater reliability using generalizability theory [16,18]. According to this theory, the causes of measurement errors can be estimated and their conditions simulated, so that relative and absolute comparisons of progress diagnostics can be examined precisely [16,19]. Thus, according to the numerical invariance concept, it would not be problematic that therapeutic professionals do not assess identically as long as they assess trends in the same way, which is more realistic in behavioral observations over time [16,18].
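The distinction between relative and absolute comparisons can be illustrated with a minimal one-facet generalizability study (patients crossed with raters, one rating per cell). The sketch below is not the analysis of this study: the function name `g_study` and the toy ratings are hypothetical, and the variance components are estimated with the standard expected-mean-squares approach for a two-way crossed design, assuming a single rater in the decision study. In the toy data, the second rater scores every patient exactly two points higher than the first, so both raters rank patients identically but never agree in absolute terms.

```python
def g_study(scores):
    """Relative (G) and absolute (Phi) coefficients for a patients x raters design.

    scores[p][r] is the rating of patient p by rater r (fully crossed,
    one rating per cell); coefficients refer to a single rater.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

    # Mean squares of the two-way crossed design without replication.
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - p_means[p] - r_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Variance components; negative estimates are truncated at zero.
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    var_res = ms_res

    g_rel = var_p / (var_p + var_res)          # ignores rater severity
    phi = var_p / (var_p + var_r + var_res)    # penalizes rater severity
    return g_rel, phi


# Hypothetical ratings: rater 2 is consistently two points more lenient.
toy_scores = [[3, 5], [5, 7], [6, 8], [8, 10]]
g_rel, phi = g_study(toy_scores)
print(g_rel, phi)
```

In this constructed case the relative coefficient is perfect while the absolute coefficient is clearly lower, which mirrors the argument that a constant difference in rater severity is unproblematic as long as trends are assessed in the same way.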
Negative and non-significant values in the submodule analyses of therapy progress over time can also be explained by the functional capacities of the therapies' target groups. Patients with chronic degenerative diseases or multiple disabilities may face decreases in mental and motor functioning over time (for example, patients with multiple sclerosis). In these contexts, therapies succeed by delaying the degenerative progression of the diseases. An improvement in functioning cannot always be achieved in therapy and therefore cannot be detected by assessment tools. Assessed patients in the ICC for submodule H were diagnosed with Huntington's chorea and Angelman's syndrome, where maintaining the current functional state can already represent therapeutic success for EAT. A study by Goudy et al. (2019) came to similar conclusions: positive effects of a hippotherapy simulator program for older adults with Parkinson's Syndrome were found with regard to balance and cognitive impairment [20]. The authors argue that these types of interventions can slow the natural progression of the disease by improving symptoms such as balance and posture [20]. Fizkova et al. (2013) point out that hippotherapy is effective in suppressing pathological stereotypy of muscle groups and promotes postural reflex mechanisms in children with cerebral palsy [21]. Vermöhlen et al. (2017) showed positive effects of hippotherapy, alongside treatment as usual, in the improvement of balance, fatigue, spasticity, and quality of life in patients with multiple sclerosis [22]. EAT interventions aim to influence components of balance and motor control, which then promote motor problem-solving skills that improve ambulation, sitting, and other life-related activities [23]. Furthermore, positive effects with regard to post-traumatic stress disorder (PTSD) and psychosocial functioning have been found by Johnson et al. (2018) and Gabriels et al. (2015) [24,25].
The study results discussed thus far show that these therapies are effective, but that instruments may not be accurately and adequately targeting their holistic therapy approaches. Researchers furthermore emphasize that the field of EAT would benefit from future research testing different assessment tools and intervention protocols to precisely assess therapy effects and evaluate therapeutic outcomes [23]. In this respect, the general module of the assessment tool satisfactorily fulfills psychometric requirements and can directly be applied in therapeutic practice.
Overall, the developed digital assessment tool represents a standardized ICF application suitable for targeted assessment of functioning in EAT. As previous studies indicate, the ICF defines a promising framework for identifying the beneficial aspects of holistic EAT interventions on human functioning, combined with the possibility of quantifying therapy effects. Therefore, the assessment tool developed and validated herein contributes to the joint efforts of the international scientific community to increase evidence of the effects of EAT in international healthcare through systematic assessment strategies [5,6,7]. The assessment tool can sensitively measure changes in therapy progress and also precisely depict the effect factors of the therapy.
For the submodules, the small sample sizes for performing CFA and reliability and change sensitivity tests were a limiting factor. The final CFA models of the submodules IS and GS still show deviating values in the global fit statistics, while modification indices did not provide further options to increase model fit. Test stability (retest and inter-rater reliability) could not be estimated in a target-oriented manner to depict significant effects. A cause of the small sample sizes was patient acquisition problems based on restrictions due to the COVID-19 pandemic in Germany during the data collection period from August 2020-2021. Further research should determine reliability scores based on larger sample sizes. With respect to reliability, the possibility of distortive effects based on the assessments carried out by EAT therapists also cannot be ruled out. In the future, the developed ICF-based assessment tool could be trialed in a randomized controlled study with a group of patients homogeneous with respect to age, gender, and disability or chronic disease. Thus, the effects of the EAT interventions and conclusions about reliability could be assessed in a more precise and differentiated manner.
Regarding process orientation, the developed assessment tool builds the basis to monitor rehabilitation impacts and guide the therapy progression of EAT within the WHO and ICF frameworks, which represent the international standard of global health systems. As such, it refers to the innovative resource-orientated functioning approach, which operationalizes a person's "lived health" in addition to the biological health status to perform activities and participate in social situations. In the future, this assessment tool could be used in close coordination with physicians and multi-professional rehabilitation teams for a targeted overall rehabilitation process and electronic data interchange within institutions [26]. Furthermore, for goal setting with clients, the assessment tool could be a promising instrument to align therapy goals related to desired life skills, as other studies indicate [26]. Past research has shown that the implementation of the ICF in the rehabilitation team had a positive impact that manifested in a more systematic work approach, greater interdisciplinary cooperation, and a participatory orientation in the clinical setting [26]. In addition, for funding agencies and stakeholders involved in the rehabilitation process, transparent planning of rehabilitative processes and services via the ICF could provide a better insight into processes and their financial aspects, which may lead to increased cost-reimbursements from public financiers and insurance companies [23,26]. For the field of EAT, the prospective usage of the developed assessment tool builds a basis for increased comparison and joint collection of EAT findings across languages and countries [27]. It could encourage the usage of a common international terminology based on the WHO language and thereby help incorporate EAT into global health systems. 
Therapy contents and outcomes could be assessed more precisely and linked back into the ICF classification system to help gather scientific evidence about EAT effect factors through international collaboration.

Conclusions
The validated assessment tool provides a nuanced framework for evaluating the therapy outcomes and effect factors of EAT interventions in the common language of the ICF. Through its connection to concrete ICF categories, EAT effects could prospectively be evaluated in a more standardized manner in multicultural and multi-professional health care teams. Therapy effects could also be compared to other functioning-related intervention outcomes in international health systems. This could enable economic cost-effectiveness evaluations and thereby support targeted outcome measurement in the context of formal therapy evaluations based on functioning capabilities. Furthermore, common international efforts to create scientific databases with large-scale data from multicenter studies and international research collaborations could be implemented in a targeted manner through unified study variables and specific therapy insights based on ICF parameters.

Patents
A legally protected trademark was declared in 2022 by the Research Institute for Inclusion through Physical Activity and Sport. Further information can be obtained from the authors.
Table A2. Goodness-of-Fit statistics for Submodule IS.
Table A3. Goodness-of-Fit statistics for Submodule GS.
Note: Items were directly translated from the validated German version of the assessment tool with minor corrections.