Measuring Work-Related Functioning Using the Work Rehabilitation Questionnaire (WORQ)

The assessment of work-related functioning is a key process in vocational rehabilitation to identify specific domains of disability that can be considered within return to work strategies. The Work Rehabilitation Questionnaire (WORQ) was developed to evaluate work-related functioning based on the International Classification of Functioning, Disability, and Health (ICF) framework and is available in different languages. The aim of this study was to assess the French version of the WORQ using item response theory to further validate the scale. Rasch analysis of WORQ and the WORQ-BRIEF (a brief version of the WORQ) was performed using a calibration sample of 221 persons with musculoskeletal injuries. A four-testlet solution indicated the unidimensionality of WORQ, with no differential item functioning for age, education, physical job demands, and injury severity. Reliability was 0.969 and 0.918 for WORQ and WORQ-BRIEF, respectively. The minimal detectable change was calculated to be 4.2% of its operational range for WORQ and 8.5% for WORQ-BRIEF. Consequently, the French version of WORQ can be considered a good measure of work-related functioning in musculoskeletal conditions. WORQ can be used in rehabilitation practice to comprehensively identify the disability and guide clinical decision making and intervention planning. Further studies are needed to evaluate the psychometric properties of WORQ in other health conditions.


Introduction
Work participation is considered a major indicator of social participation for persons of working age [1,2]. In general, and particularly for those with disabilities, work is associated with economic self-sufficiency and enhances psychological well-being and self-worth, personal identity, and quality of life [3]. Work participation also assures social integration [4]. From a societal perspective, particularly with an aging workforce and, at the same time, a demanding and rapidly changing work environment that requires high work performance and flexibility, assuring employment for persons with disabilities becomes increasingly important but also challenging [5]. Associated increases in sickness absence and a decline in work productivity, along with an increase in long-term disability allowances, further add to social security costs [6,7]. As a countermeasure, in the last decades, national and private insurances along with governmental social security systems have increased their efforts to provide appropriate management of disability to support timely return to work and prevent workers with disability from early retirement [8][9][10]. the reliability, and the extent of invariance of scales, which also allows for transformation of the raw scores from ordinal data into interval data or scores [33].
Hence, the overall objective of this study was to examine the scale validity of the French version of the WORQ and the WORQ-BRIEF using Rasch analysis. The specific aims of this psychometric study were (1) to determine whether the items of WORQ-French and WORQ-French-BRIEF measure a common overarching concept (work-related functioning) and can therefore be summarized in total sum-scores, (2) to determine whether the four clinical domains of WORQ (cognition, emotion, dexterity, and mobility) are reliable and valid and thus offer greater granularity of interpretation, and (3) to provide interval transformation tables for the total scores and the domains of WORQ and of the WORQ-BRIEF.

Subjects, Setting, and Eligibility
We collected the data from 221 participants in a longitudinal observational study in a single rehabilitation teaching hospital in Switzerland. Participants with musculoskeletal injuries (MSK) undergoing in-patient rehabilitation were recruited from French-speaking Cantons of Switzerland. The study (CCVEM 005/15) was approved by the ethical committee of the canton Valais (ICHV) and conducted according to the principles outlined in the Declaration of Helsinki.
The full version of WORQ-French was collected at five time points from February 2015 to December 2016: admission to the vocational rehabilitation (VR) center (T0), seven days after admission (T1), at discharge (T2), three months post-discharge (T3), and six months post-discharge (T4).
Participants were recruited consecutively using convenience sampling, i.e., all patients were screened against the inclusion criteria when they entered the department of vocational rehabilitation. To be eligible for this study, the participants (1) had to be receiving at least one vocational rehabilitation intervention irrespective of their type of MSK, (2) had to be able to speak and read French fluently, (3) had to be between 18 and 65 years old, (4) had to be employed or looking for a new job, and (5) had to sign an informed consent form prior to participating in the study.

Outcome Measure: WORQ French Version and WORQ-French-BRIEF
The full version of WORQ and the BRIEF version have two parts, where part one is identical for both versions. This part contains 17 sociodemographic and work-related questions. The work-related questions include profession, work status, work demands, VR interventions, and the amount of support received by family, employer, and labor and employment services. Part two of the full version consists of 40 items on functioning (18 body functions and 22 activities and participation items), whereas part two of the BRIEF version consists of 13 items (6 body functions and 7 activities and participation items). WORQ is freely available from www.myworq.org [25]. A sum-score of the items was calculated.
In addition, four single domains (a domain is any meaningful aggregation of ICF categories) were identified in the full version of WORQ based on a previous factorial analysis [34,35]. These domains are: Emotion, Cognition, Dexterity, and Mobility (Table 1). These domains are intended to identify a specific underlying pattern of functioning, which we expect will aid clinical decision making and intervention allocation in VR. Ten items are not assigned to any domain: four items assess sensory functions and pain, two items relate to energy and sleep, and the remaining four items relate to skin problems, transport, relationships, and covering costs of living. These items were considered relevant in the development process of WORQ to complement the picture of work-related functioning and to take into account the different needs of the patients.

Hypothesis
We hypothesized that the full version of WORQ and WORQ-BRIEF assess a unidimensional concept of "work-related functioning", and that a total sum-score could therefore be calculated for both versions. We also hypothesized that each of the four domains would be confirmed as unidimensional in the same manner.

Statistical Testing
Descriptive statistics were used to describe the sample characteristics including age, gender, family status, education, injury location, and physical job demands (rated qualitatively by a health professional as high, moderate, or low physical job demand). Descriptive analyses were performed with the software package IBM SPSS Statistics for Windows, Version 24.0 (Armonk, NY, USA) [36].
The Rasch model was used to test the scale validity and to create interval measurement scales for WORQ, WORQ-BRIEF, and the domains [37]. Using the partial-credit Rasch model [37] with the RUMM 2030 software (RUMM Laboratory: Perth, WA, Australia) [38], a good fit of the data to the Rasch model was sought, confirming unidimensionality and making it possible to transform ordinal scale observations into interval scale measures.

Dataset
Initially, we constructed a dataset called "calibration sample" that contained the 40 questions from part two of the full version of WORQ. No person was entered into the calibration sample with more than one time point (T0-T4). This approach corrects for potential time effects, and applying a "calibration sample" legitimizes the use of a questionnaire in longitudinal studies [39]. The calibration sample was used to analyze WORQ, WORQ-BRIEF, and the clinical subscores.
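The construction of such a calibration sample can be sketched as follows. This is a minimal illustration only: the source does not state how the time point was chosen for each person, so the random selection, the function name `build_calibration_sample`, and the record keys `person_id` and `time_point` are all assumptions made for this example.

```python
import random


def build_calibration_sample(records, seed=0):
    """Keep exactly one (person, time point) record per person.

    `records` is a list of dicts with the hypothetical keys 'person_id'
    and 'time_point'. Choosing the time point at random is an assumption;
    the source only states that no person enters the sample twice.
    """
    rng = random.Random(seed)
    by_person = {}
    for rec in records:
        by_person.setdefault(rec["person_id"], []).append(rec)
    # One record per person, so repeated measurements never co-occur.
    return [rng.choice(recs) for _, recs in sorted(by_person.items())]


# Hypothetical longitudinal records (T0-T4 as defined in the study).
records = [
    {"person_id": 1, "time_point": "T0"},
    {"person_id": 1, "time_point": "T2"},
    {"person_id": 2, "time_point": "T1"},
    {"person_id": 3, "time_point": "T3"},
    {"person_id": 3, "time_point": "T4"},
]
sample = build_calibration_sample(records)
```

Each person contributes a single questionnaire to `sample`, which removes the dependency between repeated measurements of the same person.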

The Rasch Model
By applying the data to the Rasch partial credit model, four principal requirements were tested, proof of which would deliver a valid Rasch-based scale with interval-scale properties. These requirements are monotonicity, local item independence, unidimensionality, and invariance across groups, both groups with different levels of functioning and groups defined by contextual factors such as gender or health [40][41][42]. These requirements were tested by fitting the data to the model, a process referred to as Rasch analysis and described in detail elsewhere [28]. Ideal values for fit are given at the bottom of the fit table. If all requirements of the Rasch model are satisfied, the analysis produces an interval-scale latent estimate upon which both item difficulties and person abilities (in this case, the persons' level of functioning) are placed. In addition, the reliability of WORQ was examined with the person separation index (PSI), and the targeting of the scale (e.g., floor and ceiling effects) was examined.
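To make the partial credit model more concrete, the following minimal sketch computes the response-category probabilities of a single polytomous item in the standard partial credit formulation. It is not the estimation routine of RUMM 2030; the function names and the threshold values are illustrative assumptions and are not taken from the WORQ calibration.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities of one item under the partial credit model.

    theta      -- person location in logits
    thresholds -- item threshold locations in logits; an item with
                  m thresholds has m + 1 response categories (0..m)
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(v - top) for v in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, thresholds):
    """Model-expected item score for a person located at theta."""
    probs = pcm_probabilities(theta, thresholds)
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical item with three ordered thresholds (four categories).
example_thresholds = [-1.0, 0.0, 1.0]
scores = [expected_score(t, example_thresholds) for t in (-2.0, 0.0, 2.0)]
```

A WORQ item rated on the 0-10 numeric scale would have ten thresholds. The expected score rises with the person location, which is the monotonicity requirement named above.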
Certain aspects of Rasch analysis have recently been updated, adjusting the methodology described above. It was shown that the test for the (conditional) independence of items using the standardized residuals should set the threshold for a breach of that independence at 0.2 above the average residual correlation. Where local item dependencies are observed, the dependent items can be aggregated to enter the analysis as testlets (if based upon existing domains) or as "super-items" otherwise [43][44][45]. When testlets are used, the RUMM 2030 software enacts a bi-factor equivalent solution and reports the proportion of common variance retained in the data in order to provide a unidimensional latent estimate [46]. This proportion should be 0.9 or above if the scale is to be considered unidimensional [47].
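The relative threshold for local item dependency can be illustrated with a short sketch that flags item pairs whose residual correlation exceeds the average residual correlation by more than 0.2. The residual values below are invented for illustration, and the function names are assumptions; in practice the standardized residuals would come from the fitted Rasch model.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_lid(residuals, margin=0.2):
    """Return the cutoff and the item pairs that breach it.

    residuals -- dict mapping item id to standardized residuals
                 (one value per person, same person order for all items)
    margin    -- distance above the average residual correlation
    """
    pairs = {
        (i, j): pearson(residuals[i], residuals[j])
        for i, j in combinations(sorted(residuals), 2)
    }
    cutoff = sum(pairs.values()) / len(pairs) + margin
    return cutoff, [(pair, r) for pair, r in pairs.items() if r > cutoff]


# Invented residuals for four items and six persons.
residuals = {
    "Q21": [1.0, -0.5, 0.8, -1.2, 0.3, -0.4],
    "Q22": [0.9, -0.6, 0.7, -1.0, 0.2, -0.2],
    "Q30": [-0.3, 0.8, -1.1, 0.4, 0.9, -0.7],
    "Q05": [0.2, -0.9, 0.1, 0.6, -1.0, 1.0],
}
cutoff, flagged = flag_lid(residuals)
```

In this toy data set only the pair Q21 and Q22 is flagged, so these two items would be candidates for aggregation into a testlet or super-item.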
Given adequate fit to the model, transformation tables convert the ordinal sum-scores to interval scores. To transform the raw scores to an interval-level scale of the same range as the original raw scores, the metric (in logits) derived from the Rasch analysis was linearly transformed.
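The linear transformation described above can be sketched as follows. The logit values are hypothetical and merely stand in for the estimates that a Rasch analysis would assign to each possible raw score; the function name is likewise an assumption for this example.

```python
def logits_to_raw_range(logit_estimates, raw_min, raw_max):
    """Linearly rescale logit estimates onto the raw-score range.

    logit_estimates -- logit estimate for every possible raw score,
                       ordered from the lowest to the highest raw score
    raw_min/raw_max -- bounds of the original raw-score range
    """
    low, high = min(logit_estimates), max(logit_estimates)
    scale = (raw_max - raw_min) / (high - low)
    return [raw_min + (value - low) * scale for value in logit_estimates]


# Hypothetical logit estimates for raw scores 0, 1, 2, 3, 4, 5.
logits = [-4.0, -2.5, -1.0, 0.0, 1.5, 4.0]
interval_scores = logits_to_raw_range(logits, raw_min=0, raw_max=5)

# Transformation table: raw score -> interval score on the same range.
table = dict(enumerate(interval_scores))
```

The end points of the raw range are preserved, while the steps between neighboring raw scores become unequal, reflecting the non-linear relation between ordinal sum-scores and the interval metric.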

Standard Error of Measurement
Based on the transformed Rasch scores, the standard error of measurement (SEM) was calculated; it represents the variability in the scores of a test administered to a group that is caused by measurement error. We calculated SEM using the Rasch reliability estimate (PSI) as the reliability coefficient $R_x$: $\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - R_x}$ [48]. The minimal detectable change (MDC), meaning the minimum amount of change in a patient's score that is not the result of measurement error, was calculated at the 95% confidence level as $\mathrm{MDC}_{95} = 1.96 \times \mathrm{SEM} \times \sqrt{2}$ [49].
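The two formulas above translate directly into code. In the sketch below, the standard deviation of 10 points on a 0-100 scale is a hypothetical value chosen for illustration, whereas 0.969 is the reliability reported for the full WORQ; the function names are assumptions.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - Rx)."""
    return sd * math.sqrt(1.0 - reliability)


def mdc95(sd, reliability):
    """Minimal detectable change at 95%: 1.96 * SEM * sqrt(2)."""
    return 1.96 * sem(sd, reliability) * math.sqrt(2.0)


def mdc95_percent_of_range(sd, reliability, scale_min, scale_max):
    """MDC95 expressed as a percentage of the operational scale range."""
    return 100.0 * mdc95(sd, reliability) / (scale_max - scale_min)


# Hypothetical SD of 10 on a 0-100 interval scale, PSI of 0.969.
example_sem = sem(10.0, 0.969)
example_mdc = mdc95(10.0, 0.969)
example_pct = mdc95_percent_of_range(10.0, 0.969, 0.0, 100.0)
```

Because the reliability enters under the square root, a higher PSI lowers both SEM and MDC95 for the same score dispersion.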

Results
Two hundred and twenty-one patients participated in the study. The majority of participants (89%) were male with a mean age of 43 years (SD 10.9); 52% of the participants had finished high school, 24% primary school or less, and 24% reported having a college or university degree including postgraduate education. Of the patients, 30% reported light physical work, 47% reported moderate, and 23% reported severe physical work. The majority of participants suffered from an injury to the extremities (39% upper extremity, 37% lower extremity). Only 11% of the participants suffered from a back problem and 13% from poly-trauma (Table 2).
The final calibration sample contained data from all five administration time points: 50 datasets from T0, 26 from T1, 49 from T2, 46 from T3, and 50 from T4. A first Rasch analysis was performed for the full version of WORQ with all 40 items. Substantial local item dependency (LID) was found, with 107 pairs of items showing residual item correlations >0.2. The LID led to a considerable deviation from the model, and multidimensionality was detected.
Threshold disordering was also evident, predominantly in items with LID. A subset of items was chosen and fit to the model, where disordered thresholds mostly disappeared irrespective of LID in the full item set. Therefore, it was assumed that LID was causing the distortion of threshold ordering and that recoding the categories would only confuse the users. No further action was taken other than to proceed with dealing with the LID.
In a first step, two testlets were created to account for two pairs of interdependent items. The first pair was Q21 " . . . lifting and carrying objects weighing up to 5 kg" and Q22 " . . . lifting and carrying objects weighing more than 5 kg", and the second pair was Q30 " . . . walking a short distance" and Q31 " . . . walking a long distance." In a next step, four testlets were created, including the two testlets from step one and all remaining 36 items. Thus, all four testlets scored 0-100. Then, the data were reanalyzed. The four-testlet solution satisfied the Rasch model with strict unidimensionality (3.8% of significant t-tests) (Table 2). No further LID was present, and no differential item functioning (DIF) by age, level of education, physical job demands, injury severity, or time point of data collection was detected. DIF was not analyzed for sex because of the disproportionate data (only 11% of the participants were female).
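The aggregation of dependent items into testlets amounts to summing their scores into one polytomous super-item. The sketch below shows this for the two item pairs named above. The testlet labels and the responses are hypothetical, and the allocation of the remaining 36 items to the four final testlets is not reproduced here because the text does not specify it.

```python
def score_testlets(responses, testlets):
    """Sum item scores into testlet scores.

    responses -- dict mapping item id to the item score (0-10)
    testlets  -- dict mapping testlet name to a list of item ids
    """
    return {
        name: sum(responses[item] for item in items)
        for name, items in testlets.items()
    }


# The two item pairs from step one; the labels are hypothetical.
testlets = {
    "lifting": ["Q21", "Q22"],
    "walking": ["Q30", "Q31"],
}

# Hypothetical responses of one person on the 0-10 rating scale.
responses = {"Q21": 3, "Q22": 7, "Q30": 1, "Q31": 5}
testlet_scores = score_testlets(responses, testlets)
```

A testlet of two items rated 0-10 ranges from 0 to 20, and a testlet of ten such items ranges from 0 to 100, which matches the score range reported for the four-testlet solution.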
Unidimensionality and fit to the Rasch model based on testlet solutions were also validated for the total sum-score of WORQ-BRIEF and for all four clinical subscores (Table 3). DIF for age, physical job demands, and education remained present only in the Mobility subscore. This finding may have been due to having only four items in the score. With a PSI above 0.9 for the full version of WORQ as well as WORQ-BRIEF and a PSI above 0.85 for the four clinical subscores, all scales could be used for individual measurement. The fact that the A-value stayed close to one showed that nearly all the variance in the data was retained. Detailed information on the testlet solutions can be found in the Supplementary Digital Material: Tables S1 and S2.
The targeting of items to the population could be considered good to very good, with a mean person location of −0.506 (SD 0.613) logits for the WORQ scale and −0.286 (SD 0.643) logits for the WORQ-BRIEF scale, where both scales had a mean item location of zero logits. Hence, the mean person location below zero indicated that the participants in this study had slightly fewer problems than a perfectly targeted population would have. The slight deviation could be expected because WORQ, designed as a generic questionnaire, contains items that are relevant to cover the overall spectrum of work-related functioning in diverse populations but are not necessarily specific to a single health condition. See Figure 1.
Moreover, no ceiling or floor effect was detected in any of the scales (Table 4). As all tested scores fulfilled the requirements of the Rasch model, transformation tables with the raw scores and the corresponding interval scores were created. They can be accessed in the Supplementary Digital Material: Table S3.
The minimal detectable change (MDC) was calculated based on the interval scale-metric (Supplementary Digital Material: Table S3). With an MDC95 of 4.23% or 0.42 points on the 0-10 scale, the error attributed to WORQ was minimal compared to other patient reported questionnaires [50]. The small MDC might be attributable to the 40 items, of which a substantial number scored zero in our population. WORQ-BRIEF, on the other hand, had an MDC95 of 8.47% or 0.85 points on the 0-10 scale, which is comparable to most other questionnaires. The subscores, with MDCs between 13% and 20%, indicated that a substantial change in the score is needed to reflect a real difference.

Discussion
This study examined the scale reliability and the validity of WORQ, WORQ-BRIEF, and four WORQ subscores regarding fit to the Rasch model. The findings support that WORQ is assessing an overarching concept, which is work-related functioning. All tested scales showed very good to acceptable measurement properties in terms of DIF and unidimensionality. A reliability (PSI) of more than 0.86 for all the scales confirms that we can rely on the fit characteristics and supports the use of WORQ not only for clinical decision making but also for measurement of change in work-related functioning [51].
From the beginning, WORQ aimed to elaborate the complex problems in work-related functioning of persons in VR, which led to selecting WORQ items from the ICF core set for VR based on statistics, literature search, and clinical expertise. When applying the Rasch model with a testlet solution, after LID was removed, the data fully satisfied model requirements, including invariance of contextual factors [52]. As such, most of the variance in WORQ was still retained, strongly supporting the analysis approach, which crucially avoided the need to first rescore the 0-10 scales of items with disordered thresholds that were most likely artificial, being caused by LID [53]. Retaining the original 0-10 numeric rating scale scoring for all items also enhances the intuitive understandability of WORQ in clinical practice and rehabilitation [54].
Further, the additional information gained from the four clinical subscores may allow clinicians and researchers to better understand underlying functioning patterns and help to identify subgroups with specific areas of needs within a sample [55].
The consistent response patterns across different levels of injury severity, education, physical job demands, or age for all WORQ scores demonstrate that WORQ is not prone to DIF for the tested factors in our mixed musculoskeletal (MSK) group. The finding supports measuring and comparing similar groups of patients based on work-related functioning measured with WORQ. Nevertheless, evaluating the invariance of WORQ in persons with other than primarily physical impairments, for example, those with mental health conditions, will need to be undertaken in relevant populations.
We confirmed the scale reliability of WORQ and WORQ-BRIEF, which indicates that clinicians can utilize the WORQ items [ordinal (raw) values] in a total sum-score to monitor progress in a single person; when change scores are calculated from raw values, however, only the direction of change is obtained, not its magnitude. Transforming ordinal values to interval-based latent estimates (as shown in the Supplementary Digital Material: Table S3) is a straightforward and fast procedure. The direct transformation of the original (WORQ raw score: 0-400) scale to an interval scale of the same range (WORQ interval: 0-400) enables a direct comparison of the effect of the transformation. Our results thereby confirm the findings of Forrest and Andersen, who noted as early as 1986 [56] that the values in cumulative patient reported questionnaires are tied to an artificial range (e.g., WORQ summary score 0-400) that causes distortion of intervals towards the margins. Transforming ordinal scores to interval scores may not only be relevant when calculating change scores but may be even more so when comparing changes in functioning between individuals or groups of individuals in different settings. A further benefit of interval transformed scores is that researchers who want to include data on work-related functioning in their analyses are now able to use parametric statistics (given appropriate distributions). However, since the "specific objectivity" of the Rasch model indicates that the results are specific to the sample, the transformation tables in this article refer specifically to MSK problems [57]. Consequently, it is an empirical matter whether or not the existing transformation holds in other conditions (thus making a true generic scale).

Limitations
This study has some weaknesses. Geographical and cultural nuances of the study sample may limit the transferability of the results to other MSK populations, although individuals with multiple cultural backgrounds and various injury types, time points since injury, or professions were included. Nevertheless, the more significant concern may be that only 11% of the sample were female, as gender is known to lead to DIF in a substantial part of self-reported questionnaires [58].
Hence, clinicians and researchers should consider biological or gender differences when applying WORQ in a mixed or predominantly female population. The sample was recruited with convenience sampling, which also limits the representativeness of the results to the population.
Although the final testlet solution for WORQ satisfied all Rasch model requirements, the four-testlet strategy for WORQ was somewhat forced by the restrictions of the data analysis software RUMM 2030, which only allows testlets with a maximum of 100 thresholds, e.g., ten items rated with a 0-10 rating scale. This software restriction also prevented us from testing a two-testlet solution for the full version of WORQ and a testlet solution including the clinical subscores (which would have provided more robust fit statistics) [59]. Fortunately, this restriction proved unimportant for the current analysis, as the final solutions showed excellent model fit and no marginal drop in PSI for WORQ or WORQ-BRIEF; hence, no further testing was required.

Clinical Use of WORQ, WORQ-BRIEF, and Clinical Subscores
While WORQ was initially designed as a questionnaire to detect and understand problems in work-related functioning from the client's perspective, the findings of this study suggest that WORQ may also be used to measure reliable change in work-related functioning of individuals and groups in MSK VR [60]. The subscores may be used to better explore the extent of one of the underlying traits: cognition, emotion, dexterity, or mobility. They may also help to disentangle the complex construct of work-related functioning in the decision-making process, help to monitor change in specific areas of functioning more closely, and therefore allow one to address the needs of these groups more specifically (Figure 1b).
WORQ-BRIEF, on the other hand, can be used as a fast screening questionnaire of the major areas of WORQ related problems. It is debatable if WORQ-BRIEF is more suitable in research than WORQ. On the one hand, it is much shorter, but WORQ-BRIEF is also less sensitive according to the MDC and the reliability measured by the PSI.
Although WORQ was initially designed for use in vocational rehabilitation, newer findings show that WORQ is also reliable in detecting the needs of patients with small levels of work-related functioning in physiotherapy outpatient settings [61]. Based on this finding, it should be further evaluated whether WORQ could also be used in public health in the context of health prevention and to support sustainable employment. Sustainable employment characterizes a person-job-workplace match that enables a person to stay healthy and satisfied at work over time, with a work performance that meets the expectations of the person and the employer [62]. A focus on prevention of work-related disability is becoming increasingly important, as an increasing number of workers experience a decline in work-related functioning while aging. An instrument that detects change in functioning early may help to identify those who need support and to prevent reduced work ability or early exit from the labor market. Using a generic instrument such as WORQ to monitor the level of functioning in workers at risk may also follow the notion of the Organisation for Economic Co-operation and Development (OECD) that patient reported outcome measures are key to facilitating high-value care to gain a complete picture of what happens to people across the pathway of care [63].

Conclusions
In conclusion, the French version of WORQ, WORQ-BRIEF, and the clinical subscores are reliable and well targeted measures of work-related functioning. WORQ can be used in clinical practice to reliably measure change and comprehensively identify problems in work-related functioning in MSK. The clinical subscores may help to guide clinical decision making and intervention planning in VR or occupational health. Transformation tables (Table S3) enable the calculation of change scores to document reliable change in functioning, while clinical subscores may help to detect underlying patterns in functioning. With only 13 items, WORQ-BRIEF may be more practical than WORQ for use in research or for comparing populations across health conditions. Further studies are needed to evaluate psychometric properties of WORQ in other health conditions, with a specific focus on populations with mental health problems or multiple comorbidities, and in prevention activities.

Funding: This research was funded by a grant from suva (the Swiss accident insurance).