Psychometric Performance of a Condition-Specific Quality-of-Life Instrument for Dutch Children Born with Esophageal Atresia

A condition-specific instrument (EA-QOL©) to assess quality of life of children born with esophageal atresia (EA) was developed in Sweden and Germany. Before implementing this in the Netherlands, we evaluated its psychometric performance in Dutch children. After Swedish–Dutch translation, cognitive debriefing was conducted with a subset of EA patients and their parents. Next, feasibility, reliability, and validity were evaluated in a nationwide field test. Cognitive debriefing confirmed the predefined concepts, although some questions were not generally applicable. Feasibility was poor to moderate. In 2-to-7-year-old children, 8/17 items had >5% missing values. In 8-to-17-year-old children, this concerned 3/24 items of the proxy-report and 5/14 items of the self-report. The internal reliability was good. The retest reliability showed good correlation. The comparison reliability between self-reports and proxy-reports was strong. The construct validity was discriminative. The convergent validity was strong for the 2-to-7-year-old proxy-report, and weak to moderate for the 8-to-17-year-old proxy-report and self-report. In conclusion, the Dutch-translated EA-QOL questionnaires showed good reliability and validity. Feasibility was likely affected by items not deemed applicable to an individual child’s situation. Computer adaptive testing could be a potential solution to customizing the questionnaire to the individual patient. Furthermore, cross-cultural validation studies and implementation-evaluation studies in different countries are needed.


Introduction
As more newborns with esophageal atresia (EA) survive, long-term morbidities and quality of life become more relevant. Reported morbidities include gastrointestinal and pulmonary problems (e.g., dysphagia, gastroesophageal reflux, and/or recurrent airway infections) [1,2], growth retardation [3], reduced exercise capacity [4], and impaired motor function [5]. A significant long-term aspect is the burden of disease, reflected in patient-reported outcome measures (PROMs) on health status (HS) and quality of life (QoL). HS describes a patient's well-being in terms of functioning [6], while QoL focuses on perception

Methods
This is a cross-sectional study consisting of three phases (translation, cognitive debriefing, and field testing), similar to the other EA-QOL© validation studies and following international guidelines regarding the criteria for feasibility, reliability, and validity [14,17,18,22]. The study has been approved by the participating institutional review boards (IRB) (MEC-2019-0521, MEC-20-564/C, MEC-2020-6961 and MEC-2019-631). See Supplementary File S2 for a detailed description of methods and IRB-related data.

Translation and Cognitive Debriefing
A Swedish-Dutch forward-backward translation was conducted [23] and reviewed by the Swedish developer to ensure conceptual equivalence. To ensure that all items were understood as intended by the Dutch target population [24], cognitive debriefing (Supplementary File S3) was conducted in three groups, stratified by severity of complaints (Table S1), as described previously [15]: (A) parents of 2-to-7-year-old children (proxy-report); (B1) parents of 8-to-17-year-old children (proxy-report); and (B2) 8-to-17-year-old children (self-report). Participants were interviewed face-to-face during an annual meeting of the Dutch patient support group in September 2019. Participants of groups B1 and B2 were related and interviewed separately and simultaneously. Participants filled out the questionnaire on paper while giving feedback on the clarity and adequacy of the items. The results were analyzed using content analysis. If necessary, we slightly rephrased instructions and items, after having consulted the Swedish developer. We obtained consensus on the final questionnaires for the field test.

Field Testing
A nationwide field test was conducted between August 2020 and April 2021 in the Netherlands. Participants without known intellectual disability who had sufficient command of the Dutch language were identified from the databases of four university hospitals that cover care for approximately 80% of the Dutch EA population. They were invited through a personal letter containing a personal access code to fill out the questionnaires online (LimeSurvey GmbH version 2.06lts, Hamburg, Germany). Parents presumed to be the child's primary caretaker and/or children ≥8 years old filled out age-appropriate proxy-reports or self-reports of the EA-QOL© questionnaire [14] and the Pediatric Quality of Life Inventory™ 4.0 (PedsQL) questionnaire [25] (Supplementary File S1). A general questionnaire on sociodemographic information and on digestive symptoms, feeding difficulties, and respiratory symptoms in the past four weeks was obtained as proxy-report in children <12 years old and as both self-report and proxy-report in children ≥12 years old. Parental educational level was classified according to the International Standard Classification of Education [26].
To examine the test-retest reliability, participants were invited to fill out the EA-QOL© questionnaire a second time three weeks after the initial response. If needed, up to two reminders were sent. The final reminder also included the questionnaire on paper with a pre-stamped envelope for reply.
The following data were retrieved from the patient records: sex, gestational age, birth weight, type of EA [27], presence of VACTERL (vertebral, anorectal, cardiac, tracheoesophageal, renal, and limb anomalies) association [28], type of primary surgery, postoperative complications (anastomotic leakage, pneumothorax, sepsis, wound infection, or recurrent fistula), history of gastrostomy, and history of esophageal dilatation. EA was considered long gap if staged repair had been performed. Small for gestational age was defined as birth weight <10th percentile [29]. Pneumothorax was defined as the need for a chest tube, sepsis as a positive blood culture, and wound infection according to the surgical site infection criteria of Centers for Disease Control and Prevention [30].

Statistical Analysis
Data are presented as number (%), median (interquartile range), or mean ± SD. Items were answered on a 5-point Likert scale, reverse-scored, and linearly transformed to a 0-100 scale, with 100 as the best possible score. Subscale and total scores were computed as the mean of the item scores, allowing a maximum of 30% missing items per subscale. Items were described as mean ± SD (range). Feasibility (percentage of items with >5% missing values [15]) and psychometric criteria (skewness and kurtosis <2.0) were evaluated [31]. Feasibility was considered poor (>30%), moderate (10-30%), or good (<10%). Subscale and total scores were described as median (IQR) with floor and ceiling effects (percentage of respondents reporting, respectively, the minimum and maximum possible score <15%) [32].
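The scoring and feasibility rules described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code (the study used SPSS). It assumes that raw responses are coded 1 to 5 with 1 as the most favorable answer and that missing answers are stored as None; the function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def item_to_0_100(raw: Optional[int]) -> Optional[float]:
    """Reverse-score a 1-5 Likert answer onto 0-100 (assumed coding: 1 = best)."""
    if raw is None:
        return None
    if not 1 <= raw <= 5:
        raise ValueError("Likert responses must lie between 1 and 5")
    return (5 - raw) / 4 * 100


def subscale_score(raw_items: Sequence[Optional[int]],
                   max_missing_pct: int = 30) -> Optional[float]:
    """Mean of the transformed items, or None if more than 30% are missing."""
    scores = [item_to_0_100(r) for r in raw_items]
    answered = [s for s in scores if s is not None]
    n_missing = len(scores) - len(answered)
    # Integer comparison avoids floating-point trouble at exactly 30%.
    if 100 * n_missing > max_missing_pct * len(scores):
        return None
    return mean(answered)


def feasibility(responses: List[Sequence[Optional[int]]]) -> Tuple[float, str]:
    """Percentage of items with >5% missing values, plus the study's label.

    `responses` holds one row per respondent and one column per item.
    """
    n_resp = len(responses)
    n_items = len(responses[0])
    flagged = 0
    for j in range(n_items):
        n_missing = sum(1 for row in responses if row[j] is None)
        if 100 * n_missing > 5 * n_resp:  # more than 5% missing
            flagged += 1
    pct = 100 * flagged / n_items
    if 100 * flagged > 30 * n_items:
        label = "poor"
    elif 100 * flagged >= 10 * n_items:
        label = "moderate"
    else:
        label = "good"
    return pct, label
```

For example, a four-item subscale answered as 1, 2, missing, 3 has 25% missing items and therefore still yields a score (75.0), whereas two missing answers out of four would leave the subscale unscored.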
Construct validity was determined through known-groups validity; Mann-Whitney U tests served to assess differences in total scores between clinical subgroups: patients with and without primary repair, with gastrostomy or ≥1 esophageal dilatation in history, with and without digestive symptoms, with feeding difficulties, and with respiratory symptoms in the past four weeks. We applied a Bonferroni correction to account for multiple comparisons. As we assessed differences for 20 different variables, alpha was set at 0.05/20 = 0.0025. Effect sizes (ESs) were calculated by converting z-scores of the Mann-Whitney U tests (r = z/√n) [34] and were considered to strengthen the validity if moderate (>0.30) or large (>0.50). Children in clinical subgroups were hypothesized to have lower total scores. Convergent validity was examined by correlating the proxy-reported and self-reported total scores with the concomitant PedsQL scores [25,35] using Spearman's rho (r_s), and classified as poor (<0.40), moderate (0.40-0.59), good (0.60-0.79), or excellent (>0.80). Statistical analyses were performed using SPSS V.24.0 (IBM, Chicago, IL, USA), with a significance level of p < 0.05.
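As a worked illustration of the thresholds above, the following Python sketch computes the Bonferroni-adjusted alpha, converts a Mann-Whitney z-score into the effect size r = z/√n, and labels effect sizes and Spearman correlations. It is a sketch only, not the authors' SPSS procedure. The function names are hypothetical, the sign of z is dropped so that r is reported as a magnitude, and the handling of values that fall exactly on a category boundary is an assumption, because the stated ranges leave small gaps.

```python
import math


def bonferroni_alpha(alpha: float = 0.05, n_tests: int = 20) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_tests


def effect_size_r(z: float, n: int) -> float:
    """Effect size r = |z| / sqrt(n) for a Mann-Whitney U test on n observations."""
    return abs(z) / math.sqrt(n)


def label_effect_size(r: float) -> str:
    """Label r using the study's cut-offs: moderate (>0.30), large (>0.50)."""
    if r > 0.50:
        return "large"
    if r > 0.30:
        return "moderate"
    return "below moderate"  # the text does not name this category


def label_correlation(rs: float) -> str:
    """Label a Spearman correlation; boundary handling is an assumption."""
    if rs < 0.40:
        return "poor"
    if rs < 0.60:
        return "moderate"
    if rs < 0.80:
        return "good"
    return "excellent"


if __name__ == "__main__":
    print(bonferroni_alpha())                   # adjusted alpha for 20 comparisons
    r = effect_size_r(z=-4.0, n=100)            # illustrative z and n, not study data
    print(r, label_effect_size(r))
    print(label_correlation(0.64))
```

A comparison is then treated as significant only if its p-value falls below the adjusted alpha rather than below 0.05.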

Cognitive Debriefing
A review of the translations confirmed the intended conceptual content. Twenty-nine participants (19 parents and 10 children) were recruited for cognitive debriefing. Group A consisted of nine parents (11% male, age range 32-44 years) of children with mild (n = 2), moderate (n = 5), or severe (n = 2) complaints. Group B1 contained ten parents (30% male, age range 41-61 years), and group B2 contained ten children (40% male, age range 9-17 years) with mild (n = 4), moderate (n = 4), or severe (n = 2) complaints. Table S2 summarizes the cognitive debriefing results. Overall, participants understood the items correctly according to the predefined concepts. Parents considered two items ("Can your child eat at the same pace as other children his/her age?" and "Does your child need to think of drinking a lot when he/she eats?") open to multiple interpretations. We rephrased those items as suggested. Although participants considered some items burdensome, none were rejected. Some items, e.g., questions on oral feeding in case of full dependency on (par)enteral feeding or questions on small stature while having physical height within normal ranges, were repeatedly considered not applicable and impossible to answer properly. To keep the translated version in line with the original, we did not adjust the response scale but modified the instructions. In the field test, participants were instructed to omit questions if not applicable. Some participants indicated that they had missed certain topics (Table S3). To preserve the original structure of the questionnaire, we continued to the field test with the questionnaire in its current form.

Internal and External Reliability
Internal reliability was good for the total scores, but the Cronbach's alpha for 'Health and well-being' was <0.7. For the proxy-self comparison, 128 child-parent pairs were available, with good correlation for all subscales (Supplementary File S5, Table S6) and the total score (ICC 0.81). In the retest, 70 parents (69% of the original sample) of 2-to-7-year-old children, 82 parents (60%) of 8-to-17-year-old children, and 71 8-to-17-year-old children (55%) responded. Basic characteristics did not differ between respondents and non-respondents. Respectively, 6%, 17%, and 16% of the questionnaires were returned on paper. None of the children's clinical symptoms differed from those in the initial test. Test-retest agreement was good for the total scores and most of the subscales (Table 3).
Agreement was moderate for 'Social isolation and stress' in the 2-to-7-year-old proxy-report and 'Social relationships' and 'Body perception' in the 8-to-17-year-old self-report.

Construct Validity
Total scores of the 2-to-7-year-old proxy-report were lower for symptomatic children, with moderate to large ESs, except for children with heartburn, chest tightness, and airway infections. Total scores of the 8-to-17-year-old proxy-report were lower for children avoiding certain food, adjusting their portions, or increasing their fluid intake during meals, with moderate ESs. Total scores of the 8-to-17-year-old self-report were lower for children with dysphagia or dyspnea during physical activity and for children adjusting their portions or increasing their fluid intake during meals, with moderate ESs. See Table 4.
Table 4. (a) Comparison between clinical subgroups of total scores of the EA-QOL© questionnaire for children aged 2-7 years old (proxy-reports). Only patients for whom both clinical data and EA-QOL© total scores were available were included in the analyses. Asterisks (*) indicate Bonferroni-adjusted significance, p < 0.0025. (b) Comparison between clinical subgroups of total scores of the EA-QOL© questionnaire for children aged 8-17 years old, proxy-reports (left) and self-reports (right). Only patients for whom both clinical data and EA-QOL© total scores were available were included in the analyses. A: Only children ≥12 years old reported these items. Asterisks (*) indicate Bonferroni-adjusted significance, p < 0.0025.

Convergent Validity
Total PedsQL scores showed a strong correlation with the total EA-QOL© score of the 2-to-7-year-old proxy-report (n = 100, r_s = 0.64, p < 0.001), a weak correlation with the total score of the 8-to-17-year-old proxy-report (n = 135, r_s = 0.39, p < 0.001), and a moderate correlation with the total score of the 8-to-17-year-old self-report (n = 130, r_s = 0.54, p < 0.001). See Table S7 for complete subscale and total PedsQL scores (Supplementary File S6).

Discussion
In this nationwide validation study of a condition-specific PROM for children with EA, we evaluated the psychometric performance of the Dutch-translated EA-QOL© questionnaires. Cognitive debriefing confirmed good understanding of the items according to the predefined concepts, but not all questions were deemed applicable for each individual child. Overall, the field test showed good internal and retest reliability for the total scores and most of the subscales. The construct validity was slightly discriminative. The convergent validity was variable, from weak to strong correlations.
In general, Dutch participants reported higher EA-QOL© scores than those in the Swedish-German validation study. From a clinical perspective, this could be explained by differing perceptions of symptoms, or perhaps by fewer comorbidities. In our population, 2-to-7-year-olds tended to have fewer airway infections, and 8-to-17-year-olds had fewer complaints of heartburn and vomiting than in the Swedish-German study population. None of the Dutch children required parenteral nutrition in the field test, in contrast to 4 out of 124 Swedish-German children [14]. Given the response rates (39-51%), this may imply that patients with a larger symptom burden did not respond, which may have influenced the study findings. Considering the psychological distress of parenteral feeding [38], this difference could have contributed to the higher EA-QOL© scores in the Dutch population.
Another clinical explanation is the potential influence of the COVID-19 pandemic. The second lockdown in the Netherlands overlapped with the field test period. Closure of primary and secondary schools for three and six months, respectively [39], significantly impacted children's social life. Reduced social activities may have resulted in less negative confrontation with impairments of their chronic condition and led to items being less applicable, while healthy children's QoL was negatively affected by COVID-19 [40].
One could argue that the higher Dutch scores are related to test characteristics. Ceiling effects were present in both study populations, but floor effects (all <15%) were observed only in the Swedish-German population. However, validation of the well-established DISABKIDS, CHQ-CF87, and PedsQL instruments showed similar results, with rare floor effects but ceiling effects up to 86% [25,31,41]. Validation of the Dutch version of the CHQ-CF87, cross-culturally adapted from the United States, even showed no floor effects at all [41], as in our study. Moreover, the high level of child-parent agreement favors a clinical rather than a technical explanation for the differences between the Dutch and Swedish-German populations.
Next, the proportion of items with missing values in our study (up to 32.7%) was larger than that in previous studies [14,17], though missing values were not specifically reported for the Polish cohort [18]. Considering the cognitive debriefing results in our study (Table S2), this was anticipated. We instructed participants in the field test to omit the questions they considered not applicable and noted that the omitted questions corresponded with those commented on during cognitive debriefing. Soyer and coworkers, who performed the Turkish field test of the EA-QOL© questionnaires in the outpatient clinic, did not share data on cognitive debriefing [17]. Rozensztrauch et al. reported a positive perception in the Polish cohort but did not specify who conducted the interviews [18]. Differences in study design could explain the abovementioned differences.
The wide, slightly skewed age range within the groups could explain this poor feasibility. Toddlers' perception of potential problems in daily functioning differs from that of school-aged children, and toddlers may be less capable of expressing their burden verbally. Moreover, not every toddler attends daycare, which may differ amongst countries.
In the Netherlands, daycare attendance was even lower during the COVID-19 pandemic. In a longitudinal cohort study, we showed that growth is slightly below the norm in younger children with EA but normalizes at 12 years [3]. This may explain the frequently omitted question on the perception of having a short stature and emphasizes the need for cross-cultural adaptation of the questionnaire.
One could question the rationale of examining the feasibility, reliability, and validity of an instrument for each country separately. However, our results support the current PROM guidelines, which state that this is essential before implementing a translated QoL questionnaire in research or clinical practice.
Differences in clinical presentation and follow-up care in different centers could impact the rating of a child's QoL. Furthermore, one's health perception might be subject to cultural differences between countries [42]. Culture is a multi-aspect concept that requires further exploration in this context. For example, adequate coping skills may lead to positive illusionary bias [43] and, hence, to considering chronic healthcare problems and concomitant lifestyle factors such as dietary restrictions normal. This phenomenon, as well as differential item functioning [44] (measuring different aspects in subgroups of participants due to perceptional differences), should be taken into account when implementing the EA-QOL© questionnaires in clinical practice and during cross-cultural evaluation.
Moreover, small sample sizes and heterogeneity (to which cross-cultural differences contribute) are known challenges for the soundness of PROMs in rare diseases. A possible solution might be computer adaptive testing (CAT), enabling customization of a questionnaire to an individual's situation by using skip patterns that are based on the individual's prior responses to administer items from an item bank [45]. For generic PROMs, the Patient-Reported Outcomes Measurement Information System contains item banks for physical, mental, and social health in adult and pediatric populations, selected from the literature and tested through various extensive item-response theory (IRT) models [46]. CATs have been developed to measure HRQoL in children with chronic conditions [47] but not in rare diseases such as EA. Generating an item bank for condition-specific items requires large sample sizes recruited from multiple countries [22]. Given the strong correlation of condition-specific scores with generic PedsQL scores, the added value of implementing the EA-QOL© questionnaires in clinical practice should first be established. A possible approach is to correlate scores to clinical outcomes, as has been done for the PedsQL and DUX-25 [48]. A next step could be to combine the internationally obtained validation results into an IRT model, using the original EA-QOL© items available before item reduction [16], with the addition of topics brought up during cognitive debriefing in multiple countries. However, further research in additional countries is needed to evaluate the potential of CAT for the EA-QOL© questionnaires in daily practice.
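The skip-pattern idea behind CAT can be sketched in a few lines of Python. This is a toy illustration only: it does not perform IRT-based item selection, the screening question and item identifiers are hypothetical, and it is not part of the EA-QOL© questionnaires. Only the wording of the eating-pace item is taken from the cognitive debriefing results.

```python
from typing import Callable, Dict, List, NamedTuple

Answers = Dict[str, str]


class Item(NamedTuple):
    item_id: str
    text: str
    applies: Callable[[Answers], bool]  # decides from prior answers whether to ask


def always(_: Answers) -> bool:
    return True


def eats_by_mouth(answers: Answers) -> bool:
    return answers.get("screen_oral") == "yes"


# Hypothetical mini item bank; a real bank would be calibrated with IRT models.
ITEM_BANK: List[Item] = [
    Item("screen_oral", "Does your child eat by mouth?", always),
    Item("eat_pace",
         "Can your child eat at the same pace as other children his/her age?",
         eats_by_mouth),
    Item("general", "How has your child felt in the past four weeks?", always),
]


def administer(bank: List[Item], respond: Callable[[Item], str]) -> Answers:
    """Ask each item only if its skip rule allows it, given earlier answers."""
    answers: Answers = {}
    for item in bank:
        if item.applies(answers):
            answers[item.item_id] = respond(item)
    return answers


if __name__ == "__main__":
    # A child fully dependent on tube feeding is never shown the oral-feeding item.
    result = administer(ITEM_BANK, lambda item: "no")
    print(sorted(result))
```

Under such a rule, an item that is not applicable is never presented, so it cannot appear as a missing value in the feasibility analysis.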
One of the strengths of this study is the relatively large sample size, considering that EA is a rare condition. Furthermore, participants were recruited nationwide. Some limitations should be addressed. We recruited participants from hospital databases and not only those who participated in follow-up programs. We did not collect data from non-participants; therefore, selection bias cannot be ruled out. Moreover, due to the relatively low response rates between 39 and 51% despite the two reminders we sent to nonresponders, it remains possible that patients with the most severe symptoms and, therefore, worst QoL did not respond. Furthermore, the parental educational level was higher than in the general population. Although this is a common finding in the EA population [12,48] and in psychometric evaluation in general [49], it should be taken into account that this could lead to bias. Moreover, the online study set-up and some statistical assumptions differed from earlier EA-QOL© validation studies. Next, investigating sex-specific EA-QOL© scores was beyond the scope of this study. Still, it is recognized that females report lower QoL than do males [35,50]. In future cross-cultural evaluation, sex should therefore be considered as a potential confounder [51]. Lastly, the COVID-19 pandemic could have influenced the field test results.

Conclusions
The Dutch-translated EA-QOL© questionnaires showed good reliability and validity. Feasibility was most likely affected by items not deemed applicable to an individual child's situation, as the cognitive debriefing made clear. Following from this, CAT could be a potential solution to making the questionnaires more suitable for clinical practice in the Netherlands. Cross-cultural evaluation of the validation results obtained in multiple countries should further explore this.
Table S1: Stratified sample size for the cognitive debriefing; Table S2: Summary of the results of the cognitive debriefing; Table S3: Overview of topics parents and children missed in the EA-QOL© questionnaire; Table S4: Feasibility of the EA-QOL© questionnaire for 2-7-year-old children (proxy-report, n = 101); Table S5: Feasibility of the EA-QOL© questionnaire for 8-17-year-old children (proxy-report, n = 136 and self-report, n = 130); Table S6: Comparison reliability (child-parent agreements) between proxy-reports and self-reports of the EA-QOL© questionnaire for 8-17-year-old children; Table S7: Subscale and total scores of the previously validated PedsQL questionnaire.
Acknowledgments: This research was generated within two Dutch centers represented in the European Reference Network for rare Inherited and Congenital Anomalies (ERNICA) (not financially supported). We thank all participants for taking part in the cognitive debriefing interviews and for filling out the questionnaires during the field testing. We thank the Dutch patient support group VOKS (Vereniging voor Ouderen en Kinderen met een Slokdarmatresie) for their cooperation. Marinde van Lennep and Maartje Singendonk conducted part of the cognitive debriefing interviews. Ko Hagoort provided editorial advice.

Conflicts of Interest:
The authors of this manuscript have no conflict of interest to disclose.