Differences in Center for Epidemiologic Studies Depression Scale, Generalized Anxiety Disorder-7 and Kessler Screening Scale for Psychological Distress Scores between Smartphone Version versus Paper Version Administration: Evidence of Equivalence

The use of electronic patient-reported outcomes has increased in recent years, and smartphones offer distinct advantages over other devices. However, the reliability of the Center for Epidemiologic Studies Depression Scale (CES-D), Generalized Anxiety Disorder-7 (GAD-7), and Kessler Screening Scale for Psychological Distress (K6) when administered on smartphones has not been fully explored. This study aimed to evaluate the equivalence of the paper and smartphone versions of the CES-D, GAD-7, and K6, which were compared in a randomized crossover design in 100 adults in Gunma, Japan. Participants completed the paper and smartphone versions at a 1-week interval. The equivalence of the paper and smartphone versions was evaluated using the intraclass correlation coefficient (ICCagreement). The mean participant age was 19.86 years (SD = 1.08; 23% male). The ICCagreements for the paper and smartphone versions of the CES-D, GAD-7, and K6 were 0.76 (95% confidence interval [CI] 0.66–0.83), 0.68 (95% CI 0.59–0.77), and 0.83 (95% CI 0.75–0.88), respectively. Thus, the CES-D and K6 scales are appropriate for use in a smartphone version, which could be applied in clinical and research settings where the paper or smartphone version can be used as needed.


Introduction
The use of patient-reported outcomes (PROs) is necessary because of several advantages [1][2][3]. Previous studies have shown that using PROs to systematically monitor patient symptoms improves patient-physician communication and symptom monitoring and narrows the gaps between patients' health and quality of life and clinicians' perception of their symptoms [1][2][3]. PROs are also widely used in the mental health field, and mental health clinicians suggest that using PROs in patient consultations can support treatment decisions and severity assessment [4,5]. Depressive symptoms, anxiety, and psychological distress are particularly common in mental health, and these symptoms may coexist and affect each other [6][7][8][9][10][11][12]. It is therefore crucial to use PROs to thoroughly evaluate not just one symptom but depressive symptoms, anxiety symptoms, and psychological distress together. Many PROs currently exist to measure depressive and anxiety symptoms and psychological distress, for example, the Center for Epidemiologic Studies Depression Scale (CES-D) [13,14], the Generalized Anxiety Disorder-7 (GAD-7) [15,16], and the Kessler Screening Scale for Psychological Distress (K6) [17,18].

Study Design
This study was conducted using a randomized crossover design to assess the format equivalence of the paper and smartphone versions of the CES-D, GAD-7, and K6. Figure 1 depicts the process of the randomized crossover design used in this investigation. The study was conducted in accordance with ISPOR guidelines [42] and was approved by the Ethical Review Board for Medical Research Involving Human Subjects of Gunma University (Approval no. HS2022-109). Written informed consent was obtained from each participant before study participation.


Participants and Procedure
The study participants were recruited between October 2022 and December 2022 from Gunma University in Gunma, Japan. Recruitment was conducted by posting posters at Gunma University, and participation was also encouraged via e-mail and social networking services. Individuals aged ≥18 years who were native Japanese speakers and owned a smartphone were eligible for this study. Participants who met the eligibility criteria were asked to complete the CES-D, GAD-7, and K6 scales (paper and smartphone versions) after providing demographic information (age and sex) and lifestyle characteristics (i.e., drinking, exercise, and smoking habits). The order in which the PROs were completed (paper version first or smartphone version first) was randomly determined. To reduce potential recall and carryover effects, the interval between the two administrations was 1 week.


Randomization
Participants were randomly assigned in a 1:1 ratio to complete either the paper version or the smartphone version of the questionnaires (CES-D, GAD-7, and K6) first. The randomization list was generated with a permuted block method (block size 4) using a computer (Microsoft Excel) by a third party unrelated to the study. The randomization list was sent to the Central Registry Center at Kurashiki Heisei Hospital in Okayama Prefecture, Japan, for random assignment.
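The permuted-block procedure described above can be sketched as follows. This is a minimal illustration only: the study generated its list in Microsoft Excel, and the function name, argument labels, and seed below are hypothetical.

```python
import random

def permuted_block_randomization(n_participants, block_size=4,
                                 arms=("paper_first", "smartphone_first"),
                                 seed=2022):
    """Generate a 1:1 allocation list using permuted blocks.

    Each block of `block_size` contains every arm an equal number of
    times in a random order, so the allocation stays balanced
    throughout recruitment.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed for a reproducible list
    per_arm = block_size // len(arms)
    allocation = []
    while len(allocation) < n_participants:
        block = list(arms) * per_arm   # e.g. 2 paper-first + 2 smartphone-first
        rng.shuffle(block)             # randomize order within the block
        allocation.extend(block)
    return allocation[:n_participants]
```

With 100 participants and block size 4, this yields exactly 50 per arm, with balance maintained after every fourth assignment.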

Sample Size
The ISPOR guidelines report that 43 participants with no missing data are needed to declare an ICC of ≥0.7 at 80% power and a 95% confidence level if the ICC observed between two measurements is expected to be 0.85, using the approximation of Walter et al. [42,55]. In addition, the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) initiative suggests that a sample size of ≥100 is necessary to obtain adequate statistical power when evaluating test-retest reliability [56]. Taken together, these findings suggested a target sample size of 100 study participants.

K6
The K6 is a 6-item self-report questionnaire used to measure psychological distress, using a 0-4 Likert scale (0 = none of the time, 1 = a little of the time, 2 = some of the time, 3 = most of the time, and 4 = all of the time) [17,18]. Total scores range from 0 to 24, with higher scores indicating greater psychological distress. Previous studies have reported the reliability and validity of the K6 score [10,17,18,26,27].
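As a concrete illustration, K6 scoring reduces to summing the six item responses after a range check. The sketch below is ours (the function name is hypothetical), not part of the original instrument.

```python
def score_k6(responses):
    """Total a K6 questionnaire.

    `responses` is a sequence of six item scores, each on the 0-4 scale
    (0 = none of the time ... 4 = all of the time). Returns the total
    score (0-24); higher scores indicate greater psychological distress.
    """
    if len(responses) != 6:
        raise ValueError("the K6 has exactly 6 items")
    if any(not (0 <= r <= 4) for r in responses):
        raise ValueError("each item must be scored from 0 to 4")
    return sum(responses)
```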

Software
Electronic versions of the CES-D, GAD-7, and K6 were provided on participants' smartphones using Google Forms. The questionnaires were presented in the order CES-D, GAD-7, and K6. The questions, answer choices, and question order in the electronic version were the same as in the paper version of the three scales. Each questionnaire was presented on a separate page, with all of its questions displayed on a single screen; scrolling down moved the user to the next item. After answering all the questions in a questionnaire, participants pressed the "Next" button to proceed to the next questionnaire (specifically, the 20 CES-D items were displayed on a single page, and after answering all of them, pressing "Next" moved to the GAD-7 page). Participants selected their answers by tapping radio buttons on the screen. It was not possible to move to the next page without answering every item or to select two answers to the same question; however, a previous answer could be changed by pressing the "Back" button.

Statistical Analysis
In this study, the switch from the paper version to the smartphone version corresponds to a light-to-moderate modification as defined by the ISPOR guidelines [42]. Accordingly, to confirm the equivalence of each scale between the paper and smartphone versions, the intraclass correlation coefficient (ICCagreement) and its 95% confidence interval were calculated based on a two-way random-effects model, one of the most commonly used statistical measures in equivalence studies of this kind [42,58]. Unlike the Pearson and Spearman correlation coefficients, the ICCagreement is more appropriate for assessing agreement because it accounts for systematic as well as chance errors [56,59]. The ICC is expressed as a value between 0 and 1, with values >0.70 indicating adequate reliability [56,58]. The internal consistency of the paper and smartphone versions of each questionnaire was calculated using Cronbach's alpha and McDonald's omega, together with their 95% confidence intervals (CIs); both indices range from 0 to 1 and increase with the degree of correlation between items [60]. Good internal consistency is defined as Cronbach's alpha and McDonald's omega values of 0.7 or above [59,60]. In addition, linear mixed models (LMMs) were used to test for carryover effects on each scale score [61]. In the LMM, the questionnaire administration format (paper or smartphone version), order of administration (paper or smartphone version first), and their interaction were treated as fixed effects, whereas participants were treated as random effects. Statistical significance was set at p < 0.05 (two-tailed). All analyses were performed in R (version 4.0.2 for Windows; The R Project for Statistical Computing, Vienna, Austria).
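For reference, the two indices above can be computed directly from the two-way ANOVA decomposition. The sketch below (in Python rather than the R used in the study) implements the single-rater, absolute-agreement form commonly labeled ICC(2,1), together with the standard Cronbach's alpha formula; it is an illustration of the statistics, not the study's actual analysis code.

```python
import numpy as np

def icc_agreement(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    x is an (n_subjects, k_ratings) array, e.g. one column of paper-version
    totals and one column of smartphone-version totals per participant.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-format means
    ms_r = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between subjects
    ms_c = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between formats
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    ms_e = sse / ((n - 1) * (k - 1))                       # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_vars / total_var)
```

Because the absolute-agreement ICC includes the between-format mean square in its denominator, a systematic shift between the paper and smartphone versions lowers it, which a Pearson correlation would not detect.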

Characteristics of the Study Participants
All 100 participants who met the eligibility criteria completed the paper and smartphone versions of the questionnaires and provided complete data. In the paper-first group, 50 participants completed the paper version first; in the smartphone-first group, 50 participants completed the smartphone version first. The mean age of the study participants was 19.86 years (SD = 1.08; 23% male); 9 (9%) had a drinking habit, 1 (1%) had a smoking habit, and 37 (37%) had an exercise habit (Table 1).

Mean and LMM Results
The mean values for each group and the LMM results are shown in Table 2. The interaction between questionnaire format and order of administration was not significant for the CES-D score (p = 0.96; 95% CI −1.71 to 1.79), the GAD-7 score (p = 0.96; 95% CI −0.82 to 0.78), or the K6 score (p = 0.17; 95% CI −1.31 to 0.23). Based on these results, no carryover effects were observed.

Discussion
This study evaluated the measurement equivalence of the smartphone and paper versions of the CES-D, GAD-7, and K6. The results suggest that the CES-D and K6 meet the agreement criterion recommended by the ISPOR guidelines [42]. Considering the ICC and Cronbach's alpha criteria, the smartphone versions of the CES-D and K6 are considered suitable for use at least at the group level; in other words, they may not be suitable for use at the individual level. However, it is important to note that the 95% CI of the ICCagreement was 0.75-0.88 for the K6 and 0.66-0.83 for the CES-D. This 95% CI indicates that, with 95% probability, the true ICCagreement for the CES-D is 0.83 in the best case and 0.66 in the worst case [62]. Therefore, while the smartphone and paper versions of the CES-D may show good agreement, the true agreement may also fall below the 0.7 threshold considered adequate. An ICC below 0.7, however, may reflect not only low agreement on the scale but also issues of study design, such as low inter-subject variability in the sample and small sample size [63]. Because our sample size was sufficient by the COSMIN recommendation for assessing the ICC, the low variability among sampled participants probably affected the precision of the ICCagreement estimates [63]. Our sample was restricted to a relatively young population (18-22 years old). Further investigation in a broader age range is therefore required to provide more accurate estimates of the ICCagreement and its 95% CI.
The Cronbach's alpha for the smartphone version of the GAD-7 was 0.80 (95% CI 0.75-0.86), identical to that of the paper version (0.80; 95% CI 0.75-0.86). McDonald's omega was likewise 0.83 (95% CI 0.76-0.88) for both the smartphone and paper versions, indicating strong internal consistency. However, the ICCagreement for the GAD-7 was 0.68 (95% CI 0.59-0.77), suggesting low concordance between the smartphone and paper versions. This low ICCagreement could be attributed to the changes accompanying the transition from the paper to the smartphone version. In this study, participants scrolled the screen to answer the items in each smartphone questionnaire, and the questions and response options were displayed in different positions in the paper and smartphone versions. These changes are defined as a moderate level of modification in the ISPOR guidelines, the level of modification that requires an equivalence assessment [42]. For the GAD-7, these changes from the paper to the smartphone version may not have been suitable. Future studies should create a smartphone version of the GAD-7 with a display format more similar to the paper version and evaluate its equivalence. It is also essential to note that the 95% CI for the ICCagreement of the GAD-7, as for the CES-D, was 0.59-0.77, meaning that the true ICCagreement is 0.77 in the best case and 0.59 in the worst case, with 95% probability [62]. Thus, although the agreement between the smartphone and paper versions of the GAD-7 might ultimately exceed the 0.7 criterion, it might also fall well below it. Moreover, even for the GAD-7, the effect of the low sample variability in this study cannot be ignored [63]. Hence, as with the CES-D, further research in a broader age range is required to estimate the ICCagreement and its 95% CI more precisely.
As far as we could determine, no studies have tested the equivalence of the electronic and paper versions of the K6 and GAD-7. However, previous studies have examined the equivalence of electronic and paper versions of the CES-D. A study of 2400 teachers in Taiwan, which tested the equivalence of an Internet-based CES-D and the paper-based CES-D, found little difference in latent means and concluded that the Internet-based CES-D is a promising alternative to the paper-based version [53]. In addition, the equivalence of paper- and tablet-based administration was tested in 79 patients with low back pain, yielding an ICC of 0.75 (0.64-0.83), which is comparable to our results [52]. Another previous study tested the equivalence of PC- and paper-based CES-Ds and reported a correlation coefficient of 0.96 [64]. However, the Pearson and Spearman correlation coefficients are not sufficiently rigorous parameters for assessing equivalence because they do not account for systematic errors [42,59]. Considering the results of these previous studies and the potential advantages of smartphones (easy and ubiquitous accessibility), the smartphone version of the CES-D may be a promising alternative to PC- and tablet-based versions.
This study has several limitations. First, the study participants were a relatively young population, aged 18-22 years; therefore, the results may not apply to other age groups. Second, the influence of carryover effects cannot be ignored. In a crossover design, a carryover effect may occur if the interval between the first and second evaluations is short. We tried to reduce the carryover effect as much as possible by setting the interval between the first and second evaluations at 1 week, and indeed no statistically significant carryover effects were found in this study. However, given the lack of consensus on the ideal administration interval when testing the equivalence of PROs [65], the influence of carryover effects must be considered carefully. Third, the smartphone and paper versions of the PROs were administered in the same room under the supervision of a researcher. If participants had responded to the smartphone version without meeting the researcher face to face, they would have been more anonymous than in our study and could have responded in a more natural setting. Therefore, the presence or absence of a supervisor, and the effect of location (such as a clinic or home setting), should be fully considered. On the other hand, responding in the same room as the researcher made it possible to prevent omissions in the paper version and to control the test conditions, reducing the likelihood of noise, distraction, and fatigue [66]. Fourth, to limit participant burden, this study did not include cognitive debriefing or usability testing, which correspond to the assessments classified as appropriate for minor modifications. Future studies should incorporate cognitive debriefing and usability testing of the smartphone versions of the CES-D, GAD-7, and K6, as these factors may considerably affect their usefulness in research and clinical contexts.
Fifth, the smartphone versions of the CES-D, GAD-7, and K6 employed in this study could not be submitted until all items were answered. Because participants were required to answer items they could have skipped in the paper version, the forced responses may have affected the equivalence results obtained in this study and should be considered carefully. Future studies should examine the equivalence of the paper and smartphone versions of the CES-D, GAD-7, and K6 with a "choose not to answer" or "skip question" option included. Sixth, participants completed the CES-D, GAD-7, and K6 in that order in both the paper and smartphone versions, so the impact of ordering effects cannot be ruled out and should be taken into consideration when interpreting the findings of this study.

Conclusions
This study demonstrated the equivalence of the paper and smartphone versions of the CES-D and K6. Accordingly, both the CES-D and K6 scales are appropriate for use in a smartphone version, which could be applied in clinical and research settings where the paper or smartphone version can be selected as needed. However, the paper and smartphone versions of the GAD-7 should not be used interchangeably, as they did not show equivalence because of a low ICCagreement; further research is needed.

Informed Consent Statement:
Informed consent was obtained from all participants involved in the study. Written informed consent was obtained from the participants to publish this paper.

Data Availability Statement:
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.