The Use of Single-Item Ratings Versus Traditional Multiple-Item Questionnaires to Assess Mood and Health

Collecting real-world evidence via ‘at home’ assessments in ambulatory patients or healthy volunteers is becoming increasingly important, both for research purposes and in clinical practice. However, given the mobile technology that is frequently used for these assessments, concise assessments are preferred. The current study compared single-item ratings with multiple-item subscale scores of the same construct, by calculating the corresponding Bland and Altman 95% limits of agreement interval. The analysis showed that single-item ratings are usually in good agreement with assessments of their corresponding subscale. In the case of more complex multimodal constructs, single-item assessments were much less often in agreement with multiple-item questionnaire outcomes. The use of single-item assessments is advocated as they more often incorporate assessments of all aspects of a certain construct (including the presence, severity, and impact of the construct under investigation) compared to composite symptom scores.


Introduction
During the COVID-19 pandemic, collecting real-world evidence via 'at home' assessments in ambulatory patients or healthy volunteers became increasingly important, both in research and in clinical practice. Often, mobile technology is applied to collect data in real-time, without requiring hospital visits which may be more demanding in pandemic times and a burden to patients and healthcare providers [1,2]. There are, however, limitations to the design of mobile assessments [2], and the restrictions of small screens advocate for concise assessments.
For many constructs such as sleep quality and hostility, there are no readily and easily applicable biomarkers available. Researchers, therefore, have to rely on patient-reported outcome measures (PRO) via questionnaires. It can be questioned if traditional, often lengthy, questionnaires to assess mood, quality of life, and health correlates are either necessary or practical. The use of these multiple-item questionnaires may be a burden to patients, and increase the risk of noncompletion of clinical and research assessments. Single-item assessments could be an alternative to prevent this.
A valid and reliable PRO is essential, and it should capture all important aspects of the condition under investigation (see Figure 1). As illustrated in Figure 1, a good PRO should evaluate the presence, severity, and impact of a condition. As such, the development of a valid PRO may not always be a straightforward venture [3]. Psychological constructs, psychological states, and diseases are often characterized by more than one aspect (in Figure 1 referred to as A, B, and C). For example, attention deficit hyperactivity disorder (ADHD) is characterized by inattention, hyperactivity, and impulsivity, and insomnia is characterized by sleep initiation problems and/or sleep maintenance problems that have a negative impact on daytime functioning. More complex constructs such as schizophrenia, alcohol hangover, or irritable bowel syndrome may require the assessment of more individual aspects to describe the full syndrome. When a questionnaire assesses the presence and severity of the characteristic symptoms of a disease, the result is either a lengthy questionnaire that captures all of them or a short questionnaire that omits several symptoms that may be important to individual patients. Moreover, such questionnaires usually do not assess the impact on daily activities.
Given this, the US Food and Drug Administration suggested that single-item assessments may even be preferred, as they incorporate the subjects' evaluation of the presence, severity, and impact of a condition, with greater subject-focused information value than the specific symptom-based sum score of multiple-item questionnaires can provide [4]. The question, however, arises as to whether single-item assessments are as effective as their corresponding multiple-item questionnaires. If they are, then single-item assessments would be a cost-effective and time-saving strategy to examine mood, quality of life, and health correlates in clinical monitoring and in experimental research that follows patients over time. The single-item assessment provides a real-time, directly available outcome. Single-item assessments could, for example, be implemented in randomized clinical trials where assessment windows relative to treatment intake are often limited, with patients for whom the completion of lengthy questionnaires may be a burden (e.g., the elderly), or in clinical practice to determine if a more thorough evaluation of a patient is warranted. The purpose of the current analysis was to evaluate whether assessments of single-item scales are in agreement with their corresponding multiple-item scales.

Materials and Methods
To compare single-item and multiple-item scales, data from Balikji et al. [5] were re-evaluated. In this study, N = 2489 participants (83.4% women) completed an online survey on 'food and health'. The Dutch participants, aged 18 to 30 years, were recruited via Facebook. Their mean (standard deviation, SD) age was 21.3 (2.1) years. For the current analysis, we re-evaluated assessments of mood, mental resilience, insomnia, and irritable bowel syndrome (IBS), which are described below.

Profiles of Mood States-Short Form (POMS-SF)
The short version of the Profiles of Mood States (POMS-SF) was completed to assess mood [6][7][8]. The Dutch version comprises 32 items that are scored on a 5-point Likert scale (0 = not at all, 4 = extremely) and has five subscales assessing tension (6 items), depression (8 items), anger (7 items), fatigue (6 items), and vigor (5 items). Higher scores on the scales imply more psychological distress. In addition to the subscales, a total psychological distress score was computed as the sum score of all 32 items. For the comparison between scale scores and single-items, the items "angry" (anger), "blue" (depression), "tense" (tension), "vigorous" (vigor), "fatigued" (fatigue) were selected.

The Depression Anxiety Stress Scales (DASS-21)
The Dutch version of the Depression Anxiety Stress Scales (DASS-21) was completed to assess stress, anxiety, and depression [9][10][11]. The scale consists of 21 items that are scored on a 4-point Likert scale (0 = not at all, 3 = very much or most of the time). Sum scores are computed for three scales assessing depression (7 items), anxiety (7 items), and stress (7 items). A higher scale score is associated with a greater level of depression, anxiety, and/or stress. In addition to the subscales, a total psychological distress score was computed as the sum score of all 21 items. For the comparison between scale scores and single-items, the items "I felt down-hearted and blue" (depression), "I felt scared without any good reason" (anxiety), and "I found it difficult to relax" (stress) were selected.

Brief Resilience Scale (BRS)
Mental resilience was evaluated utilizing the Brief Resilience Scale [12]. The BRS comprises 6 items that assess one's ability to recover from stress, i.e. to bounce back. The 6 items can be rated on a 5-point Likert scale ranging from 'strongly disagree' to 'strongly agree'. Scores range from 1 to 5, and reversed scoring is applied to some of the items. The mean score of the six items was computed to represent the level of mental resilience. A higher score implies a higher level of mental resilience. Previous research revealed that mental resilience scores correlated significantly with personality characteristics, psychological coping strategies, and health correlates [12,13]. For the comparison between scale scores and single-items, the item "I tend to take a long time to get over set-backs in my life" was selected.
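The BRS scoring rule described above (reverse scoring of some items, followed by the mean of all six) can be sketched as follows. This is a minimal illustration, not the scoring script used in the study. The text only states that "some of the items" are reverse-scored; the sketch assumes the conventional BRS keying in which items 2, 4, and 6 are the negatively worded items. The names `REVERSED_ITEMS` and `brs_score` are hypothetical.

```python
# Assumption: items 2, 4, and 6 (1-based) are the negatively worded,
# reverse-scored items. The text above does not name them.
REVERSED_ITEMS = {2, 4, 6}


def brs_score(responses):
    """Return the mean BRS score for six responses coded 1..5.

    Reverse-scored items are recoded as 6 - response, so that a higher
    value always reflects higher mental resilience.
    """
    if len(responses) != 6:
        raise ValueError("expected exactly 6 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")
    recoded = [
        6 - r if position in REVERSED_ITEMS else r
        for position, r in enumerate(responses, start=1)
    ]
    return sum(recoded) / len(recoded)
```

Under this keying, a respondent who strongly agrees with every positively worded item and strongly disagrees with every negatively worded item obtains the maximum score of 5.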

SLEEP-50 Insomnia Subscale
The 9-item insomnia subscale of the SLEEP-50 questionnaire [14] was completed. Each item is scored on a 4-point scale: 1 (not at all), 2 (somewhat), 3 (rather much), and 4 (very much). A total insomnia score was computed as the sum score of all 9 items. For the comparison between scale scores and single-items, the items "I have difficulty in falling asleep" (sleep initiation problems) and "After waking up during the night, I fall asleep slowly" (sleep maintenance problems) were selected.

Birmingham IBS Symptom Questionnaire
The presence and severity of IBS symptoms were assessed with the Dutch version of the Birmingham IBS Symptom Questionnaire [5,15,16]. The questionnaire consists of 11 items with six answer options, ranging from 0 ('none of the time') to 5 ('all of the time'). Scores on three symptom-specific scales representing the factors 'diarrhea' (5 items), 'constipation' (3 items), and 'pain' (3 items) were calculated. Higher scores imply more complaints. In addition to the subscales, a total IBS score was computed as the sum score of all 11 items. For the comparison between scale scores and single-items, the items "How often have you had discomfort or pain in your abdomen?" (pain), "How often have you been troubled with diarrhea?" (diarrhea), and "How often have you been troubled by constipation?" (constipation) were selected.

Single-Item Selection
The single-item was part of the respective subscale and was selected by the authors (J.C.V. and G.B.) based on how well it represented the overall construct and on whether scores on the item had the highest correlation with the respective construct. For most subscales, this was a straightforward selection, as the subscale comprised an item that named the overall scale construct (e.g., the item "fatigued" of the POMS-SF fatigue scale). For other scales, based on the investigators' judgment, an item was chosen that best represented the overall assessed construct (e.g., "I found it difficult to relax" for the DASS-21 stress subscale). The insomnia scale has no subscales. For this scale, two items instead of one were selected, to reflect that insomnia can be characterized by sleep initiation problems, sleep maintenance problems, or both.
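The correlation-based part of this selection can be illustrated with a short sketch. It is not the authors' procedure, which also relied on investigator judgment. It assumes that Spearman's rho is the correlation of interest (as in the Statistical Analyses) and that each item is correlated with the sum score of the subscale it belongs to. All function names and the data layout (a dictionary of item name to list of scores) are hypothetical. The sketch does not handle an item on which every participant gives the same answer, since the correlation is undefined in that case.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the tie-corrected ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def best_item(item_scores):
    """Return (item name, rho per item) for the item that correlates
    most strongly with the subscale sum score."""
    names = list(item_scores)
    n_participants = len(item_scores[names[0]])
    totals = [sum(item_scores[name][p] for name in names) for p in range(n_participants)]
    rhos = {name: spearman_rho(item_scores[name], totals) for name in names}
    return max(rhos, key=rhos.get), rhos
```

Because each item is itself part of the sum score, these item-total correlations are somewhat inflated; a corrected item-total correlation (excluding the item from the sum) would be a reasonable alternative design.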

Statistical Analyses
Statistical analyses were conducted with SPSS (IBM Corp. Released 2013. IBM SPSS Statistics for Windows, Version 25.0. Armonk, NY, USA: IBM Corp.). First, Spearman's rank correlation coefficients (rho) were computed between single-item and corresponding full-scale scores. Correlations were considered statistically significant if p < 0.05 (in a two-tailed test). Second, to evaluate whether the outcomes of the single-item assessments agreed with those of their corresponding subscales, the 95% limits of agreement method by Bland and Altman [17] was applied. To make the two assessments directly comparable, the single-item scores were multiplied by the number of items of the corresponding subscale. Difference scores (DIFF) between the two outcomes (full-scale score − single-item score) and the corresponding standard deviation (SD_DIFF) were computed. The 95% limits of agreement method states that agreement between methods (in this case single-item versus full-scale assessment) can be concluded if 95% of the DIFF scores lie between (mean DIFF − 1.96 × SD_DIFF) and (mean DIFF + 1.96 × SD_DIFF). In other words, the outcomes of the two assessments are significantly different if more than 5% of the difference scores lie outside the 95% limits of agreement interval. Bland and Altman plots were produced, which show the difference between the single-item assessments (multiplied by the number of subscale items) and the subscale assessments for each subject against the mean score of the two assessments [17]. If the assessments are identical, data points are close to the line of equality (zero) and 95% of the data points lie between the lower and upper limits of the 95% limits of agreement interval. The plots also illustrate the presence of possible extreme or outlying observations.
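The limits of agreement computation described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the SPSS procedure used in the study: the function name `limits_of_agreement` and the returned dictionary keys are hypothetical, and the sample standard deviation (n − 1 denominator) is assumed for SD_DIFF.

```python
from statistics import mean, stdev


def limits_of_agreement(single_item, scale_score, n_items):
    """Compare single-item scores with multiple-item scale scores.

    single_item: one score per participant on the selected item
    scale_score: the corresponding sum score of the scale per participant
    n_items:     number of items in that scale, used to rescale the single item
    """
    # Put the single item on the same range as the sum score.
    rescaled = [score * n_items for score in single_item]
    # DIFF = scale score minus rescaled single-item score.
    diffs = [full - single for full, single in zip(scale_score, rescaled)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # assumption: sample SD (n - 1 denominator)
    lower = mean_diff - 1.96 * sd_diff
    upper = mean_diff + 1.96 * sd_diff
    n_outside = sum(1 for d in diffs if d < lower or d > upper)
    return {
        "mean_diff": mean_diff,
        "lower": lower,
        "upper": upper,
        "pct_outside": 100.0 * n_outside / len(diffs),
    }
```

Under the criterion described above, agreement would be concluded when `pct_outside` does not exceed 5. For the full-scale comparisons, the same function would be called with `n_items` set to the number of items in the full scale.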
Third, applying the same Bland and Altman methodology [17], it was evaluated whether the outcomes of the single-item assessments are in agreement with the corresponding full-scale outcomes of the POMS-SF, DASS-21, Birmingham IBS scale, and SLEEP-50 insomnia scale. For this comparison, the single-item scores were multiplied by the number of items of the corresponding full-scale.

Results
Table 1 summarizes the correlations between the single-item assessments and the corresponding subscale and full-scale scores. All correlations were highly significant. Correlations between single-item scores and subscales were considerably higher than correlations between single-item scores and the full-scale scores. Results of the comparisons between single-item scores and subscale scores are summarized in Table 2, and the corresponding Bland and Altman plots are shown in Figure 2. It is evident from Table 2 that for most assessments, the single-item was as effective as the corresponding subscale, as the percentage of difference scores outside the 95% limits of agreement was below 5%. No agreement between single-item and subscale scores was found for POMS-SF anger (Table 2). Table 3 summarizes the results for the comparisons between single-item scores and full-scale scores. No agreement was found for IBS constipation and diarrhea; for POMS-SF anger, depression, and tension; for DASS-21 anxiety and depression; or for sleep initiation and sleep maintenance problems.

Discussion
To enable mobile, user-friendly assessments, it is desirable that these are short, reliable, and valid. The comparisons between single-item assessments and their corresponding multiple-item subscales in the current study reveal that the vast majority of single-item assessments are in agreement with their lengthier multiple-item subscales. With one exception (POMS-SF anger), it was consistently shown that single-item assessments of unipolar constructs yield similar results to their corresponding more elaborate multiple-item scale.
Agreement between single-item ratings and the full-scales was usually not observed. This is understandable as these scales are composed of several subscales (e.g., POMS-SF and DASS-21) or the scale assesses different constructs that were not represented by a single-item (e.g., sleep initiation and sleep maintenance problems in the SLEEP-50 insomnia subscale). This observation reflects the importance of the fact that a single-item assessment should fully describe all aspects of the construct under investigation. When selecting items from the original scales, the aim was to select the item that best described the construct. For the subscales, this objective was usually achieved, and the single-item univocally described the construct under investigation (e.g., depression and anxiety). However, when assessing constructs that have multiple components (e.g., insomnia or ADHD), assessing only one of these characteristics is usually insufficient to achieve agreement between the single-item rating and multiple-item scale.
Assessing a construct by composing a multiple-item scale may be problematic in itself. This has been observed in studies that aim to assess a construct by combining ratings of symptoms associated with the construct under investigation. For example, in alcohol hangover research, 47 different hangover symptoms have been identified [18], which have a differential impact on cognitive functioning, physical functioning, and mood [19]. There are currently three validated scales that are commonly used, which assess overall hangover severity by calculating the sum of individual items that rate the frequency and/or severity of specific hangover symptoms such as fatigue, nausea, and headache [20][21][22]. Given the large number of known hangover symptoms, it can be questioned whether the composite scale score adequately represents the overall hangover experience. Indeed, recent research revealed that outcomes of single-item assessments of hangover severity are not in agreement with multiple-item assessments [23].
This observation underlines that single-item ratings may be superior to multiple-item ratings, as they are thought to assess the complete experience of the construct under investigation. That is, all aspects deemed relevant to the patient are included in this assessment, including the presence, severity, and impact of symptoms and experiences, as opposed to multiple-item scales, which are at risk of not asking about certain issues that may be relevant to the subject's evaluation. Therefore, especially in the case of complex constructs, direct single-item assessments are preferred. Such single-item assessments are ideal for quick assessments in real-time, which is a clear advantage when using mobile technology, but also in clinical practice when quick results are required, and in clinical trials with time constraints. If more in-depth information about a construct is needed, multiple-item assessments or interviews can be conducted at a later stage. Alternatively, a single-item assessment can indicate whether a more detailed assessment is needed at that point in time.
According to Cohen [24], most correlations between single-item assessments and full-scale outcomes listed in Table 1 can be considered moderate to high (r > 0.5). However, the fact that correlations are highly statistically significant is no proof that two assessments are in agreement, i.e., measuring the same construct with an identical outcome. There are many examples of high and significant correlations between measures that actually assess different constructs. Well-known examples are the correlations between bodyweight and height, or between alcohol consumption and smoking [25]. Therefore, one should never rely on correlations alone to determine if two measures assess the same construct. Instead, the 95% limits of agreement method by Bland and Altman is considered the gold standard to determine if two assessments are in agreement or not [17].
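A small numerical illustration may help to show why a high correlation does not imply agreement. The data below are invented for illustration only and are not taken from the study: a hypothetical second measure is constructed as a linear transformation of the first, so the two correlate perfectly while never producing the same value.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented example data: measure_b is exactly 2 * measure_a + 3.
measure_a = [1, 2, 3, 4, 5]
measure_b = [2 * a + 3 for a in measure_a]

r = pearson_r(measure_a, measure_b)
diffs = [b - a for a, b in zip(measure_a, measure_b)]
mean_difference = mean(diffs)

print(f"r = {r:.2f}, mean difference = {mean_difference}")
```

Here the correlation is perfect, yet the second measure exceeds the first for every observation and the gap widens as scores increase. A correlation coefficient is blind to such systematic differences, whereas an analysis of the difference scores exposes them directly.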
A limitation of the current single-item assessments is that they were all conducted with the Likert scales used in the corresponding full subscales, which allow limited differentiation between the answer options. To allow more differentiation in scoring, for future use, it is proposed to apply an 11-point rating scale for the single-item assessments, ranging from 0 (absent) to 10 (extremely severe).
Bland and Altman [17] stated that "the decision about what an acceptable agreement is a clinical one; statistics alone cannot answer the question". We concur with this viewpoint. It should always be judged whether a single-item rating provides sufficient information in a specific context. For example, it can be argued that complex constructs such as schizophrenia or ADHD can never be accurately represented by a single-item assessment (or single biomarker), and therefore require more thorough and elaborate assessments per se, including assessments with traditional multiple-item questionnaires. Similarly, the single-item human immunodeficiency virus (HIV) risk stage-of-change measure has been shown to disagree with a conventional measure and thus was not reflective of the change of HIV risk [26]. It was concluded that the two measures do not assess the same construct, and that the single-item assessment needed to be revised or its use should be abandoned. Additionally, a survey from 21 countries compared a single-item self-assessment against a multi-item health measure; only the multi-item measure showed a significant correlation with life expectancy [27]. However, the conclusions of this study were based on correlational analysis, and the Bland and Altman 95% limits of agreement method was not applied. Nevertheless, clinical judgement based on the purpose of the study should guide the choice of a single-item versus a multi-item instrument.
Finally, people may respond differently to a single item presented in isolation than to the same item presented in the context of a larger scale composed of conceptually overlapping items. As stated in the introduction, single-item assessments incorporate the subjects' evaluation of the presence, severity, and impact of a condition, with greater subject-focused information value than the specific symptom-based sum score of multiple-item questionnaires can provide [4]. On the other hand, other literature suggests that multiple-item assessments should be favored, as they have significantly greater accuracy when screening or diagnosing patients [28]. In addition, psychometric theory has suggested that multiple-item scales should be preferred, as possible measurement error averages out when multiple item scores are summed to obtain a total score [29]. However, this was not evident from the current analysis, as the Bland-Altman comparisons usually showed agreement between single- and multiple-item assessments. Taken together, whether single-item or multiple-item measurements should be preferred depends on the purpose of the research study or clinical assessment.

Conclusions
In conclusion, the current analysis shows that single-item ratings for irritable bowel syndrome, depression, mood, mental resilience, and insomnia adequately represent the outcomes of traditional multi-item assessments.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.
Conflicts of Interest: E.S. is an employee of Nutricia Danone Singapore. J.G. is a part-time employee of Nutricia Research and received research grants from Nutricia research foundation, Top Institute Pharma, Top Institute Food and Nutrition, GSK, STW, NWO, Friesland Campina, CCC, Raak-Pro, and EU. Over the past 3 years, J.C.V. has received grants/research support from Janssen, Danone, and Sequential Medicine, and has acted as a consultant/advisor for More Labs, Red Bull, Sen-Jam Pharmaceutical, Toast!, Tomo, and ZBiotics. G.B. has nothing to declare.