Measuring Subjective Sleep Quality: A Review

Sleep quality is an important clinical construct since it is increasingly common for people to complain about poor sleep quality and its impact on daytime functioning. Moreover, poor sleep quality can be an important symptom of many sleep and medical disorders. However, objective measures of sleep quality, such as polysomnography, are not readily available to most clinicians in their daily routine, and are expensive, time-consuming, and impractical for epidemiological and research studies. Several self-report questionnaires have, however, been developed. The present review aims to address their psychometric properties, construct validity, and factorial structure while presenting, comparing, and discussing the measurement properties of these sleep quality questionnaires. A systematic literature search, from 2008 to 2020, was performed using the electronic databases PubMed and Scopus, with predefined search terms. In total, 49 articles were analyzed from the 5734 articles found. The psychometric properties and factor structure of the following are reported: Pittsburgh Sleep Quality Index (PSQI), Athens Insomnia Scale (AIS), Insomnia Severity Index (ISI), Mini-Sleep Questionnaire (MSQ), Jenkins Sleep Scale (JSS), Leeds Sleep Evaluation Questionnaire (LSEQ), SLEEP-50 Questionnaire, and Epworth Sleepiness Scale (ESS). As the most frequently used subjective measurement of sleep quality, the PSQI showed good internal reliability and validity; however, different factorial structures were found in a variety of samples, casting doubt on the usefulness of the total score in detecting poor and good sleepers. The sleep disorder scales (AIS, ISI, MSQ, JSS, LSEQ and SLEEP-50) showed good psychometric properties; nevertheless, a variety of factorial models were reported for the AIS and ISI, whereas the LSEQ and SLEEP-50 appeared to be less useful for epidemiological and research settings due to the length of the questionnaires and their scoring. 
The MSQ and JSS seemed to be inexpensive and easy to administer, complete, and score, but further validation studies are needed. Finally, the ESS had good internal consistency and construct validity, while the main challenges were in its factorial structure, known-group difference and estimation of reliable cut-offs. Overall, the self-report questionnaires assessing sleep quality from different perspectives have good psychometric properties, with high internal consistency and test-retest reliability, as well as convergent/divergent validity with sleep, psychological, and socio-demographic variables. However, a clear definition of the factor model underlying the tools is recommended and reliable cut-off values should be indicated in order for clinicians to discriminate poor and good sleepers.


Introduction
The term sleep quality is commonly used in sleep medicine and can refer to a collection of sleep measures including Total Sleep Time (TST), Sleep Onset Latency (SOL), sleep maintenance, Total Wake Time (TWT), Sleep Efficiency (SE), and sometimes sleep disruptive events such as spontaneous arousal or apnea [1]. Moreover, sleep quality appears to be orthogonal to the term sleep quantity. For example, the presence of sleep complaints has been reported even when SOL, Wakefulness After Sleep Onset (WASO), TST and awakening were similar to those reported in normal non-complaining individuals [2]. Complaints of disturbed (or poor quality) sleep have been confirmed in almost every country [3] and among patients in all specialties of medicine [4][5][6][7][8][9][10][11][12]. Untreated sleep disorders may lead to potentially life-threatening symptoms, considering that sleep disorders are not only a consequence of medical illness but are also primary drivers of other illnesses [13]. It is now recognized that sleep disturbances are associated with neurocognitive dysfunctions, attention deficits, impaired cognitive performance, depression, anxiety, stress, and poor impulse controls [11]. Poor sleep can severely affect daytime performance, both socially and at work, and increases the risk of occupational and automobile accidents, poor quality of life and poor overall health [14]. Thus, the assessment of sleep quality appears to be relevant for epidemiological and clinical studies.
Sleep quality can be assessed using both objective and subjective methods. Objective methods such as polysomnography (PSG) and actigraphy demonstrate high reliability in obtaining information on sleep parameters [1]. However, these objective methods, such as PSG (see also Multiple Sleep Latency Test or MSLT for the assessment of daytime sleepiness), are not readily available to most clinicians in their daily routine, and are expensive and time-consuming [15]. Even if the actigraph has several advantages (e.g., it is not costly), the recorded activity is only a proxy for sleep and is not sleep itself. Moreover, there are a variety of devices and scoring algorithms available that limit the comparability between different actigraphic devices [1].
Among the subjective methods, the sleep diary is the most widely-used assessment [16]. The sleep diary requires the client to record daily morning estimates for the parameters of their sleep pattern, and, as such, yields information concerning a number of relevant metrics such as SOL, WASO, TST, total time spent in bed (TIB), SE, and satisfaction as a subjective global appraisal of each night's sleep [16]. However, it is clear that its successful use relies heavily on daily (prospective) recordings as soon as individuals wake up in the morning, a task that may be difficult for older people to remember to do consistently, limiting the utility of the sleep diary for screening or epidemiological studies. In contrast, retrospective self-report measures, such as questionnaires, can be widely used in both routine care and clinical trials considering that they have several advantages including their low cost, and their potential to be administered to several types of populations over the Internet [17], as these measures are self-explanatory and do not require supervision. In addition, self-rating questionnaires have the advantages of high patient compliance, ease of administration, and reduced demand on medical specialists' time.
Given the important diagnostic role of rating scale questionnaires, it is beyond doubt that the psychometric properties of these tools need to be established. Specifically, in the present review, we consider dimensionality, reliability, and construct validity [18][19][20]. Dimensionality is generally evaluated by factor analysis (e.g., exploratory factor analysis, or EFA, and confirmatory factor analysis, or CFA), which attempts to discover patterns in a set of variables based on shared variance. In particular, this analysis tries to identify the simplest and most parsimonious means with which to interpret and represent the observed data, in order to infer the smallest number of unobserved or latent variables that can still account for the observable variables. Indeed, EFA is used to find the smallest number of common latent factors, accounting for the correlations [20], while CFA is further used to test the relationship between the observed data and the hypothesized latent factors [20]. Reliability reflects the extent to which the measure is reliable, that is, free from errors in the scores that are not due to the true state of the construct measured. It can be expressed as internal consistency (Cronbach's alpha or α), test-retest (e.g., intraclass correlation coefficient or ICC), inter-rater (degree of agreement between the scores given by different raters for the same respondent) or intra-rater (degree of agreement between scores given by the same respondent or rater at one time and those given at another time) reliability. Cronbach's α ranges from 0 to 1 and a higher score indicates greater internal consistency [18]. Test-retest reliability calculates the absolute changes in a measure assessed independently on two distinct occasions [19]. The ICC is the preferred method to assess test-retest reliability and it is a measure of the agreement between two (or more) raters or evaluation methods in the same set of participants. 
The ICC ranges from 0 to 1 and higher values represent greater agreement between raters or evaluations [19]. Construct validity indicates the degree to which the measure scores reflect the hypotheses, and includes convergent validity (the degree of relatedness between two or more constructs hypothesized to be related), divergent validity (the degree of relatedness between two or more constructs hypothesized to be different) and known-group validity (the ability of the measure to discriminate between a group of individuals known to have a particular trait and those who do not have that trait). Finally, when available, the sensitivity to change is an important piece of information related to how well the sleep questionnaire is able to detect the improvement or the effect of a specific therapy on sleep disorders.
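As an illustration of the internal consistency index described above, the following is a minimal sketch of Cronbach's α computed from a respondents-by-items score table. The function name `cronbach_alpha` and the example data are hypothetical and are not taken from any of the reviewed studies; the sketch uses the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores) with sample variances.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances (n - 1 denominator) for both items and totals.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent must answer all k items")
    # Transpose so that each tuple holds all responses to one item.
    items = list(zip(*scores))
    item_variance_sum = sum(variance(column) for column in items)
    # Variance of each respondent's total (summed) score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical data: four respondents answering three items rated 0-3.
example = [[2, 3, 3], [1, 1, 2], [3, 3, 2], [0, 1, 1]]
alpha = cronbach_alpha(example)
```

When all items rise and fall together across respondents, the total-score variance dominates the summed item variances and α approaches 1; weakly related items pull it towards 0.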
The importance of reviewing the reliability, validity, and dimensionality of questionnaires assessing sleep quality for research, epidemiological, and clinical studies is shown by the strong relationship between reliability (i.e., whether the items of a scale measure the same construct) and validity (i.e., whether or not a scale measures what it is intended to measure). Although reliability is important for a study, internal consistency is not sufficient if it is not combined with validity. Thus, for a test to be useful, it needs to be valid as well as reliable. The capacity of a sleep questionnaire to screen for poor and good sleepers is related to its construct validity. The Receiver Operating Characteristic (ROC) curve analysis is usually used to determine a cut-off value. In the ROC curve analysis, the sensitivity (i.e., the probability of a positive screen result among poor sleepers, that is, the proportion of individuals reporting poor sleep quality who are accurately classified) and specificity (i.e., the probability of a negative screen result among good sleepers, that is, the proportion of individuals with good sleep quality who are accurately classified) are plotted against each other across the candidate cut-off values (conventionally, sensitivity against 1 − specificity). The Area Under a ROC Curve (AUC) provides a single measure of the discriminatory power of a screening test in separating poor and good sleepers, summarized across all possible thresholds. The AUC ranges from 0 to 1: a value of 0.5 indicates discrimination no better than chance, while a value very close to 1 indicates a very efficient tool.
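The quantities used in a ROC curve analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any reviewed study: the function names `sensitivity_specificity` and `auc` are hypothetical, a screen is assumed to be positive when the score is strictly greater than the cut-off, and the AUC is computed through its rank-based interpretation (the probability that a randomly chosen poor sleeper scores higher than a randomly chosen good sleeper, with ties counted as one half).

```python
def sensitivity_specificity(scores, is_poor, cutoff):
    """Sensitivity and specificity of the rule 'score > cutoff means poor sleeper'."""
    pairs = list(zip(scores, is_poor))
    true_pos = sum(1 for s, poor in pairs if poor and s > cutoff)
    false_neg = sum(1 for s, poor in pairs if poor and s <= cutoff)
    true_neg = sum(1 for s, poor in pairs if not poor and s <= cutoff)
    false_pos = sum(1 for s, poor in pairs if not poor and s > cutoff)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


def auc(scores, is_poor):
    """Area under the ROC curve via pairwise comparison of poor vs. good sleepers."""
    poor = [s for s, p in zip(scores, is_poor) if p]
    good = [s for s, p in zip(scores, is_poor) if not p]
    # Each poor/good pair contributes 1 if ranked correctly, 0.5 if tied.
    wins = sum((a > b) + 0.5 * (a == b) for a in poor for b in good)
    return wins / (len(poor) * len(good))


# Hypothetical questionnaire totals and reference classification.
scores = [3, 7, 6, 8, 5, 12]
is_poor = [False, False, False, True, True, True]
sens, spec = sensitivity_specificity(scores, is_poor, cutoff=5)
area = auc(scores, is_poor)
```

Repeating `sensitivity_specificity` over every candidate cut-off traces the ROC curve; the cut-off that best balances the two values for the population at hand is the one usually reported.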
Moreover, the dimensionality of the questionnaires reflects whether the items are all correlated and representative of factors. Indeed, the consequence of a questionnaire being heterogeneous or multidimensional may be a possible attenuation of its practical application in clinical diagnostics. The dimensionality of a questionnaire directly influences the reporting of its intended measures. For example, if a questionnaire is supposed to be described with one single factor, suggesting the practical usefulness of its total score for screening individuals, but the factor analysis shows that a 2-or 3-factor model obtains a better fit with data, then the diagnostic use of the total score is in question. In the present review, we decided to report information, for each article included, regarding these three psychometric features: dimensionality, reliability, and validity.
As reported by Buysse and colleagues [21], sleep quality represents a complex construct that is difficult to define. In line with the clinical sleep dysfunction evaluation [11], the main complaints of a patient are the inability to obtain an adequate nighttime sleep even when there is the opportunity for sleep (i.e., insomnia disorder), negative daytime consequences due to poor sleep (e.g., daytime sleepiness), episodic nocturnal movements or behaviors (e.g., snoring, kicking of legs, bruxism, sleep walking, or talking), or a combination of these concerns. Thus, the self-reported questionnaires assessing poor sleep may incorporate all items (or a combination of them) of the aforementioned aspects, or may be selective for the assessment of specific sleep problems (e.g., insomnia or daytime sleepiness).
In line with these assumptions, to the best of our knowledge, no reviews have been published that concurrently consider dimensionality, reliability, and validity of the tools assessing subjective sleep quality. At the end of 2007, Martoni and Biagi [22] published a review reporting 26 possible sleep quality questionnaires (see Table 2, p. 323) [22], while the majority of the published reviews focus on a few tools (e.g., [4][5][7][11][12]), limiting any comparisons. For example, Wells et al. [4] considered four questionnaires, Hoey et al. [6] took into account three subjective sleep measurement scales, while Mollayeva et al. [11] focused on one tool only. However, the review by Martoni and Biagi [22], while more comprehensive, was published in Italian and an update of this review is needed. For this reason, we decided to review the psychometric properties and the dimensionality of all sleep questionnaires reported by Martoni and Biagi [22] in studies published within the temporal range of 2008 to 2020, in adults, and in clinical and non-clinical populations.

Identification of Eligible Studies
PubMed and SCOPUS were searched from 1 January 2008 to 30 June 2020 for each questionnaire presented in [22]. Filters were applied to the search, limiting the selection to those studies involving humans with age > 18 years and published in English. The reference lists of the papers located were also scanned for further papers, and an additional search was undertaken to discover any other papers related to the aim of the present review.
The following descriptors and Medical Subject Headings (MeSH) terms were used as search terms in the databases: "the extended name of the questionnaire" (e.g., Mini-Sleep Questionnaire) OR "acronym form" (e.g., MSQ) AND "reliability" OR "reliable" OR "test-retest" OR "validity" OR "validation" OR "valid". This procedure was adopted for all questionnaires reported in Table 2 (p. 323) of [22]. A total of 5734 articles were initially identified (Figure 1). Through the process of article screening, 49 articles, referring to eight questionnaires for the assessment of sleep quality, were included in the final analysis. These articles respected our inclusion criteria: (1) the study population was composed of adults with age > 18 years; (2) the articles were published in English in the temporal range of 2008 to 2020; (3) the study was original research reporting reliability (Cronbach's alpha and/or test-retest and/or split-half reliability), validity (convergent/divergent correlations and/or known-group differences), and dimensionality of a specific sleep tool.
Data extraction: Potentially relevant papers were selected by (1) screening the titles, (2) screening the abstracts, and (3) retrieving and screening the full article to determine whether it met the inclusion criteria when the abstract did not provide sufficient data or was not available. The literature screenings were performed by two authors (F. M. and M. D.) independently using a pre-defined study extraction form and the results were compared. When a disagreement occurred, the article was evaluated by a third author (M. M.) blinded to the issue of the disagreement. The publication data included study characteristics: authors, tool name and its acronym, publication year, population type, sample size, number of females, and mean age in years of the general sample or of specific groups involved in the study. The definition of the construct, the structure (items, response format, etc.) of the questionnaire, the temporal period assessed, and any translated versions were also recorded. The reliability coefficients, test-retest values, construct validity (convergent and divergent correlations with other measures, as well as known-group differences), the ROC curve analysis and, where available, cut-off values were then extracted. Finally, summaries of the exploratory, confirmatory, or principal component analysis (PCA, that is, the extraction of linear composites of observed variables) were indicated. A descriptive analysis of the articles was performed for the measures extracted.
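The search strategy described above can be sketched as a simple query builder. This is only an illustration: the helper `build_query` is hypothetical, the explicit parentheses reflect one plausible reading of the Boolean grouping (the text lists the operators without brackets), and the exact syntax accepted by PubMed or Scopus may differ.

```python
# Psychometric terms listed in the search strategy.
PSYCHOMETRIC_TERMS = [
    "reliability",
    "reliable",
    "test-retest",
    "validity",
    "validation",
    "valid",
]


def build_query(full_name, acronym, terms=PSYCHOMETRIC_TERMS):
    """Combine questionnaire names and psychometric terms into one Boolean string.

    Assumed grouping: (name OR acronym) AND (term1 OR term2 OR ...).
    """
    names = " OR ".join(f'"{name}"' for name in (full_name, acronym))
    properties = " OR ".join(f'"{term}"' for term in terms)
    return f"({names}) AND ({properties})"


query = build_query("Mini-Sleep Questionnaire", "MSQ")
```

Generating one such string per questionnaire keeps the search terms identical across tools, which is the point of applying the same procedure to every questionnaire in the list.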

The Most Commonly Used Tool: PSQI
The Pittsburgh Sleep Quality Index (PSQI; [21]) is the most commonly used measure of subjective self-report sleep quality for two main reasons. It was not only developed to quantify sleep quality [21] but also, in the majority of studies that validate a sleep questionnaire, the PSQI has been used as the criterion for convergent validity, suggesting that the PSQI can be considered as an accepted reference or gold standard for self-perceived sleep quality. In addition, it is the most widely used sleep health assessment tool in both clinical and non-clinical populations [11]. In the present review, it was the questionnaire with the highest number of studies investigating its psychometric properties, beyond factor structure. The PSQI consists of 24 questions or items to be rated, relating to the past month (0-3 for 20 items while 4 items are open-ended), 19 of which are self-reported and 5 of which require secondary feedback from a room or bed partner. Only the 19 self-reported items (15 rated 0-3 and 4 open-ended) are used for the evaluation of sleep quality as perceived by the individuals. The open-ended items are also scored as categorical values (rated 0-3) as per the range of values reported by the patients. These 19 self-reported items are then used to generate scores, which range from 0 (no difficulty) to 3 (severe difficulty), representing the PSQI's seven components: sleep quality (C1), sleep latency (C2), sleep duration (C3), habitual sleep efficiency (C4), sleep disturbance (C5), use of sleep medications (C6), and daytime disturbance (C7). The scores for each component are summed to get a total score, also termed the global score (range 0-21), providing an efficient summary of the respondent's sleep experience and quality. Panayides et al. [23] not only replaced the original 4-point Likert scale with a 3-point Likert scale that is more appropriate for a non-clinical sample, but also proposed a 16-item version following two calibrations using the Rasch model [24]. 
In contrast, Chien et al. [25] proposed a revised PSQI: short form Chinese version or SC_PSQI with nine items (sleep latency, habitual sleep efficiency, sleep disturbances, sleep interruptions, use of sleeping medication, daytime dysfunction, days of insomnia, fatigue upon awakening in bed, and earlier awakening). In addition, scoring of answers was changed from 0-3 to 0-2, and the score total amounted to 18.
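The global score arithmetic described for the original PSQI can be sketched as follows. The code covers only the final summation of the seven component scores (C1-C7, each 0-3, total 0-21); it does not derive the components from the 19 items. The function names are hypothetical, and the default rule of classifying a global score greater than 5 as a poor sleeper is an assumption based on the cut-off proposed in the original paper [21], which the reviewed studies did not systematically confirm.

```python
PSQI_COMPONENTS = ("C1", "C2", "C3", "C4", "C5", "C6", "C7")


def psqi_global_score(components):
    """Sum the seven PSQI component scores (each 0-3) into the global score (0-21)."""
    missing = [name for name in PSQI_COMPONENTS if name not in components]
    if missing:
        raise ValueError(f"missing components: {missing}")
    for name in PSQI_COMPONENTS:
        value = components[name]
        if value not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be an integer from 0 to 3, got {value!r}")
    return sum(components[name] for name in PSQI_COMPONENTS)


def is_poor_sleeper(global_score, cutoff=5):
    """Assumed rule: a global score strictly above the cut-off flags poor sleep."""
    return global_score > cutoff


# Hypothetical respondent.
respondent = {"C1": 2, "C2": 1, "C3": 1, "C4": 0, "C5": 1, "C6": 0, "C7": 2}
total = psqi_global_score(respondent)  # 2 + 1 + 1 + 0 + 1 + 0 + 2 = 7
flagged = is_poor_sleeper(total)
```

Keeping the cut-off as a parameter makes it easy to apply population-specific values instead of the default.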
Among different sample types (from non-clinical individuals to different medical populations), with a wide range of sample sizes (from 50 to 3742), comprising a wide age range (18-80+) and different language versions (English, Chinese, Greek, Korean, Italian, Spanish, Sinhala, European Portuguese, Malay, Kurdish, and Arabic), the most interesting result is related to the factor structure underlying the PSQI, using different factorial analyses (Table 1). In the present review, six papers [23][25][26][27][28][29] reported unidimensionality, six studies indicated a two-factor model [30][31][32][33][34][35] and two investigations found a three-factor model [36,37]. The remaining articles in Table 1 did not show a unique factor model [38][39][40][41]. In the two-factor solution, C2, C3, and C4 loaded on one factor (i.e., a sort of sleep efficiency factor) whereas C5 and C7 loaded on the other factor (i.e., a version of the daytime disturbance factor). C1 was often an added component of the factor containing C2-C4, while C6 either added to the factor comprising C5 and C7 or was deleted because of a low (<0.40) loading value [30][31][32][33][34]. The only exception was the study conducted in cancer patients [35], with a factor labelled Perceived Sleep Quality comprising C1, C2, C5, C6, and C7, and the other factor labelled Sleep Efficiency comprising C3 and C4. The inter-factor correlation was on average 0.476. By contrast, the three-factor solution indicated that the Sleep Efficiency factor included C3 and C4, the Perceived Sleep Quality factor included C1, C2 and C6, and the Daily Disturbance factor included C5 and C7, with correlations between the first and second factors (mean 0.465), between the second and third factors (mean 0.58), and between the first and third factors (mean 0.415) [26,33,36,39]. Alternative models included the same factors with the exclusion of C6. Importantly, a single study reported a different three-factor structure for male and female adults [41]. 
Specifically, for men F1 was determined by C1, C3, and C4, F2 by C5 and C7, and F3 included C2 and C6; for women F1 was determined by C1, C5, and C7, F2 included C3 and C4, and F3 was loaded by C2 and C6. This result could indicate the presence of gender differences in sleep quality.
As shown in Table 1, the Cronbach's alpha was on average equal to 0.76 [42]. It should be noted that the Cronbach's alpha increased, on average, by 2 points, excluding C6 [30,39,40], supporting the fact that this component is problematic when defining the global score. The reliability of the PSQI was also shown by all corrected item-to-total or component-to-total correlations, which ranged on average between 0.31 and 0.66, that is, from moderate to high correlations. Only in [38] were the corrected component-to-total correlations low (0.10-0.40). Only six studies tested the reliability of the tool over time, suggesting that the PSQI was reliable over different periods (from 2 weeks to 14-16 weeks between administrations). The test-retest correlations were high (mean correlation = 0.64) and no difference between administrations was reflected in the mean value scores. The PSQI global score correlated with other sleep measures, such as the ISI, Ford Insomnia Response to Stress Test, and Glasgow Sleep Effort Scale [29,31,38], but not with the ESS or the Snore, Tired, Observed, Pressure, Body mass index, Age, Neck, Gender (STOP-BANG) questionnaire [29,31]. The PSQI global score correlated with different tools measuring well-being from different perspectives (e.g., Beck Depression Inventory or Health Survey Short Form 36) with correlation coefficients ranging from −0.40 to 0.72 [30,33,34,37,38]. The correlations between the PSQI score and objective measures of sleep appeared to be more problematic, with a small number of exceptions such as the correlation between global score and Stage 2 latency (r = 0.294), Slow Wave Sleep latency (r = 0.524), Stage 1% (r = 0.327) and Stage 2% (r = −0.349) obtained by PSG [36], between sleep latency measured by both PSG and PSQI (r = 0.225) or between sleep efficiency measured by both PSG and PSQI (r = −0.331) [32]. In the original paper [21] a cut-off score of 5 was proposed to distinguish between poor and good quality sleepers. 
This cut-off was used in four papers, supporting known-group validity [31,34,38,39]. Using ROC curve analysis, the reviewed papers did not systematically confirm this cut-off value [25,36,40]; instead, cut-offs greater than 5, such as 6, 7, 8.5 or 11, were more useful in balancing sensitivity and specificity. Most probably, this differentiation reflects the different populations taken into account and the clinical use of the global PSQI score. The cut-off of 11 was a very severe cut-off used for detecting insomnia patients [25], even if a value equal to 8.5 was a sufficient cut-off for detecting the severity of symptoms in a sample of insomnia patients (211 out of 261) [32]. Depending on the specific population (university students [40] or sleep-disorder adult patients [29]), a cut-off of 6 or 7 seemed to be able to distinguish insomniacs. Even if further research is needed to clarify the application and the use of specific cut-off points, Table 1 shows that the PSQI had a good construct validity as demonstrated by known-group differences on the basis of proposed cut-off values or other sleep disorder assessments. It is worth noting that the group comparisons were performed according to different cut-off parameters, such as depression level, suggesting the association between poor sleep and different psychological or medical variables [23,26,28,30,31,33,35,36]. Finally, four studies performed regression analyses in order to detect which variable(s) could predict poor sleep; depression, anxiety, and stress were able to predict poor sleep quality [34,38]. As regards gender, females appeared to report a higher PSQI score (and also higher C1 and C5 component scores; [39,41]) even if this result was not confirmed in other studies [27,35]. Only [27] reported the role of age and literacy in predicting sleep quality, but these results need further research.

The Sleep-Disorder Scales
Sleep disorders are among the most prevalent complaints in primary medical care and in the general population [4][5][6][7][8][9][10][11][12]. Epidemiological data indicate that insomnia is the most frequent sleep complaint [3]. Insomnia disorder is characterized by difficulty falling asleep, difficulty staying asleep, early morning awakening, and clinical distress or impairments of daily activities [3,[13][14][15]. Sleep disturbance and associated daytime symptoms occur at a frequency of three nights or more per week for at least three months. In addition, sleep disorders compromise the sleep-wake cycle and they can affect sleep (hyposomnia) and/or wake (hypersomnia). In this section, we decided to group together all scales evaluating more specific alterations of sleep, such as insomnia or different sleep disorders, and complaints of the sleep-wake cycle.
The Athens Insomnia Scale (AIS) [43] is a self-reported questionnaire designed to measure the severity of insomnia based on the diagnostic criteria of the International Classification of Diseases, 10th revision (ICD-10) [44]. There are two versions of the scale: the AIS-8 and the AIS-5. For the eight-item scale, the first five items (assessing difficulty with sleep induction, awakening during the night, early morning awakening, total sleep time, and overall quality of sleep) correspond to criterion A ("complaint of difficulty falling asleep, maintaining sleep or non-refreshing sleep") for the diagnosis of insomnia according to ICD-10 [44], while the last three items pertain to the consequences of insomnia the next day (problems with sense of well-being, functioning, and sleepiness during the day) according to criterion C of ICD-10 ("the sleep disturbance results in marked personal distress or interference with personal functioning in daily living"). The brief 5-item version is made up of the first five items. In both versions, participants are asked to score each item from 0 (no problem at all) to 3 (very serious problem) if they have experienced any difficulty sleeping at least three times a week during the past month. The total score of the AIS-8 ranges from 0 to 24 while that of the AIS-5 ranges from 0 to 15. Among the selected papers on the AIS, many of the studies were performed in Asian countries, without modification of the original version. 
As regards the factor structure of the AIS-8 (Table 2), three studies provided support for unidimensionality [45][46][47] (a mean variance of 67.58% explained; an average range of factor loadings of between 0.54 and 0.85), while three other studies reported a better fit with the two-factor model [48][49][50] with a Nocturnal factor (items 1-5; mean factor loadings ranging between 0.55 and 0.87) and a Daytime Dysfunction factor (items 6-8; mean factor loadings ranging between 0.54 and 0.93), with a mean inter-factor correlation equal to 0.70. Two studies using the AIS-5 found a 1-factor model [45,49] with an average of 57.16% of variance explained, confirming the latent presence of the Nocturnal factor in the full AIS version. The sample size and the type of population assessed (generally trauma patients in studies assessing the two-factor model) may be responsible for these divergent results in the factor analysis. The mean reliability of AIS-8 was 0.86 while that of AIS-5 was 0.84 (Cronbach's alphas for the two supposed factors were on average higher than 0.80 for the Nocturnal factor and above 0.70 for the Daytime Dysfunction factor). A good internal homogeneity was also demonstrated (mean item-total correlation range 0.56-0.80). With different temporal intervals (from 1 week to about 3 months), the mean ICC of the AIS-8 was 0.78 and that of the AIS-5 was 0.68, suggesting a good test-retest reliability. Finally, both versions of AIS showed convergent/divergent validity (many correlations of a moderate level > 0.30; from −0.53 to +0.85 in range) with different sleep scales, such as PSQI and ISI, and with different psychological variables, such as anxiety or depression [45][46][47][48][49][50], but not with Alcohol Use Disorders Identification Test (AUDIT) or socioeconomic status [46]. 
The validity of the AIS was also confirmed by known-group differences between patients (psychiatric, insomnia or cancer patients taking opioids) and control or non-insomnia groups (Table 2). Age (older adults) and gender (women) differences were found in the total AIS score as well as for both factors [50]. When a specific cut-off score was proposed, we observed that for AIS-8 values in the range between 5 and 9 [46,48,49] could discriminate between insomnia and non-insomnia groups (Table 2) with a mean sensitivity equal to 80% and a mean specificity equal to 82%, in line with the proposed cut-offs in the original study [44,51]. The different cut-off values reflected the insomnia patients involved in the study and the severity of their insomnia symptoms [46,48,49]. It is worth noting that Enomoto et al. [49] reported a cut-off of 4 for AIS-5.
Similar to the AIS, the Insomnia Severity Index (ISI) measures perceived insomnia severity, focusing on the level of disturbance to the sleep pattern, consequences of insomnia, and the degree of concern and distress related to the sleep problem [52]. Its content corresponds in part to the diagnostic criteria of insomnia outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [53]. The ISI comprises seven items that assess the severity of sleep-onset and sleep maintenance difficulties (both nocturnal and early morning awakening), satisfaction with current sleep pattern, interference with daily functioning, noticeability of impairment attributed to the sleep problem and degree of distress or concern caused by the sleep problem. Each of these items is rated on a five-point Likert scale (0 = not at all; 4 = extremely) and the time interval is "in the past 2 weeks". Total scores range from 0 to 28 with high scores indicating greater insomnia severity. This tool is available in three different versions: patient (self-administered), significant other, and clinician. All included papers referred to the patient's version. In the original validation study, different categories were provided: 0-7, no significant insomnia; 8-14, subthreshold insomnia; 15-21, moderate insomnia; and 22-28, severe insomnia [52]. Concerning the factor structure, we found four studies proposing a 1-factor model [47][54][55][56][57] (mean 62.03% of total variance and mean factor loadings ranging from 0.47 to 0.83). It should be noted that Dragioti et al. [54] proposed a four-item version (items 2, 4, 5, 7), especially for patients with chronic pain. However, three studies reported the 2-factor solution [58][59][60] (Severity of sleeping difficulties with items 1-4 or, alternatively, items 1a, 1b, 1c, and 2; Impact of insomnia with items 5-7 or, alternatively, items 3-5). 
This solution generally explained 61.80% of variance with mean factor loadings ranging from 0.50 to 0.90 for both factors and mean inter-factor correlation of 0.50 (Table 2). In medical patients, Dieperink et al. [61] considered the 3-factor model, in which Severity of Nighttime Sleep Difficulties (items 1-3 with factor loadings > 0.59), Impact of Insomnia (items 5-7 with factor loadings > 0.72), and Sleep Dissatisfaction/Satisfaction (items 1, 4, 7 with factor loadings > 0.36) correlated with each other with values greater than 0.80. This solution was confirmed in other studies [56,62,63] (Table 2), even if Dieck et al. [56] reported a different composition of factors (F1: items 2, 4, 7; F2: item 1; F3: items 5, 7) and a single correlation between F1 and F3 (0.794). The different sample sizes and the specific characteristics of recruited participants might explain these different results regarding the latent structure of the ISI. The reliability of the ISI was good, with a mean Cronbach's alpha of 0.82 and mean corrected item-to-total correlations ranging from 0.47 to 0.66. Importantly, the test-retest reliability was significant in clinical and non-clinical populations [56]. In general, test-retest reliability after 2 weeks was satisfactory (mean ICC = 0.82) [47,56,61] and the ordinal alpha remained above the critical value of 0.70 after a CBT-I treatment [63]. The ISI exhibited significant correlations with several sleep questionnaires such as the AIS and PSQI (but low correlation coefficients with the ESS) and with different psychological, health, and psychopathological questionnaires. The range of all correlations was between −0.58 and 0.79 (Table 2). Sadeghniiat-Haghighi et al. [59] reported a specific correlation pattern with PSG variables. 
Indeed, the first three items were associated with PSG variables to a greater extent than the total ISI score was, which correlated only with WASO and SE, even if the correlation coefficients for the first three items were small (<±0.30). In addition, Castronovo et al. [63] reported how the first three items of the ISI were associated with quantitative estimates of sleep parameters ( Table 2) obtained from the sleep diaries with moderate correlations, supporting the premise that the first three items have a diagnostic role. As regards validity, the studies demonstrated known-group differences on the basis of different criteria, such as PSQI or depression [47,58,59,61]. In one study [55] women had a higher ISI score than men, and in another study [54] sex was a predictive factor of ISI score, but this gender effect was not systematically confirmed [54,55,61]. Importantly, Castronovo et al. [63] found that ISI was sensitive to change after a specific CBT-I treatment, with a reduction of the higher scores in each item. In our selected papers, only three studies performed a ROC curve analysis; these reported that ISI cut-off was in the range between 9 and 11 with a mean sensitivity of 86% and mean specificity of 80% [47,56,57], depending on the population considered and PSQI cut-off used [47,55,57,59,62]. Taking into account that one study reported an agreement of about 85% between ISI ≥ 8 and PSQI > 5 [62] in the detection of "poor sleepers", the cut-off values proposed in reviewed articles were in the subthreshold insomnia categories within the 8-14 range [52].
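The ISI scoring rule described above (seven items rated 0-4, a 0-28 total, and the four severity categories of the original validation study [52]) can be illustrated with a minimal sketch. The function names and the example responses are hypothetical and are not taken from any of the reviewed studies.

```python
from typing import Sequence

# Severity bands reported for the original validation study [52].
ISI_BANDS = [
    (0, 7, "no significant insomnia"),
    (8, 14, "subthreshold insomnia"),
    (15, 21, "moderate insomnia"),
    (22, 28, "severe insomnia"),
]


def isi_total(items: Sequence[int]) -> int:
    """Sum the seven ISI item ratings (each 0-4) into a 0-28 total."""
    if len(items) != 7:
        raise ValueError("the ISI has seven items")
    if any(not 0 <= x <= 4 for x in items):
        raise ValueError("each ISI item is rated from 0 to 4")
    return sum(items)


def isi_category(total: int) -> str:
    """Map an ISI total score onto its severity category."""
    for low, high, label in ISI_BANDS:
        if low <= total <= high:
            return label
    raise ValueError("ISI total must be between 0 and 28")


# Hypothetical respondent, for illustration only.
responses = [2, 1, 2, 3, 2, 1, 2]
total = isi_total(responses)
print(total, isi_category(total))
```

Note that the ROC-derived cut-offs of 9 to 11 reported above all fall inside the 8-14 band, so a category label alone does not say on which side of a study-specific cut-off a respondent lies.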
The Mini Sleep Questionnaire (MSQ) [64] is a short questionnaire that can be used to screen sleep disorders in the population and considers complaints regarding both sleep and wake at the same time. The original version was composed of seven items that evaluate symptoms of hypersomnia, and one item on sleep maintenance. Subsequently, three items regarding symptoms of insomnia were added. Thus, the final 10-item version assesses both insomnia and excessive daytime sleepiness. Each item is scored on a seven-point Likert scale ranging from 1 (never) to 7 (always), and takes into account the past seven days. The total sum of scores is divided into four levels of sleep difficulties: 10-24, good sleep quality; 25-27, mild sleep difficulties; 28-30, moderate sleep difficulties; ≥31, severe sleep difficulties [65]. The total score offers an estimate of sleep quality, with higher scores reflecting more serious sleeping problems. However, Natale et al. [66] found two factors explaining about 50% of total variance with loading values higher than 0.50, with the exclusion of item 6 (snoring) (Table 2). The authors labeled these the Wake (items 4, 5, 8, 9) and Sleep (items 1, 2, 3, 7, 10) dimensions. Thus, the MSQ could be considered a good tool for screening sleep disorders in the population because it consists of two subscales that investigate sleep quality and daytime sleepiness [66]. By contrast, Kim [67] assessed the psychometric properties of the MSQ-Insomnia, which is composed of four items (difficulty falling asleep, awakening early in the morning and unable to sleep again, taking sleeping pills and tranquilizers, and waking up during the night) with factor loading values higher than 0.50, with the exception of item 3 in the single factor. As reported in Table 2, Natale et al.
[66] reported higher Cronbach's alphas for both factors, with a good internal homogeneity (0.44 for wake factor and 0.37 for sleep factor) while Kim [67] reported a Cronbach's alpha of 0.69 with an increase of alpha if item 3 was deleted (0.73). The item-total correlation was ≥0.30. Furthermore, the Korean version of MSQ for insomnia subscale [67] correlated with both PSQI (0.22-0.71 range) and Perceived Stress Scale or PSS (0.11-0.31 range), while Natale et al. [66] reported that healthy participants obtained lower scores in the wake dimension of the MSQ in comparison to participants, a result that was compatible with excessive daytime sleepiness; they also found that healthy participants obtained lower scores in the sleep dimension of the MSQ in comparison to participants, compatible with impaired sleep quality (Table 2). Finally, Natale et al. [66] indicated that Wake > 14 and Sleep > 16 were optimal values for detecting hypersomnia and insomnia problems, respectively. Kim [67] evaluated the predictive validity of the MSQ-Insomnia for poor sleepers determined by the diagnostic cut-off on the Korean PSQI score (>8.5 points, [32]), and concluded that it gave a good level of predictive validity.
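As an illustration of the MSQ scoring described above, the following sketch computes the total score and its level [65] together with the Wake and Sleep subscale sums and the cut-offs indicated by Natale et al. [66]. The function names, the returned dictionary keys, and the example responses are hypothetical; item 6 (snoring) is left out of both subscales, as in the 2-factor solution.

```python
from typing import Dict, Sequence

# Item numbers (1-based) of the two dimensions reported by Natale et al. [66].
WAKE_ITEMS = (4, 5, 8, 9)
SLEEP_ITEMS = (1, 2, 3, 7, 10)


def msq_level(total: int) -> str:
    """Map the MSQ total (10-70) onto the four levels of sleep difficulty [65]."""
    if not 10 <= total <= 70:
        raise ValueError("MSQ total must be between 10 and 70")
    if total <= 24:
        return "good sleep quality"
    if total <= 27:
        return "mild sleep difficulties"
    if total <= 30:
        return "moderate sleep difficulties"
    return "severe sleep difficulties"


def score_msq(items: Sequence[int]) -> Dict[str, object]:
    """Score ten MSQ items, each rated from 1 (never) to 7 (always)."""
    if len(items) != 10 or any(not 1 <= x <= 7 for x in items):
        raise ValueError("expected ten ratings between 1 and 7")
    total = sum(items)
    wake = sum(items[i - 1] for i in WAKE_ITEMS)
    sleep = sum(items[i - 1] for i in SLEEP_ITEMS)
    return {
        "total": total,
        "level": msq_level(total),
        "wake": wake,
        "sleep": sleep,
        "wake_above_cutoff": wake > 14,    # Wake > 14 [66]
        "sleep_above_cutoff": sleep > 16,  # Sleep > 16 [66]
    }


# Hypothetical respondent, for illustration only.
print(score_msq([3, 2, 4, 5, 4, 1, 3, 4, 3, 2]))
```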
Finally, in this section we included three tools used for the diagnosis of sleep disorders including insomnia. The first questionnaire was the Jenkins Sleep Scale (JSS) [68]. The JSS is an efficient instrument for the evaluation of the most common symptoms in the general population [68]. JSS is a simple, self-reported, and non-time-consuming scale to be used in daily practice, clinical research, and epidemiologic studies. The questionnaire consists of four items that assess sleep problems over the preceding 4 weeks, with questions regarding trouble falling asleep, trouble staying asleep, frequent awakenings during the night, and subjective feelings of fatigue and sleepiness despite having had a typical night's rest [69]. The respondents answer the questions using a 6-point Likert-type scale from 0 (not at all) to 5 (22 to 31 days). The total scores range from 0 to 20, and higher scores indicate a greater number of sleep problems [68,69]. In a large representative German sample [69], JSS-4 showed and confirmed the 1-factor solution, explaining a large variance with high factor loading values ( Table 2). The JSS-4 proved excellent reliability and it demonstrated good construct validity with regard to mental health, suggesting that sleep problems and psychological distress comprising anxiety, depression, and somatization were moderately related to each other. In addition, the JSS-4 total score was associated with sex, age, education, household income, cohabitation, and employment [69], not only with correlations but also with multivariate analysis. Interestingly, normative data of sleep problems were provided with the percentile rank of each value of the total score provided, allowing comparisons of the JSS-4 scores obtained with different groups of the general population stratified by sex and age. It is worth noting that Tibubos et al. 
[69] indicated that, in the total sample, a sum score equal to 2 corresponded to 51 in percentile rank, in line with the recommended cut-off of ≥2 to detect sleep disturbances, which corresponds to at least one troubled night per week [68]. The second self-report scale is the Leeds Sleep Evaluation Questionnaire (LSEQ) [70]. The LSEQ comprises ten self-rating 100 mm line analogue questions concerned with sleep and morning behavior and is relatively simple in its use. In the original study [70], factor analysis revealed four independent domains that pertained to sleep latency (or getting to sleep, GTS: items 1-3), quality of sleep (QOS: items 4-5), awakening from sleep (AFS: items 6-7), and behavior following wakefulness (BFW: items 8-10). On each 100 mm visual analogue item, 0 indicated the worst sleep condition and 100 suggested a normal state, and therefore lower scores of the LSEQ indicated poor sleep. In our review, we found an adapted LSEQ that had been administered to Ethiopian university students [71]. In this version (LSEQ-M), not only did the authors modify some items or word expressions (e.g., "usual" was replaced with "normal"), but also the reported score for each item was divided by 10 to determine an individual item score between 0 and 10. Such scores (between 0 and 10) for each item were added to obtain a LSEQ-M global score with a range of 0-100. Interestingly, the authors found 1-, 2-, and 4-factor models according to different criteria (i.e., eigenvalue > 1, scree plot and cumulative variance > 40). The significantly lower values of the LSEQ-M global score as well as those relating to all the items (with the exclusion of item 9) among students with a moderate level of anxiety established the diagnostic known-group validity of the tool. At the cut-off score of 52.6, the sensitivity and specificity of the LSEQ-M were 94% and 80%, respectively (Table 2) [71].
The third self-administered questionnaire is the SLEEP-50 [72], assessing a range of sleep complaints and disorders, including sleep apnea, insomnia, narcolepsy, restless leg syndrome/periodic limb movement disorder, circadian rhythm sleep disorder, sleep walking, and nightmares, in addition to factors which may disrupt sleep, and the impact of sleep complaints on daily functioning. Items are rated on a 4-point Likert scale, from 1 (not at all) to 4 (very much), and each item refers to the past 4 weeks. Items are summed to yield subscale totals and an overall total score, with higher scores indicative of poorer sleep functioning. Spoormaker et al. [72] reported that cut-off scores for each sleep subscale in conjunction with the impact subscale (i.e., greater than or equal to a score of 15) were used to establish the presence of a sleep disorder (i.e., whether symptoms reached a diagnostic threshold). Ricketts et al. [73] added psychometric properties of the SLEEP-50 in two medical conditions (Trichotillomania and Excoriation disorder) in which sleep problems may occur and influence disorder severity. As shown in Table 2, a similar 9-factor model was found for both Trichotillomania and Excoriation Disorder samples, with low factor loadings for item 35 on Factor 7 and item 39 on Factor 8. As far as internal consistency is concerned, values were similar between both groups of patients and comparable to those found in the initial investigation [72]. The internal consistency for the full SLEEP-50 scale was excellent in Trichotillomania and good in Excoriation Disorder, with moderate to strong convergent validity in the association with PSQI global score (from 0.25 to 0.75 and from 0.17 to 0.65 in Trichotillomania and Excoriation Disorder, respectively). 
The study showed that Trichotillomania and Excoriation Disorder groups exhibited sleep complaints and met a clinical threshold at higher rates (i.e., 63.6% for Trichotillomania and 66.5% for Excoriation Disorder) compared to the control group (39%), suggesting that the SLEEP-50 is a valid self-report tool which may serve to facilitate and standardize screening of multiple sleep complaints among individuals with hair-pulling and skin-picking disorders [73].

Table 2 (excerpt)
Participant groups: Based on the Structured Clinical Interview for the DSM-IV-TR and a structured clinical interview for insomnia, participants were divided into a non-insomnia group, participants with insomnia symptoms (group 1), individuals with disturbed daily functioning (group 2), and those in group 2 who had symptoms even during off-duty periods.
ISI (Spanish version): 85% of subjects classified as insomniacs by an ISI total score ≥ 8 were classified as poor sleepers by a PSQI total score > 5, and 33% of non-insomniacs were classified as poor sleepers.
SLEEP-50: Similarly, correlations were found between the SLEEP-50 overall complaints (0.44) and impact (0.60) subscales and the PSQI overall sleep quality subscales (ranging from 0.10 to 0.65).

Excessive Daytime Sleepiness
Daytime sleepiness is defined as difficulty in maintaining the alert awake state during the wake phase of the 24-h sleep-wake cycle. Daytime sleepiness is an important manifestation of sleep disorders and it impacts the patient's social life and threatens public health and safety [5]. Excessive daytime sleepiness (EDS) can be caused by disorders such as Obstructive Sleep Apnea (OSA), narcolepsy, and idiopathic hypersomnia [5,8]. Although MSLT and the maintenance of wakefulness test (MWT) [1] exist as objective measurements of daytime sleepiness, in sleep research and the clinical setting the Epworth Sleepiness Scale (ESS) is the most commonly used measure of sleepiness [74]. In the development of the ESS, the sleep disorders experienced by participants were primary snoring, OSA syndrome, narcolepsy, idiopathic hypersomnia, insomnia, and periodic limb movement disorder [74]. Thus, the ESS appears to be a convenient, standardized, and cost-effective way to measure sleepiness in patients who suffer from sleep disorders.
The ESS requires people to rate their likelihood of falling asleep in eight different situations (i.e., reading, watching TV, sitting in public, being a car passenger, resting in the afternoon, talking to someone, sitting quietly after lunch, stopping in traffic), chosen to represent different levels of somnificity that most people encounter as part of their daily lives [74]. Subjects base their ratings on the recent past, using a 4-point Likert scale (0: would never doze; 3: high chance of dozing). The term somnificity refers to a general characteristic of a posture, activity, or situation with a capacity to facilitate sleep onset in the majority of participants [74,75]. Each ESS item is scored as a number from 0 to 3 according to the individual's response, and these scores are summed to determine a total ESS score, which ranges between 0 and 24. The higher the score, the higher the person's level of daytime sleepiness. A cut-off value of 10 (ESS total score > 10) is usually considered to detect EDS [75]. In the selected papers, the major concerns regard item 8. For example, in the Japanese version of the ESS there was some misunderstanding as to whether the question referred to being in a car as the driver or as a passenger [76]. Thus, the authors tested different alternatives and replaced item 8 with the following item: "while sitting and writing by hand". The same procedure was adopted for item 1 ("while sitting and reading a book") because this item remained largely without response. Thus, it was replaced by the following item: "reading something while sitting in a chair (newspapers, magazines, books, documents, etc.)". In a similar way, Rosales-Mayor et al. [77] proposed two versions: one for drivers and another for non-drivers. In the latter case, item 8 was replaced by the item "standing and leaning or not on a wall or furniture".
By contrast, the 4-point Likert scale was adopted by all the papers, and the adequacy of this scale was also confirmed with the Rasch model [78,79].
In general, a 1-factor model (mean factor loading range 0.54-0.73) was confirmed [76,78-80]. The unidimensionality was also found in patients with sleep-disordered breathing [81] with factor loadings greater than 0.60, and was confirmed using Rasch analysis (explained variance of 39.92% and factor loading greater than 0.50) [82]. However, this factor structure was challenged by four articles supporting the 2-factor solution, with a mean explained variance equal to 56.06% [77,81,83,84], even if these two factors were loaded by different items (e.g., F1 with items from 1 to 4; F2 with items 5 to 8). As Table 3 shows, the factor structure of the ESS seems to depend on the sample involved in the study, with different organization of items within both factors [77,81,83,84]. For example, one study [82] found that items 1, 2, 3, 4, 5, and 7 were grouped into Factor 1 and items 1, 3, 4, 6, and 8 were grouped into Factor 2, suggesting that several items loaded on both factors. Finally, in a very large sample, Lapin et al. [85] proposed a different factor solution with an improvement of the variance explained from the 1-factor (63.4%) to 3-factor (75.4%) models, with an improvement of the factor loadings from 0.505 to 0.995. In the 3-factor solution, Factor 1 was composed of items 3, 4, 6, and 8, Factor 2 comprised items 2 and 3, and item 5 made up Factor 3. As reported before, the factor structure of the ESS was related to the size of the sample, the type of the sample, and the statistical method used, putting into question the reliability of the total score for detecting EDS [75].
Table 3 (excerpt)
Japanese version (JESS): The daytime dysfunction domain score of the PSQI ranged from 0 to 3; the JESS total score for the group scoring 0 (6.8) was lower than that for the group scoring 3 (13.6). Patients (10.8) had a higher JESS score than did the healthy people (8.4). In 12 patients, a CPAP improvement was associated with a 3.67-point improvement in JESS score.
Education (respondents with a medium level of education = 6.00) and BMI (18.5-23.9 = 6.00) did show differences, with lower ESS scores compared to those with education level and BMI above or below these levels. EDS prevalence (ESS > 10) was lower in respondents with a secondary or high school education, compared with those who had education levels of only primary school or below. EDS (ESS > 10) was more prevalent in respondents with BMI of 28.0 and above than those with normal BMI. The prevalence of EDS (ESS > 10) was higher among respondents with chronic diseases than in those not suffering from such diseases. Respondents with EDS (ESS > 10) had lower scores in SF-36 subscales than those without EDS.
As regards reliability, the ESS demonstrated a mean Cronbach's alpha of 0.82, and the mean corrected item-total correlations ranged from 0.38 to 0.69, considering both original and modified ESS versions [76-85]. Even if in the majority of studies the critical item was item 8, only in one study was an increase of Cronbach's alpha reported, with the exclusion of item 7 [83]. The reliability of the questionnaire was further supported by the test-retest analysis (mean ICC = 0.84), with a temporal interval from 7 to 35 days. Wu et al. [80] reported a split-half reliability coefficient in line with the reported ICC. The ESS appeared to be moderately associated with different PSG variables, such as AHI (Apnoea-Hypopnoea Index), lowest SaO2 (Oxygen Saturation), and mean SaO2 [78,81]. Single items correlated with PSG parameters, as reported in Table 3 [84]. Importantly, the ESS negatively correlated with another sleep questionnaire, the Functional Outcomes of Sleep Questionnaire (FOSQ; from −0.22 to −0.92), and with health-related quality of life (from −0.12 to −0.24), but not with the Life Orientation Test [78,80,83].
Finally, the total ESS score was associated with Body Mass Index (BMI) [78,80], even if this was not systematically confirmed [81].
In general, the articles reviewed reported known-group differences on the basis of AHI criteria (with an increase in total score as AHI increased) [78,84] and PSQI daytime dysfunction score (increase in ESS score as PSQI score increased) [76]. The OSA patients reported a higher total ESS score compared to healthy controls [79,81,82]; insomnia patients also reported a higher EDS than normal individuals [79]. Individuals with a medium level of education and with a normal BMI value reported a lower total ESS score than individuals with a lower or higher level of education and with a higher BMI value [80]. Even if, in our selected articles, no studies performed a ROC curve analysis to test the sensitivity and specificity of the proposed cut-off for detecting EDS or proposed clinical cut-off values, four articles reported a clear responsiveness, given that patients with OSAS who underwent treatment with Continuous Positive Airway Pressure (CPAP) for different durations (from 1 month to 6-9 months) showed a relevant drop in the total ESS score after treatment compared to before [76,77,81,84]. The changes in the total ESS score were greater in patients with severe OSAS.
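A ROC curve analysis evaluates the sensitivity and specificity of each candidate cut-off against a reference standard. The sketch below shows this computation for a rule of the form "score > cut-off", as in ESS > 10. The function name, the scores, and the reference labels are hypothetical illustrations and do not reproduce data from any reviewed study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float],
    has_condition: Sequence[bool],
    cutoff: float,
) -> Tuple[float, float]:
    """Treat score > cutoff as a positive screen and compare with the reference."""
    tp = fp = tn = fn = 0
    for score, condition in zip(scores, has_condition):
        positive = score > cutoff
        if positive and condition:
            tp += 1
        elif positive and not condition:
            fp += 1
        elif not positive and condition:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case with and one without the condition")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical total scores and reference-standard labels (True = condition present).
scores = [4, 8, 11, 12, 15, 18, 6, 9, 13, 20]
reference = [False, False, False, True, True, True, False, True, True, True]

# Scanning candidate cut-offs gives the points from which a ROC curve is drawn.
for cutoff in (8, 10, 12):
    sens, spec = sensitivity_specificity(scores, reference, cutoff)
    print(f"cut-off > {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Raising the cut-off trades sensitivity for specificity, which is why cut-offs validated in one population may not transfer to another.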

Discussion
This comprehensive literature search identified 49 studies evaluating psychometric properties and latent factor structure of different questionnaires that measure sleep quality in adult populations. The studies were selected only when it was possible to extract the dimensionality (e.g., EFA, CFA, PCA, or Rasch model), the reliability (Cronbach's alpha and/or test-retest coefficients), construct validity (convergent/divergent correlations and/or known-group differences), and, where available, other information such as ROC curve analysis or responsiveness to clinical treatment. In the present paper, we review studies from 2008 to 2020 featuring the sleep questionnaires reported by Martoni and Biagi [22], including only those questionnaires (PSQI, AIS, ISI, MSQ, JSS, LSEQ, SLEEP-50, and ESS) that met the aforementioned criteria. After observing the reliability and the validity of the selected sleep questionnaires, we can confirm that sleep quality is a complex construct which covers both sleep difficulties and daytime impairment. Although all selected tools, with the exception of the ESS, contain items assessing night-time sleep to a different extent (e.g., SLEEP-50), the MSQ [66] is the only self-report scale with a clear distinction between night-time sleep disorder and daytime functioning/problems. Indeed, the remaining questionnaires have shown single components or factors assessing daytime impairment due to poor sleep quality (e.g., C7 for PSQI [21,25,26,30-34,36,38,39], daytime dysfunction factor for AIS [43,48-50], Impact of Insomnia factor for ISI [56,58-63], Behavior following awakening factor for LSEQ [70], or impact of sleep complaints on daily functioning for SLEEP-50 [72,73]) or single items assessing the subjective feelings of fatigue and sleepiness despite receiving a typical night's rest for JSS-4 [68].
By way of contrast, the ESS is intended to differentiate individuals with EDS from alert people and, thus, it measures sleep propensity in eight different daily situations or different levels of somnificity [74,75]. Even if Martoni and Biagi [22] also reported two other subjective measures of sleepiness, such as the Stanford Sleepiness Scale (SSS) [86] and the Karolinska Sleepiness Scale (KSS) [87], not only is the ESS one of the most used tools for the EDS [88] (as also demonstrated by the lack of articles from 2008 to 2020 with our inclusion criteria for SSS and KSS), but it is also different from the SSS and KSS. Indeed, the SSS and KSS are usually used to assess "state" sleepiness, that is, these scales are used to measure short-term changes in sleepiness [89] and are considered a measure of subjective feeling of drowsiness (or fatigue) [88]. Indeed, the SSS is based on a Likert-type scale with seven vigilance levels and people are asked to indicate which level best describes their current state. The KSS is based on a 9-point scale measuring the subjective level of sleepiness at a particular time during the day, in which individuals indicate which level best reflects the situational sleepiness. In contrast, the ESS measures a global level (daily life soporific situations) of sleepiness, that is a "trait" aspect of sleepiness [74,75,89]. Thus, sleep quality should be evaluated using a combination of the different tools reviewed, in order to obtain a complete picture of both sleep and daytime impairments.
Taking into account the importance of dimensionality, reliability, and validity for research, epidemiological, and clinical studies, we discuss the results below under the domains of factor structure, reliability, and construct validity. In general, a clear and unique factor structure was not defined for any of the self-reported questionnaires included, with the exceptions of MSQ [66,67], JSS [69], and SLEEP-50 [73]. As reported in a previous review [90], the structured categorical data of the PSQI could be sensitive to the specific model (method of extraction) being applied. The lack of a defined factor structure for the PSQI could be related to the fact that parsimony is not applied and EFA or CFA are not used (with a concomitant lack of reporting of relevant details). In addition, the heterogeneity of the samples analyzed (with highly variable characteristics in terms of societal stressors, medical pathology, sleep medication use, pain, etc.), the small sample sizes, the proposed modified versions, and the cultural translations limited any interpretation of the different factor structures, and called into question the value of the global PSQI score in detecting poor and good sleepers. In a similar way, the factor analyses performed on the ISI scale demonstrated a multidimensionality of the questionnaire. In this case, it is worth noting that the 3-factor solution was found in European (Spanish, Italian, German, and Danish) individuals, probably reflecting a more typical characteristic of the clinical disorder in these cultures [56,61-63]; the 2-factor solution was more widespread in different cultures from Africa [60] to Asia [58,59], through Europe [61].
However, unidimensionality was found in a large sample (in total 2887 participants), including both patients (cancer, chronic pain, and traumatic brain injury; [47,54,57]) and community-dwelling individuals such as students and employees [55,56], putting the real dimensionality of the self-report questionnaire into question. In the 2-factor solution it was consistently confirmed that items 6 and 8 seemed to indicate a daytime dysfunction, reflecting the association between sleep and wake (i.e., a poor night's sleep impacts on subsequent daytime functioning and a stressful daytime impacts on the subsequent night's sleep). In a similar way, the ESS mainly showed different factor solutions. Moreover, many concerns regarded item 8 ("in traffic"), probably due to a general misunderstanding of whether it means being in the car as a driver or a passenger [77]. Finally, the single reviewed study concerning the LSEQ did not confirm any factor solution because several solutions showed a best fit with some indices but not with others, bringing into question the real latent structure underlying the LSEQ [71]. Altogether, these results highlight the need to report procedural details (e.g., EFA and CFA in the same article) and to apply standard practices in order to streamline the debate on the heterogeneity of self-report questionnaires, with the main goal being to disentangle the meaning of a total (or more than one) (sub-)score for clinicians. In contrast, the MSQ, JSS, and SLEEP-50 showed a clear factor structure. These findings seem to support the clinical applicability of these factor structures, although the SLEEP-50 seemed to be more time-consuming for both respondents and the scorer. However, studies investigating the factor structure of the MSQ, JSS, and SLEEP-50 in the past 12 years have been scarce and thus we recommend further studies aimed at confirming the unidimensionality of the JSS, the 2-factor solution of the MSQ, and the multidimensionality of the SLEEP-50.
As regards the psychometric properties, the present review showed that all self-report questionnaires reported good reliability, with a Cronbach's alpha greater than 0.70 (Tables 1-3) [18,19], even if in eight articles (six for PSQI, one for AIS, and one for MSQ) the Cronbach's alpha was <0.70. In general, the reviewed findings indicated that the included questionnaires showed a positive rating for within- and between-group comparisons. Importantly, five studies [57,61,69,73,85] reported Cronbach's alpha within the ideal range for comparison of individual scores (i.e., 0.90-0.95). These results could support the notion of sleep quality being based on both reflective and formative models [91]. The consistency of the selected scales was also confirmed by analysis of the internal homogeneity of the questionnaire. Indeed, all self-reported questionnaires demonstrated a corrected item-total correlation in the range of 0.30-0.86. In addition, the test-retest showed good reliability of all tools in different temporal intervals (from 1 week to 6/9 months), supporting the stability of the dimension being measured between the test and the retest. However, we observed that only 19 out of 49 selected studies (slightly less than 40%) performed a test-retest correlation and thus we recommend further research to establish the test-retest reliability of the self-reported questionnaires (especially for JSS, LSEQ, and SLEEP-50) over different time periods, considering that the time periods (past month, past 2 weeks, past week, or habitual experience) covered by selected tools differ.
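The two internal consistency indices discussed above can be made concrete with a short sketch. It uses the standard formula for Cronbach's alpha, alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score), and the corrected item-total correlation, i.e., the Pearson correlation between one item and the sum of the remaining items. The function names and the response matrix are hypothetical and serve only as an illustration.

```python
from statistics import mean, variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """rows[i][j] is the rating given by respondent i to item j."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return numerator / denominator


def corrected_item_total(rows: Sequence[Sequence[float]], j: int) -> float:
    """Correlate item j with the sum of all other items (item j removed)."""
    item = [row[j] for row in rows]
    rest = [sum(row) - row[j] for row in rows]
    return pearson(item, rest)


# Hypothetical ratings: five respondents, four items scored 0-4.
data: List[List[int]] = [
    [0, 1, 0, 1],
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 3, 2, 3],
    [4, 3, 4, 4],
]
print(f"alpha = {cronbach_alpha(data):.2f}")
for j in range(4):
    print(f"item {j + 1}: corrected item-total r = {corrected_item_total(data, j):.2f}")
```

With real data, an alpha above 0.70 and corrected item-total correlations of at least 0.30 would correspond to the thresholds referred to in this review.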
The basic principle of construct validation is that hypotheses are formulated about the relationship between scores of the instrument of interest and scores of other instruments measuring a similar or different construct. The published findings we assessed outlined high correlations of the PSQI with other measures of sleep quality (e.g., ISI) or sleep measures [29,31,38], while it reported weak or no associations with other measures such as ESS [29,31]. As far as convergent validity is concerned, the PSQI showed moderate associations with depression and general quality of life, supporting the similarity in constructs between scales, and reinforcing evidence from clinical epidemiological studies that document high degrees of comorbidity between sleep and psychiatric disorders [92]. It is possible that altered mood states (e.g., depression and anxiety) may influence the perception of one's physiological state, including somatic symptoms [13,93], and, hence, the associations observed which were based on participants' self-report may be subject to bias. However, the poor correlations among objective measures of sleep quality (e.g., PSG or actigraphy) and the PSQI (Table 1; [32,36]) confirm previous studies [21]. This reduced level of concurrent validity could be explained by different aspects. First, daily fluctuations of sleep cannot be described by a questionnaire investigating sleep quality over the past month. Second, the dissociation between objective and subjective measures of sleep could be due to different aspects such as sleep setting, personality traits, and constitutional factors [94]. Third, the self-report questionnaire is very vulnerable to memory processes (especially remembering information from the past 4 weeks) and misperception, with the tendency to exaggerate number and gravity of symptoms [1]. 
Among the sleep disorder scales (Table 2), the AIS in its two versions showed convergent validity and had moderate associations with different sleep scales (PSQI and ISI) and with anxiety and/or depression [45][46][47][48][49][50]. On one hand, these results confirmed that the selected self-report tools converged towards a similar construct of subjective sleep quality. On the other hand, the strong comorbidity between sleep and mood alterations, and sleep and psychological well-being is further confirmed [13,93,94]. A similar pattern, with moderate-to-strong correlations, was also found for the ISI, reflecting a true sleep disturbance construct. However, the ISI did not correlate with ESS [59,62], even if daytime sleepiness can be caused by overnight insomnia, and studies had shown that not only do insomniacs not sleep effectively at night but also have difficulty falling asleep in the daytime [95]. Moreover, the ISI score and the three specific items reported small correlations with PSG and sleep diary variables such as WASO, EMA (Early Morning Awakening), TST, TWT, and SE [59,63].
As before, the subjective and objective measurements of sleep do not show proper correlations, probably due to the discrepancy between the recall period of the ISI and the single night of a PSG recording [96]. In addition, the crucial role of the first three items could indicate how these items, which investigate three different subtypes of insomnia, are useful for the diagnosis and the assessment of insomnia, while the remaining items (5-7) are independent factors assessing daytime effects of insomnia. The remaining sleep disorder scales showed low-to-moderate correlations with PSQI and perceived stress or psychological distress [67,69,73], but the small number of studies that have used the MSQ, JSS, or SLEEP-50 in the past decade casts doubt as to their construct validity. Finally, the ESS showed a weak-to-moderate association [78,81,84] with different PSG variables (AHI, lowest SaO2, mean SaO2, mean latency to sleep, mean latency to REM sleep, and number of times the patient falls asleep), in line with what has been reported in a previous review [88]. These findings may reflect the need for multiple measures in a comprehensive evaluation of sleepiness, as well as the high heterogeneity of the studies included and the diversity within target populations [88]. The negative correlations between ESS and FOSQ or Short Form Health Survey-36 (SF-36) dimensions indicated that higher daytime sleepiness was related to functional impairments in a broad range of activities, with a main impairment in activity level, vigilance, and social outcome. Daytime sleepiness seemed to influence more or less all parts of life to such an extent that people with the condition perceive themselves as being generally more limited by their health than those without it [78,80,83].
However, few studies tested a priori hypotheses about the relationship between sleep quality, insomnia, or daytime sleepiness and other constructs by stating the expected direction and magnitude of associations beforehand, based on what was known about the construct under study. Thus, we recommend assessing the similarity and/or dissimilarity between constructs against hypotheses formulated a priori. These hypotheses could be derived from the content of the tools used and a clear description of what is known about the study population [11].
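The recommendation above can be illustrated with a minimal sketch in Python. It assumes that an a priori hypothesis is stated as an expected sign and an expected range for the absolute Pearson correlation between two questionnaire scores. The function names, the example scores, and the 0.30 to 0.70 range are hypothetical and are not taken from any of the reviewed studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def hypothesis_supported(r, expected_sign, min_abs, max_abs=1.0):
    """Check an observed r against a hypothesis fixed before the analysis.

    expected_sign: +1 for a positive association, -1 for a negative one.
    min_abs, max_abs: expected range for the absolute value of r.
    """
    same_direction = (r > 0) == (expected_sign > 0)
    return same_direction and min_abs <= abs(r) <= max_abs


# Hypothetical total scores on an insomnia scale and an anxiety scale.
insomnia_scores = [4, 9, 12, 15, 7, 20, 11, 3]
anxiety_scores = [5, 7, 6, 12, 9, 14, 8, 6]

# Hypothesis stated beforehand: positive, moderate association (0.30 to 0.70).
r = pearson_r(insomnia_scores, anxiety_scores)
print(round(r, 2), hypothesis_supported(r, expected_sign=+1, min_abs=0.30, max_abs=0.70))
```

Fixing the expected direction and magnitude in code before the data are inspected makes it explicit whether an observed correlation confirms or contradicts the hypothesis, rather than being interpreted after the fact.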
Further evidence of construct validity can be observed in the results derived from known-group differences, which are of considerable value for clinical practice. As shown in Table 1, the PSQI demonstrated robust known-group validity based on both proposed cut-off points and other sleep disorder assessments. In addition, good and poor sleepers were discriminated according to different cut-off scores on psychological or medical variables. These results were further confirmed by regression analyses, which revealed that depression, anxiety, and stress predicted poor sleep quality [34,38]. However, few studies performed ROC curve analysis, and future investigations should test the critical points for distinguishing poor and good sleepers, especially when a multidimensional factor structure is proposed. Table 2 also confirmed the known-group validity of the AIS in different target populations [45][46][47][48][49][50], of the ISI for different criteria [47,58,59,61], of the MSQ subscales [66] in detecting hypersomnia and insomnia problems (or compared to the PSQI [67]), of the JSS-4 in proposing normative values, and of the LSEQ [71] and SLEEP-50 [73] as standardized tools for screening multiple sleep complaints. Importantly, for both the AIS and the ISI, the proposed cut-off values allowed for discrimination between insomniacs and non-insomniacs, with objective confirmation using actigraphic data [47]. However, only 9 out of the 21 studies included performed ROC curve analysis, which limits the possibility not only of testing the sensitivity and specificity of the original cut-off points in different cultures and populations but also of comparing different tools with each other in terms of validity. Future studies are needed to bridge this gap. In a similar way, the ESS demonstrated a strong ability to detect individuals with differences in daytime sleepiness, such as OSA and narcolepsy patients [76][77][78][80][81][82][84,85].
In addition, we found four articles which clearly demonstrated the responsiveness of the ESS to CPAP treatment, with a significant drop in the total score, suggesting that the ESS is able to discern the severity of OSAS [76,77,81,84]. However, in our review we did not find any study that performed an ROC curve analysis in order to test the cut-off points for the detection of excessive daytime sleepiness (EDS); such studies are recommended.
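To make the recommended ROC curve analysis concrete, the following minimal Python sketch computes sensitivity and specificity at every candidate cut-off and selects the one that maximizes Youden's J (sensitivity + specificity − 1). The choice of Youden's J as the selection criterion, the function names, and the example scores are assumptions made for illustration. The reviewed studies may have used other criteria, and the data are invented.

```python
def roc_points(scores, is_case):
    """Return (cutoff, sensitivity, specificity) for each observed score.

    A respondent is classified as positive when score >= cutoff.
    is_case holds the reference diagnosis (True for a clinical case).
    """
    n_pos = sum(1 for c in is_case if c)
    n_neg = len(is_case) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both cases and non-cases are required")
    points = []
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, c in zip(scores, is_case) if c and s >= cutoff)
        true_neg = sum(1 for s, c in zip(scores, is_case) if not c and s < cutoff)
        points.append((cutoff, true_pos / n_pos, true_neg / n_neg))
    return points


def best_cutoff(scores, is_case):
    """Cut-off with the highest Youden's J (first one in case of ties)."""
    return max(roc_points(scores, is_case), key=lambda p: p[1] + p[2] - 1)


# Hypothetical questionnaire totals and reference diagnoses.
totals = [3, 5, 6, 8, 9, 11, 12, 14, 16, 18]
diagnosed = [False, False, False, False, True, False, True, True, True, True]

cutoff, sensitivity, specificity = best_cutoff(totals, diagnosed)
print(cutoff, sensitivity, specificity)
```

Repeating such an analysis in each culture and target population would show whether an original cut-off retains acceptable sensitivity and specificity, and would allow different tools to be compared on the same footing.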
The present review has potential limitations. For example, the psychometric properties were not appraised using the "Consensus-Based Standards for the Selection of Health Status Measurement Instruments" (COSMIN) checklist, an instrument developed to evaluate the methodological quality of studies on measurement properties of health-status related questionnaires [97], although the COSMIN approach may set standards that are too high to achieve a good rating on some of the criteria [98]. Another limitation could be related to the choice not to perform a meta-analysis, but the discrepancies between studies made this largely impractical. Although we recommend that these limitations be taken into account in the future, we obtained and presented data in terms of factor structure, reliability, and construct validity in a similar way to the reviews in which the COSMIN method was adopted [7][8][9][10][11][12][88][89][90]. In addition, the present qualitative review provides a picture of the psychometric properties of different tools, taking into account the heterogeneity of the reviewed methods, populations, and statistical analyses, which could be considered a potential limitation for a meta-analysis. Finally, we have updated the review published in 2007 by Martoni and Biagi [22], in which 26 self-report questionnaires were reviewed. In this way, we have reported the factorial structures and psychometric properties of sleep quality questionnaires that have emerged during the past 12 years, suggesting their usefulness in epidemiological, research, and clinical settings.

Conclusions
In summary, sleep quality is a multifactorial construct that is difficult to define and measure objectively, considering the great variability between individuals and its subjective nature. In the present review, we focused on the self-report questionnaires that have been validated in terms of internal consistency, construct validity, and latent factor structure. The PSQI, the most widely used questionnaire for subjective sleep quality [21], demonstrated good reliability and validity, especially known-group validity. However, questions regarding its factor model, the long recall period, and the scoring system challenge the value of the global PSQI score for distinguishing poor and good sleepers. Several sleep disorder questionnaires have been proposed, since complaints about sleep quality are common and poor sleep quality can be an important symptom of many sleep disorders. These tools show good psychometric properties and are easy to administer and score (with the exception of the LSEQ and SLEEP-50). Additional studies are needed, on the one hand, to clarify the factor structure of the AIS and ISI and, on the other hand, to add to the evidence regarding the validation of both the MSQ and the JSS in epidemiological, research, and clinical use, considering that both are inexpensive and easy to administer, complete, and score. Finally, the ESS showed good internal consistency and established known-group construct validity. Questions remain concerning the factor structure of the ESS and the definition of cut-off scores for clinical use. In conclusion, all self-report questionnaires assessing subjective sleep quality require further studies of high methodological quality to assess their measurement properties, not only in terms of reliability and construct validity but also in terms of latent structure and discriminability. In particular, studies are required in the area of test-retest reliability as well as responsiveness to change.
Among the subjective questionnaires included, the MSQ appears to be the most suitable for ascertaining the presence of sleep disorders (i.e., insomnia disorder) and the negative daytime consequences of poor sleep (e.g., daytime sleepiness), with a complete assessment of the sleep-wake cycle. In addition, it is inexpensive and easy to administer, complete, and score, with good psychometric properties. A recently developed tool, the Sleep Condition Indicator (SCI) [99], based on the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [100], could also be considered in future studies. Indeed, the SCI is a short tool (eight items) that is easy to administer (each item is scored on a five-point scale) and easy to score (total score from 0 to 32, with higher values indicating better sleep). The SCI is reliable (Cronbach's α = 0.73, mean item-total correlation ranges from 0.42 to 0.50, ICC = 0.84) and valid (gender differences), with a specific cut-off of 7. The assessment of its dimensionality and a further investigation of its construct validity are recommended.
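For readers less familiar with the internal consistency index quoted above, the following minimal Python sketch computes Cronbach's α from an item-by-respondent matrix using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the example responses are hypothetical and do not reproduce any data from the reviewed studies. The only detail taken from the text is the assumption that items are scored 0 to 4.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    if len(responses) < 2:
        raise ValueError("at least two respondents are required")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total score across respondents.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of five respondents to a four-item scale scored 0 to 4.
answers = [
    [3, 4, 3, 4],
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [0, 1, 1, 0],
]
print(round(cronbach_alpha(answers), 2))
```

Values closer to 1 indicate that the items vary together, which is the property summarized by the internal consistency coefficients reported for the questionnaires in this review.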