Abstract
Background/Objective: Existing abbreviated Geriatric Depression Scales (GDSs), derived via Classical Test Theory (CTT), often sacrifice accuracy for brevity and retain non-specific items. We aimed to develop a minimum-item GDS maintaining diagnostic performance equivalent to the full 30-item scale (GDS30) using Item Response Theory (IRT). Methods: This cross-sectional study employed rigorous 5:5 split-sample cross-validation. Participants included 6525 older adults (aged ≥60 years) from community-based (Korean Longitudinal Study on Cognitive Aging and Dementia) and clinical settings (geropsychiatry clinic). Depression was diagnosed through standardized clinical interviews based on DSM-IV criteria. Two-parameter logistic IRT models estimated item discrimination and difficulty parameters. Sequential item reduction with DeLong tests identified the minimum number of items required to maintain GDS30-equivalent area under the curve (AUC). Results: The 10-item IRT-optimized scale (GDS10-IRT) achieved an AUC of 0.856 (95% CI: 0.809–0.895) in the validation set, showing no significant difference from GDS30 (AUC = 0.883; p = 0.396). Conversely, the 15-item GDS (GDS15) demonstrated significantly lower AUC than GDS30 (p < 0.001) despite having more items. GDS10-IRT achieved a 234% improvement in efficiency ratio (AUC/items) over GDS30. Notably, Item 16 (“feeling downhearted and blue”), identified as the most discriminating symptom (a = 2.53), is absent from the GDS15 but included in GDS10-IRT. Conclusions: IRT-based item selection achieves GDS30-equivalent diagnostic accuracy with only 10 items, outperforming the widely used GDS15. By recovering high-discrimination items excluded by CTT, the GDS10-IRT offers a more efficient, specific screening tool for late-life depression.
1. Introduction
Late-life depression (LLD) is one of the most prevalent psychiatric disorders among older adults and is associated with increased morbidity, mortality, medical illness, and dementia [1]. Depression is the leading cause of disability measured by Years Lived with Disability and the fourth leading contributor to the global burden of disease [2]. However, LLD remains underrecognized and undertreated due to its subsyndromal features and complicated etiologies.
The 30-item Geriatric Depression Scale (GDS30), developed by Yesavage and colleagues, has become one of the most widely used depression screening instruments for older adults [3]. Unlike other screening instruments such as the Beck Depression Inventory [4] or the Center for Epidemiologic Studies Depression Scale [5], the GDS does not contain items regarding physical symptoms that are prevalent in older adults due to comorbid medical conditions. In addition, its simple yes/no response format enhances reliability and shortens administration time in older populations.
Despite these advantages, the length of GDS30 poses practical barriers in busy clinical settings, prompting the development of numerous abbreviated versions. The most widely adopted short form is the 15-item GDS (GDS15), developed using Classical Test Theory (CTT) methods on Western samples [6]. Subsequently, even shorter versions have been proposed, including a 4-item version [7] and a 10-item version [8]. However, all these abbreviated versions were derived from GDS15 using CTT methods, inheriting any limitations in its item selection.
CTT selects items based on item-total correlations, which favor moderately endorsed items while potentially overlooking items with low endorsement rates but high diagnostic specificity. Item Response Theory (IRT) offers a fundamentally different approach, estimating each item’s discrimination power and difficulty independently of endorsement frequency [9]. This allows identification of ‘quiet but powerful’ items (symptoms rarely reported but almost diagnostic when present) that CTT methods may systematically exclude.
A critical but often overlooked question in scale abbreviation is: how many items are actually necessary to maintain the full scale’s diagnostic accuracy? The widespread acceptance of 15- and 10-item versions assumes that significant item reduction necessarily compromises performance. However, if IRT can identify the most discriminating items from the complete item pool, substantially fewer items might suffice while maintaining or even improving diagnostic accuracy.
We hypothesized that IRT-based item selection from the complete GDS30 item pool would identify a minimum-item version that: (1) maintains diagnostic accuracy statistically equivalent to GDS30, (2) requires fewer items than existing CTT-derived short forms, and (3) recovers high-discrimination items that were lost in previous CTT-based reductions. Specifically, we suspected that the widely accepted 15-item threshold might be an artifact of CTT methodology rather than a true psychometric necessity. This study aimed to develop and validate the minimum-item GDS maintaining GDS30 equivalent performance, with rigorous cross-validation to ensure generalizability.
2. Methods
2.1. Study Design and Participants
Data were drawn from the Korean Longitudinal Study on Cognitive Aging and Dementia (KLOSCAD), an ongoing nationwide population-based prospective cohort study [10,11], and clinical samples from the Geropsychiatry Clinic of Seoul National University Bundang Hospital. KLOSCAD participants were randomly sampled from residents aged 60 years or older in 13 districts across South Korea, stratified by age and sex. The total sample comprised 6525 participants (community-dwelling: n = 5872; clinical: n = 653).
All participants provided written informed consent, and the study protocol was approved by the Institutional Review Board of Seoul National University Bundang Hospital.
2.2. Diagnostic Assessment
Standardized clinical interviews, physical examinations, and neurological examinations were administered to all participants using the Korean version of the Mini-International Neuropsychiatric Interview (MINI) [12] and the Korean version of the Consortium to Establish a Registry for Alzheimer’s Disease assessment battery (CERAD-K) [13] by geropsychiatrists with advanced training in geriatric psychiatry and dementia research. Axis I psychiatric disorders including major and minor depressive disorder were diagnosed according to the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) criteria [14]. Participants with dementia, delirium, or other major psychiatric disorders that could affect mood assessment were excluded.
2.3. Geriatric Depression Scale
All participants completed the Korean version of the GDS30, which was developed through rigorous translation and back-translation procedures and has demonstrated excellent psychometric properties in Korean older adults (Cronbach’s α = 0.90; test–retest reliability = 0.91) [15].
2.4. Sample Splitting for Cross-Validation
To ensure robust validation and prevent overfitting, the sample was randomly divided into development (n = 3262) and validation (n = 3263) sets using 5:5 stratified sampling by depression status and enrollment source. IRT parameters were estimated in the development set, and diagnostic accuracy was evaluated in both sets independently. Cross-validation was assessed by comparing the area under the receiver operating characteristic (ROC) curves between development and validation sets for each scale using DeLong tests [16]; non-significant differences (p > 0.05) would indicate stable performance without overfitting.
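For illustration, a minimal R sketch of the stratified 5:5 split is shown below; the data frame gds_data and its columns depression (0/1) and set/source labels are assumed names for this example and are not taken from the study's analysis code.

```r
# Minimal sketch of a 5:5 split stratified by depression status and enrollment
# source (illustrative; assumes a data frame `gds_data` with `depression` and `source`).
library(dplyr)
set.seed(2024)  # arbitrary seed chosen for reproducibility of the example

split_data <- gds_data %>%
  group_by(depression, source) %>%                 # stratify on both variables
  mutate(set = ifelse(row_number() %in% sample(n(), floor(n() / 2)),
                      "development", "validation")) %>%
  ungroup()

dev_set <- filter(split_data, set == "development")
val_set <- filter(split_data, set == "validation")
```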
2.5. Item Response Theory Analysis
Two-parameter logistic (2PL) IRT models were fit to all 30 GDS items in the development set using the mirt package in R. For each item, two parameters were estimated: (1) the discrimination parameter (a), indicating how well the item differentiates between depressed and non-depressed individuals; and (2) the difficulty parameter (b), indicating the depression severity level at which the item has a 50% probability of being endorsed. Items were ranked by discrimination, with the highest-discriminating items selected for the short form.
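A minimal sketch of this estimation step with the mirt package follows; dev_items is an assumed name for a data frame holding the 30 dichotomous item responses from the development set.

```r
# 2PL IRT estimation sketch (illustrative). Under the 2PL model,
# P(endorse | theta) = 1 / (1 + exp(-a * (theta - b))),
# where a is the discrimination and b the difficulty parameter.
library(mirt)

fit_2pl <- mirt(dev_items, model = 1, itemtype = "2PL")

# Discrimination (a) and difficulty (b) on the conventional IRT scale
pars <- coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)$items[, c("a", "b")]

# Rank items by discrimination; the top-ranked items are candidates for the short form
item_ranking <- pars[order(pars[, "a"], decreasing = TRUE), ]
```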
2.6. Sequential Item Reduction Analysis
Rather than arbitrarily selecting a round number of items (e.g., 10 or 15), we performed a data-driven sequential item reduction analysis to determine the minimum number of items maintaining GDS30-equivalent diagnostic performance. Starting from the top 10 discriminating items, we progressively removed the item with the lowest remaining discrimination at each step. At each reduction step, we calculated the AUC in both development and validation sets and compared each against GDS30 using the DeLong test [16]. The minimum-item version was defined as the fewest number of items showing no statistically significant AUC difference from GDS30 (p > 0.05) in the validation set, with the additional requirement of non-significance in the development set to ensure stability. This dual-criterion stopping rule guards against chance findings that might not replicate. In addition, we examined the test information function to identify potential ‘elbow points’ where marginal information gains diminished substantially.
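The reduction loop can be sketched as follows; ranked_items (item names sorted by decreasing discrimination), all_items, dev_set, and val_set are assumed objects, and the loop is a schematic of the procedure rather than the authors' exact code.

```r
# Schematic of the sequential item-reduction analysis (illustrative).
library(pROC)

roc30_dev <- roc(dev_set$depression, rowSums(dev_set[, all_items]), quiet = TRUE)
roc30_val <- roc(val_set$depression, rowSums(val_set[, all_items]), quiet = TRUE)

results <- data.frame()
for (k in 10:4) {                                   # step down from the top 10 items
  keep    <- ranked_items[1:k]                      # retain the k most discriminating items
  roc_dev <- roc(dev_set$depression, rowSums(dev_set[, keep]), quiet = TRUE)
  roc_val <- roc(val_set$depression, rowSums(val_set[, keep]), quiet = TRUE)
  results <- rbind(results, data.frame(
    n_items = k,
    auc_dev = as.numeric(auc(roc_dev)),
    auc_val = as.numeric(auc(roc_val)),
    p_dev   = roc.test(roc_dev, roc30_dev, method = "delong")$p.value,  # DeLong vs. GDS30
    p_val   = roc.test(roc_val, roc30_val, method = "delong")$p.value))
}
# Minimum-item version: smallest k with p_dev > 0.05 and p_val > 0.05
```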
2.7. Statistical Analysis
Demographic and clinical characteristics were summarized using means and standard deviations (SD) for continuous variables and frequencies and percentages for categorical variables. Comparisons between development and validation sets were performed using independent samples t-tests for continuous variables and chi-square tests for categorical variables to confirm successful randomization.
ROC curve analyses were performed to evaluate diagnostic accuracy. AUC comparisons between scales were performed using the DeLong test for correlated ROC curves. Sensitivity and specificity were calculated at optimal cutoff scores determined by Youden’s index. Comparisons of sensitivity and specificity between scales within the same sample were performed using McNemar’s test for paired proportions, while comparisons between development and validation sets were performed using chi-square tests for independent proportions.
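As an illustration of these steps, the sketch below derives Youden-optimal cutoffs with the pROC package and compares sensitivity between two scales with McNemar's test; val_set, gds10_items, and gds30_items are assumed names, not objects from the study code.

```r
# Cutoff selection by Youden's index and a paired sensitivity comparison (illustrative).
library(pROC)

roc10 <- roc(val_set$depression, rowSums(val_set[, gds10_items]), quiet = TRUE)
roc30 <- roc(val_set$depression, rowSums(val_set[, gds30_items]), quiet = TRUE)

# Optimal cutoffs by Youden's index
cut10 <- coords(roc10, "best", best.method = "youden", ret = "threshold")[[1]]
cut30 <- coords(roc30, "best", best.method = "youden", ret = "threshold")[[1]]

# Sensitivity comparison within the same sample: among diagnosed cases,
# paired positive/negative screening calls by each scale
cases <- val_set$depression == 1
pos10 <- rowSums(val_set[cases, gds10_items]) >= cut10
pos30 <- rowSums(val_set[cases, gds30_items]) >= cut30
mcnemar.test(table(pos10, pos30))
```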
The efficiency ratio was defined as AUC divided by the number of items (AUC/items), quantifying diagnostic accuracy achieved per item. To compare efficiency ratios between scales within the same sample, we employed a bootstrap approach: in each of 1000 bootstrap iterations, we resampled subjects with replacement, calculated AUCs for both scales, computed the efficiency ratio for each, and obtained the difference. Statistical significance was determined using the percentile method: if the 95% bootstrap confidence interval for the difference excluded zero, the comparison was considered significant (p < 0.05). For cross-validation comparisons (development vs. validation sets), efficiency ratios and their bootstrap standard errors were computed independently in each set, and the difference was tested using the z-statistic z = (Efficiency_dev − Efficiency_val) / √(SE²_dev + SE²_val), with significance assessed against the standard normal distribution.
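A sketch of the within-sample bootstrap comparison is given below (assumed object names as above); the between-sample test simply plugs the two independently bootstrapped standard errors into the z-formula stated in the text.

```r
# Bootstrap comparison of efficiency ratios (AUC / number of items) between
# GDS10-IRT and GDS30 within the same sample (illustrative).
library(pROC)
set.seed(1)

n_boot   <- 1000
diff_eff <- numeric(n_boot)
for (b in seq_len(n_boot)) {
  idx   <- sample(nrow(val_set), replace = TRUE)                 # resample subjects
  boot  <- val_set[idx, ]
  auc10 <- auc(roc(boot$depression, rowSums(boot[, gds10_items]), quiet = TRUE))
  auc30 <- auc(roc(boot$depression, rowSums(boot[, gds30_items]), quiet = TRUE))
  diff_eff[b] <- auc10 / 10 - auc30 / 30                         # efficiency-ratio difference
}
quantile(diff_eff, c(0.025, 0.975))  # percentile 95% CI; significant if zero is excluded

# Between-sample test: z = (eff_dev - eff_val) / sqrt(se_dev^2 + se_val^2),
# with se_dev and se_val the bootstrap SDs of the efficiency ratios in each set
```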
To evaluate measurement invariance, we examined differential item functioning (DIF) across three grouping variables: sex (male vs. female), age group (young-old [60–74 years] vs. old-old [≥75 years]), and recruitment setting (community vs. clinical). DIF was assessed using the Mantel-Haenszel (MH) procedure, with effect sizes classified according to Educational Testing Service (ETS) criteria: Category A (|ΔMH| < 1.0; negligible), Category B (1.0 ≤ |ΔMH| < 1.5 with p < 0.05; moderate), and Category C (|ΔMH| ≥ 1.5 with p < 0.05; large). To assess scale-level impact (Differential Test Functioning), we compared AUC values across subgroups.
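The Mantel-Haenszel step for a single item can be sketched with base R as below; the data frame dat and its columns item16, sex, and gds_total are assumed names, and the ETS delta is obtained from the common odds ratio as ΔMH = -2.35 × ln(αMH).

```r
# Mantel-Haenszel DIF check for one item across sex, stratified on total score (illustrative).
brks   <- unique(quantile(dat$gds_total, probs = seq(0, 1, 0.2)))
strata <- cut(dat$gds_total, breaks = brks, include.lowest = TRUE)  # matching strata

mh       <- mantelhaen.test(table(dat$item16, dat$sex, strata))
alpha_mh <- unname(mh$estimate)            # common odds ratio across strata
delta_mh <- -2.35 * log(alpha_mh)          # ETS delta scale
c(delta_MH = delta_mh, p_value = mh$p.value)
```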
All statistical analyses were performed using R (version 4.5.2). IRT analysis was conducted using the mirt package for two-parameter logistic model estimation. ROC curve analyses, including AUC calculation with 95% confidence intervals and DeLong tests for correlated ROC curves, were performed using the pROC package. Internal consistency (Cronbach’s α) was calculated using the psych package. Statistical significance was set at α = 0.05 (two-tailed).
3. Results
3.1. Sample Characteristics
Table 1 presents the demographic and clinical characteristics of the total sample and comparisons between development and validation sets. The total sample (N = 6525) had a mean age of 72.4 years (SD = 7.2; range: 60–95) and was 58.0% female. Mean education was 7.8 years (SD = 5.2). The majority of participants (90.0%) were enrolled through the KLOSCAD, with the remaining 10.0% recruited from the geropsychiatry clinic. All participants were community-dwelling. Depression prevalence was 3.8% overall (n = 248), with higher rates in the clinic sample (13.0%) than in the KLOSCAD sample (3.2%). The development (n = 3262) and validation (n = 3263) sets showed no significant differences in age, sex, education, recruitment source, GDS30 total score, or depression prevalence (all p > 0.05), confirming successful randomization.
Table 1.
Demographic and clinical characteristics of the study participants.
3.2. IRT Item Parameters and Cross-Version Comparison
Table 2 presents the IRT parameters for all 30 GDS items along with their inclusion status across four short form versions. Item 16 (‘Do you often feel downhearted and blue?’) demonstrated the highest discrimination (a = 2.53) and individual-item AUC (0.779) among all 30 items, yet it is conspicuously absent from GDS15 and all its derivative short forms. Notably, Item 16’s difficulty parameter (b = 0.60) indicates moderate endorsement frequency—not a rare symptom—suggesting its exclusion from GDS15 was likely due to perceived redundancy with other mood items rather than low endorsement. IRT analysis reveals this apparent ‘redundancy’ actually reflects Item 16’s role as a diagnostic anchor that maximally discriminates between depressed and non-depressed individuals.
Table 2.
Item-level psychometric properties, diagnostic utility, and composition across geriatric depression scale versions.
In contrast, GDS15 retains several items with notably low discrimination: Item 12 (‘prefer to stay at home’; a = 0.70), Item 14 (‘more memory problems than most’; a = 0.88), and Item 2 (‘dropped activities/interests’; a = 0.91). The mean discrimination of GDS15 items (a = 1.38) was approximately 30% lower than that of GDS10-IRT items (a = 1.96), indicating that GDS10-IRT comprises a more ‘elite’ set of high-performing items.
3.3. Sequential Item Reduction with Cross-Validation
Table 3 presents the sequential item reduction analysis with results from both development and validation sets, along with cross-validation statistics. The 10-item version (GDS10-IRT) was identified as the minimum version maintaining GDS30-equivalent performance, showing non-significant differences from GDS30 in both the development (p = 0.576) and validation (p = 0.396) sets. The cross-validation p-value (0.210) confirmed stable performance across samples without evidence of overfitting. At 9 items, statistically significant performance degradation emerged in the development set (p = 0.003); although the 9-item version showed a non-significant difference in the validation set (p = 0.298), the instability in the development set could compromise generalizability, establishing 10 items as the robust minimum for GDS30 equivalence. The 10th-ranked item by discrimination was Item 6 (‘feeling that life is empty’; a = 1.71). The test information function revealed an elbow point at 10 items: the marginal decrease in information from 10 to 9 items (Δ = 0.12) was substantially larger than the decrease from 11 to 10 items (Δ = 0.08), indicating diminishing returns beyond 10 items.
Table 3.
Sequential item reduction analysis with cross-validation.
A notable finding was that GDS15’s performance was significantly inferior to that of GDS30 in both the development (p = 0.012) and validation (p < 0.001) sets. This disadvantage, which was even more pronounced in the independent validation set, underscores the importance of cross-validation in scale development studies.
3.4. GDS10-IRT Item Composition
The final GDS10-IRT comprises Items 1, 4, 6, 10, 11, 16, 17, 21, 22, and 25 (see Table 2 for details). Internal consistency was excellent (Cronbach’s α = 0.83). The optimal cutoff score was ≥4, yielding sensitivity of 80.0% and specificity of 84.0% in the validation set. Four items not included in GDS15 (Items 6, 11, 16, 25) capture core dysphoria and the anxiety/agitation dimensions prominent in late-life depression.
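For practical use, scoring can be sketched as follows; the column names are assumptions, and responses are assumed to be pre-coded so that 1 indicates the depressive direction (several GDS items are keyed to ‘no’ in the original instrument).

```r
# Illustrative GDS10-IRT scoring function (not an official implementation).
gds10_items <- paste0("GDS", sprintf("%02d", c(1, 4, 6, 10, 11, 16, 17, 21, 22, 25)))

score_gds10 <- function(responses) {
  total <- rowSums(responses[, gds10_items])        # responses coded 1 = depressive answer
  data.frame(total = total,
             screen_positive = total >= 4)          # cutoff of >=4 reported in this study
}
```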
3.5. Screening Performance and Efficiency Comparison
Table 4 presents the comparison of sensitivity, specificity, and efficiency ratio across scales with statistical testing. Both GDS15 and GDS10-IRT showed sensitivity and specificity statistically equivalent to GDS30 (all McNemar p > 0.05), indicating that item reduction did not significantly compromise classification accuracy at optimal cutoffs.
Table 4.
Comparisons of Screening Performance and Efficiency.
However, the efficiency ratio, defined as AUC per item, differed substantially across scales. GDS10-IRT achieved an efficiency ratio of 0.085 in the validation set, significantly higher than both GDS30 (0.029; bootstrap p < 0.001) and GDS15 (0.057; bootstrap p < 0.001). This represents a 234% improvement over GDS30 and a 70% improvement over GDS15. Importantly, all metrics showed no significant differences between development and validation sets (all cross-validation p > 0.05), confirming stable performance without overfitting.
3.6. Differential Item Functioning
DIF analysis examined measurement invariance across sex, age group, and recruitment setting (Table 5). By sex, 4 of 10 items showed meaningful DIF: GDS06 (‘afraid something bad will happen’), GDS11 (‘wonderful to be alive’), GDS16 (‘downhearted and blue’), and GDS25 (‘feel like crying’) were more readily endorsed by women. Item 25 exhibited large DIF (ΔMH = 2.27, ETS Category C). By age group, 3 items showed meaningful DIF: GDS10 (‘memory problems’), GDS17 (‘feel worthless’), and GDS22 (‘situation is hopeless’) were more readily endorsed by older adults (≥75 years). By recruitment setting, 8 items showed DIF, with clinical participants endorsing items more readily than community participants. Despite item-level DIF, scale-level diagnostic performance (Differential Test Functioning) was equivalent across sex and age groups. AUC values were comparable between males (0.856; 95% CI: 0.786–0.925) and females (0.817; 95% CI: 0.778–0.856), with ΔAUC = 0.039. Similarly, young–old (0.839; 95% CI: 0.801–0.878) and old–old (0.832; 95% CI: 0.762–0.902) groups showed equivalent performance (ΔAUC = 0.007). The larger difference between community (0.861) and clinical (0.695) settings (ΔAUC = 0.166) reflects population differences in depression characteristics rather than measurement bias.
Table 5.
Differential Item Functioning (DIF) and Scale-Level Performance.
4. Discussion
In this study, we challenged the conventional assumption that more items necessarily yield better diagnostic accuracy in depression screening. By applying Item Response Theory to a large population-based Korean cohort with rigorous cross-validation, we demonstrated that 10 optimally selected items achieve diagnostic accuracy statistically equivalent to the full 30-item GDS, which is a finding with substantial implications for clinical practice and scale development methodology. The GDS10-IRT achieved an efficiency ratio of 0.097 (AUC/items) in the validation set, representing a 234% improvement over GDS30 (0.029) and a 70% improvement over GDS15 (0.057). Critically, all performance metrics showed no significant differences between development and validation sets (all cross-validation p > 0.05), confirming stable performance without overfitting.
4.1. Overcoming the Specificity Pitfalls of CTT-Based Short Forms
The superior efficiency of GDS10-IRT over GDS15 stems from its systematic exclusion of items that function as diagnostic ‘noise.’ A critical limitation shared by CTT-derived scales including the GDS15 and GDS10 is their retention of items based primarily on item-total correlations, which inherently favors moderately endorsed but non-specific symptoms over highly discriminating core features.
Specifically, existing CTT-based short forms retain items such as Item 14 (‘Do you feel you have more problems with memory than most?’) and Item 12 (‘Do you prefer to stay at home rather than going out and doing new things?’). As identified in our previous factor analytic study [15], these items load heavily on ‘Cognitive Inefficiency’ or ‘Social Withdrawal’ factors rather than core mood disturbance. In geriatric populations, subjective memory complaints and social withdrawal are frequently confounded by mild cognitive impairment, normal aging, or physical frailty that are highly prevalent regardless of depression status. Consequently, CTT-based scales that retain such items tend to exhibit reduced specificity due to elevated false-positive rates. Indeed, while Almeida’s CTT-based GDS10 [8] maintained high sensitivity (84.8%), its specificity (70.9%) was notably lower than our IRT-based version (84.0%), a pattern consistent with the inclusion of these cognitively confounded items.
By utilizing IRT discrimination parameters, we systematically filtered out these non-specific items. Items 12 (a = 0.70) and 14 (a = 0.88) exhibited notably low discrimination values, confirming their poor ability to differentiate depressed from non-depressed individuals. Through prioritizing items reflecting core dysphoria rather than cognitive or behavioral symptoms, GDS10-IRT achieves robust diagnostic equivalence to the full GDS30 despite having 33% fewer items than the GDS15, demonstrating that the widely accepted 15-item threshold was an artifact of CTT methodology rather than a true psychometric necessity.
Furthermore, the pattern of item exclusion across all abbreviated versions reinforces the need to minimize cognitive confounds. Items universally excluded from GDS15, GDS10, and GDS4 (e.g., Items 20, 26, 29, 30) generally reflect cognitive inefficiency or apathy rather than mood disturbance. For instance, Item 29 (‘ease of decision making’) exhibited the lowest discrimination parameter in our study (a = 0.20). The consistent rejection of these items across widely divergent methodologies—both CTT and IRT—confirms that cognitive symptoms function primarily as diagnostic noise in older adults. Additionally, items retained only in the GDS15 but dropped in all shorter versions (Items 7 and 23) demonstrated relatively modest discrimination (a ≈ 1.0–1.3), suggesting they offer marginal diagnostic value compared to the ‘elite’ items retained in the 10-item versions.
4.2. The Rediscovery of Item 16 and Cultural Context
Our identification of Item 16 (‘Do you often feel downhearted and blue?’) as the single most discriminating symptom (a = 2.53; AUC = 0.779) challenges both methodological conventions and cross-cultural assumptions. Previous cross-cultural research suggested that East Asian older adults might minimize direct expressions of sadness due to cultural stigma, with a tendency to suppress positive affect endorsement and show heightened interpersonal sensitivity [18]. Given these observations, one might expect somatic items to outperform mood items in Korean populations.
However, Item 16’s dominance contradicts this expectation. When respondents are queried directly about ‘downheartedness’, rendered in Korean as ‘울적하고 침울하다’ (wool-juk-ha-go chim-wool-ha-da), a phrase capturing a culturally resonant blend of melancholy and lethargy, the item functions as a trans-cultural core marker of depression. The Korean translation may possess particular salience because it employs indigenous emotional vocabulary rather than direct Western psychiatric terminology, potentially reducing cultural barriers to endorsement.
CTT methods likely excluded Item 16 from the GDS15 due to high multicollinearity with other mood items, misinterpreting its diagnostic potency as statistical redundancy. IRT analysis corrects this historical oversight by demonstrating that Item 16’s high correlation with depression status (precisely what makes it ‘redundant’ in CTT terms) is exactly what makes it the most powerful diagnostic anchor. This distinction between statistical redundancy (CTT perspective) and diagnostic information (IRT perspective) represents a fundamental methodological insight with implications beyond this specific scale.
4.3. Capturing Agitated Depression in Late Life
The GDS10-IRT’s inclusion of Item 11 (‘Do you often feel restless and fidgety?’) and Item 25 (‘Do you frequently feel like crying?’), both absent from the GDS15, reflects an important phenomenological consideration. Our previous factor analytic work classified these items under a distinct ‘Sad Mood and Agitation’ factor [15], distinguishing them from items measuring cognitive symptoms or social withdrawal.
Late-life depression frequently presents with agitation, irritability, and emotional lability rather than the classic retarded, anhedonic presentation more typical of younger adults. This ‘agitated depression’ phenotype is particularly common in the context of vascular pathology, white matter disease, and executive dysfunction—conditions highly prevalent in geriatric populations [19,20,21]. By retaining items that capture psychomotor agitation and emotional dysregulation, the GDS10-IRT may enhance detection of these non-melancholic depression presentations that might be missed by scales emphasizing only sadness or anhedonia.
Furthermore, four items included in GDS10-IRT but absent from GDS15 (Items 6, 11, 16, 25) collectively capture core dysphoria, restlessness, and emotional lability—symptom dimensions particularly relevant in Asian older adult populations where somatic and anxiety presentations of depression are common. This item configuration may explain the GDS10-IRT’s robust performance despite containing fewer items than GDS15.
4.4. Measurement Invariance and Clinical Implementation
DIF analysis revealed item-level variation across sex, age, and recruitment setting. Notably, women more readily endorsed emotional expression items (e.g., ‘feeling like crying’), consistent with known sex differences in depression phenomenology. Older adults more readily endorsed memory complaints and feelings of worthlessness, likely reflecting age-related concerns. Critically, despite item-level DIF, scale-level diagnostic performance remained equivalent across sex and age groups. This pattern of ‘compensatory DIF’—where items favoring one group are balanced by items favoring another—indicates that GDS10-IRT total scores can be validly compared across these subgroups without adjustment. However, the larger AUC difference by recruitment setting reflects true population differences in depression characteristics rather than measurement bias.
While ≥4 was identified as the optimal cut-off based on Youden’s J index, clinical context should guide threshold selection. In low-prevalence community screening, a higher cut-off (≥5) may be preferable to reduce false-positive burden. Conversely, in high-prevalence settings such as geriatric clinics or nursing homes (prevalence 15–30%) [22,23], a lower cut-off (≥3) maximizes sensitivity to minimize missed cases.
4.5. Limitations
Several limitations warrant consideration. First, the depression prevalence in our community sample (3.2% in KLOSCAD) is lower than typically reported in Western cohorts (8–15%). However, this class imbalance actually provides a more stringent test of specificity, as any scale maintaining high AUC under low-prevalence conditions demonstrates robust discriminative ability rather than inflated performance from base-rate effects. The consistent results across our community and clinical subsamples (prevalence 13.0%) further support generalizability across prevalence spectra. Regarding predictive values, while AUC demonstrated stability across development and validation sets (p = 0.584), the GDS10-IRT’s estimated positive predictive value (PPV) is approximately 14%—a common characteristic of screening tests in low-prevalence populations. Importantly, the high specificity ensures excellent negative predictive value (>99%), supporting efficient case identification. In higher-prevalence settings such as geriatric clinics (20%), PPV would increase to approximately 56%.
Second, the prominence of Item 16 (‘downhearted and blue’) as the highest-discriminating symptom may appear to contradict observations that Asian populations preferentially report somatic over mood symptoms [15,24,25]. However, our finding suggests that when directly queried using culturally adapted instruments, Korean older adults do endorse core dysphoric symptoms. This challenges oversimplified assumptions about cultural differences in depression expression and underscores the importance of psychometrically validated translations.
Third, while our findings derive from Korean older adults, the identification of ‘downheartedness’ and ‘emptiness’ as core discriminating features aligns with cross-cultural psychiatric consensus on depression phenomenology, suggesting these parameters likely translate to other populations. Nevertheless, external validation in Western and other Asian samples is warranted to confirm generalizability. Implementation in non-Korean populations should follow a staged validation approach, first testing the fit of Korean-derived IRT parameters before proceeding to population-specific recalibration if significant misfit is observed. Given that our items assess universal depression constructs, we anticipate reasonable cross-cultural stability, though cultural differences in emotional expression may require adjustment of specific parameters.
Fourth, diagnostic assessments utilized DSM-IV criteria; however, given the substantial continuity in core depressive symptom definitions across DSM editions [26], we anticipate minimal impact on the scale’s applicability to current diagnostic practice.
Finally, item selection focused exclusively on discrimination; future work might incorporate content validity considerations. Notably, the deliberate exclusion of cognitively confounded items (e.g., ‘memory problems’) may enhance the GDS10-IRT’s utility in cognitively impaired populations by focusing on core dysphoric symptoms rather than cognitive complaints that overlap with dementia. However, direct empirical validation is needed; future studies should examine GDS10-IRT performance across the cognitive spectrum, including MCI and mild-to-moderate dementia.
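The predictive-value figures discussed in the first limitation above follow directly from Bayes’ rule; the short check below uses the reported sensitivity (80%) and specificity (84%) with the community (3.2%) and clinic-level (20%) prevalences mentioned in the text.

```r
# Predictive values from sensitivity, specificity, and prevalence (Bayes' rule).
predictive_values <- function(sens, spec, prev) {
  ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
  npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PPV = ppv, NPV = npv)
}
predictive_values(0.80, 0.84, 0.032)  # PPV ~0.14, NPV > 0.99 (community prevalence)
predictive_values(0.80, 0.84, 0.20)   # PPV ~0.56 (clinic-level prevalence)
```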
5. Conclusions
IRT-based item selection achieves GDS30-equivalent diagnostic performance with only 10 items (a 67% reduction), and stable cross-validation confirms generalizability. The GDS10-IRT outperforms the GDS15 while using 33% fewer items. The highest-discriminating item (Item 16) is absent from GDS15, illustrating the cost of CTT-based abbreviation. These findings suggest that population-specific IRT optimization from complete item pools should become the standard approach for developing efficient, culturally sensitive screening instruments.
Author Contributions
Conceptualization, K.W.K.; methodology, K.W.K.; validation, K.W.K. and J.W.H.; formal analysis, K.W.K.; investigation, J.W.H., D.J.O., T.H.K., K.P.K., B.J.K., S.G.K., J.L.K., S.W.M., J.H.P., S.-H.R., J.C.Y., D.Y.L., D.W.L., S.B.L., J.J.L. and J.H.J.; resources, J.W.H., D.J.O., T.H.K., K.P.K., B.J.K., S.G.K., J.L.K., S.W.M., J.H.P., S.-H.R., J.C.Y., D.Y.L., D.W.L., S.B.L., J.J.L. and J.H.J.; data curation, J.W.H. and D.J.O.; writing—original draft preparation, K.W.K. and J.W.H.; writing—review and editing, All authors; project administration, K.W.K. and J.W.H.; funding acquisition, K.W.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Korean Health Technology R&D Project, Ministry of Health and Welfare, Republic of Korea (HI09C1379[A092077]) and the Research of Korea Centers for Disease Control and Prevention (2019-ER6201-01), and by a grant of the Korea Dementia Research Project through the Korea Dementia Research Center (KDRC), funded by the Ministry of Health & Welfare and Ministry of Science and ICT, Republic of Korea (grant number: RS-2023-KH135260).
Institutional Review Board Statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Seoul National University Bundang Hospital (IRB No. B-0912-089-010, Approval date: 14 January 2010).
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Blazer, D. Major depression in later life. Hosp. Pract. (Off. Ed.) 1989, 24, 69–76. [Google Scholar]
- Murray, C.J.; Lopez, A.D. Global mortality, disability, and the contribution of risk factors: Global Burden of Disease Study. Lancet 1997, 349, 1436–1442. [Google Scholar] [CrossRef]
- Yesavage, J.A.; Brink, T.L.; Rose, T.L.; Lum, O.; Huang, V.; Adey, M.; Leirer, V.O. Development and validation of a geriatric depression screening scale: A preliminary report. J. Psychiatr. Res. 1982, 17, 37–49. [Google Scholar] [CrossRef]
- Beck, A.T.; Ward, C.H.; Mendelson, M.; Mock, J.; Erbaugh, J. An inventory for measuring depression. Arch. Gen. Psychiatry 1961, 4, 561–571. [Google Scholar] [CrossRef]
- Radloff, L.S. The use of the Center for Epidemiologic Studies Depression Scale in adolescents and young adults. J. Youth Adolesc. 1991, 20, 149–166. [Google Scholar] [CrossRef]
- Yesavage, J.A.; Sheikh, J.I. 9/Geriatric depression scale (GDS) recent evidence and development of a shorter version. Clin. Gerontol. 1986, 5, 165–173. [Google Scholar] [CrossRef]
- D’Ath, P.; Katona, P.; Mullan, E.; Evans, S.; Katona, C. Screening, detection and management of depression in elderly primary care attenders. I: The acceptability and performance of the 15 item Geriatric Depression Scale (GDS15) and the development of short versions. Fam. Pract. 1994, 11, 260–266. [Google Scholar] [CrossRef] [PubMed]
- Almeida, O.P.; Almeida, S.A. Short versions of the geriatric depression scale: A study of their validity for the diagnosis of a major depressive episode according to ICD-10 and DSM-IV. Int. J. Geriatr. Psychiatry 1999, 14, 858–865. [Google Scholar] [CrossRef]
- Embretson, S.E.; Reise, S.P. Item Response Theory for Psychologists; Psychology Press: Mahwah, NJ, USA, 2013. [Google Scholar]
- Han, J.W.; Kim, T.H.; Kwak, K.P.; Kim, K.; Kim, B.J.; Kim, S.G.; Kim, J.L.; Kim, T.H.; Moon, S.W.; Park, J.Y.; et al. Overview of the Korean Longitudinal Study on Cognitive Aging and Dementia. Psychiatry Investig. 2018, 15, 767–774. [Google Scholar] [CrossRef] [PubMed]
- Han, J.W.; Oh, D.J.; Kim, T.H.; Kwak, K.P.; Kim, B.J.; Kim, S.G.; Kim, J.L.; Moon, S.W.; Park, J.H.; Ryu, S.H.; et al. Refining Western Dementia-Risk Paradigms: Evidence From a Decade of the Korean Longitudinal Study on Cognitive Aging and Dementia. J. Korean Med. Sci. 2025, 40, e326. [Google Scholar] [CrossRef]
- Sheehan, D.V.; Lecrubier, Y.; Sheehan, K.H.; Amorim, P.; Janavs, J.; Weiller, E.; Hergueta, T.; Baker, R.; Dunbar, G.C. The Mini-International Neuropsychiatric Interview (M.I.N.I.): The development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. J. Clin. Psychiatry 1998, 59, 22–33. [Google Scholar]
- Lee, J.H.; Lee, K.U.; Lee, D.Y.; Kim, K.W.; Jhoo, J.H.; Kim, J.H.; Lee, K.H.; Kim, S.Y.; Han, S.H.; Woo, J.I. Development of the Korean version of the Consortium to Establish a Registry for Alzheimer’s Disease Assessment Packet (CERAD-K): Clinical and neuropsychological assessment batteries. J. Gerontol. B Psychol. Sci. Soc. Sci. 2002, 57, P47–P53. [Google Scholar] [CrossRef] [PubMed]
- American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders; American Psychiatric Association: Washington, DC, USA, 1994. [Google Scholar]
- Kim, J.Y.; Park, J.H.; Lee, J.J.; Huh, Y.; Lee, S.B.; Han, S.K.; Choi, S.W.; Lee, D.Y.; Kim, K.W.; Woo, J.I. Standardization of the korean version of the geriatric depression scale: Reliability, validity, and factor structure. Psychiatry Investig. 2008, 5, 232–238. [Google Scholar] [CrossRef]
- DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 837–845. [Google Scholar] [CrossRef] [PubMed]
- Hanley, J.A.; McNeil, B.J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983, 148, 839–843. [Google Scholar] [CrossRef]
- Lee, J.J.; Kim, K.W.; Kim, T.H.; Park, J.H.; Lee, S.B.; Park, J.W.; McQuoid, D.R.; Steffens, D.C. Cross-cultural considerations in administering the center for epidemiologic studies depression scale. Gerontology 2011, 57, 455–461. [Google Scholar] [CrossRef] [PubMed]
- Alexopoulos, G.S.; Meyers, B.S.; Young, R.C.; Campbell, S.; Silbersweig, D.; Charlson, M. ‘Vascular depression’ hypothesis. Arch. Gen. Psychiatry 1997, 54, 915–922. [Google Scholar] [CrossRef]
- Park, J.H.; Lee, S.B.; Lee, T.J.; Lee, D.Y.; Jhoo, J.H.; Youn, J.C.; Choo, I.H.; Choi, E.A.; Jeong, J.W.; Choe, J.Y.; et al. Depression in vascular dementia is quantitatively and qualitatively different from depression in Alzheimer’s disease. Dement. Geriatr. Cogn. Disord. 2007, 23, 67–73. [Google Scholar] [CrossRef]
- Park, J.H.; Lee, S.B.; Lee, J.J.; Yoon, J.C.; Han, J.W.; Kim, T.H.; Jeong, H.G.; Newhouse, P.A.; Taylor, W.D.; Kim, J.H.; et al. Epidemiology of MRI-defined vascular depression: A longitudinal, community-based study in Korean elders. J. Affect. Disord. 2015, 180, 200–206. [Google Scholar] [CrossRef]
- Blazer, D.G. Depression in late life: Review and commentary. J. Gerontol. A Biol. Sci. Med. Sci. 2003, 58, 249–265. [Google Scholar] [CrossRef]
- Seitz, D.; Purandare, N.; Conn, D. Prevalence of psychiatric disorders among older adults in long-term care homes: A systematic review. Int. Psychogeriatr. 2010, 22, 1025–1039. [Google Scholar] [CrossRef] [PubMed]
- Kerr, L.K.; Kerr, L.D., Jr. Screening tools for depression in primary care: The effects of culture, gender, and somatic symptoms on the detection of depression. West. J. Med. 2001, 175, 349. [Google Scholar] [CrossRef] [PubMed]
- Jang, Y.; Kim, G.; Chiriboga, D. Acculturation and manifestation of depressive symptoms among Korean-American older adults. Aging Ment. Health 2005, 9, 500–507. [Google Scholar] [CrossRef]
- Uher, R.; Payne, J.L.; Pavlova, B.; Perlis, R.H. Major depressive disorder in DSM-5: Implications for clinical practice and research of changes from DSM-IV. Depress. Anxiety 2014, 31, 459–471. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.