1. Introduction
Review articles serve as essential resources for practicing psychiatrists by synthesizing the published literature to build conceptual models and identify prevailing trends, risk factors, disease mechanisms, and strategies for effective clinical care for mental illness. Their importance has grown significantly in the past two decades, coinciding with the exponential increase in scientific publications driven by internet proliferation. A persistent limitation of traditional review articles, however, is the potential for authorial bias—authors may selectively cite studies that align with their preconceived hypotheses or conceptual frameworks. To mitigate this, systematic reviews impose predefined protocols for querying bibliographic databases and use manual screening to determine article relevance. Despite these efforts, subjectivity in study inclusion/exclusion decisions remains a concern, potentially introducing implicit bias into the review.
Natural Language Processing (NLP) and text mining have emerged as powerful tools for analyzing large-scale unstructured textual data [1,2]. One such text mining technique, topic modeling [3], notably Latent Dirichlet Allocation (LDA) [4], enables the automatic discovery of latent thematic structures from large text corpora, facilitating the exploration, classification, and summarization of vast amounts of textual data.
Despite these advances, a key research gap remains. Existing studies primarily focus on applying NLP techniques to analyze the psychiatric literature, but relatively few have systematically evaluated the ability of NLP-based approaches to classify research articles based solely on abstract-level information or directly compared their performance with that of human reviewers of varying expertise levels. Furthermore, the variability among human reviewers and its implications for the reliability of systematic reviews have not been sufficiently explored.
To address this gap, this study investigates the use of topic modeling in text mining to classify psychiatric research articles related to transcranial magnetic stimulation and autism into treatment and non-treatment categories. Specifically, this study aims to answer the following research questions (RQs):
RQ-1: How consistent are human reviewers in classifying psychiatric research articles?
RQ-2: How does an NLP-based classification approach compare to human reviewers in terms of consistency?
RQ-3: How does the reviewer’s background influence classification performance?
We compare the performance of an automated classification approach with that of human reviewers from diverse professional backgrounds and experiences. By examining both inter-rater and intra-rater reliability, variability, and consistency across reviewers, we contribute new insights into the role of NLP in supporting and partially automating systematic review processes.
The remainder of this paper is organized as follows:
Section 2 presents a review of related work;
Section 3 describes the methodology and results;
Section 4 presents interpretations of the results;
Section 5 describes the limitations of the study; and
Section 6 presents the conclusions and future research directions.
2. Background and Literature Review
NLP has seen widespread adoption across a range of disciplines. Notable applications include sentiment analysis of social media content [5,6], analysis of the medical literature [7], and evaluation of customer feedback in business contexts [8]. These diverse uses underscore NLP’s versatility in transforming complex textual datasets into actionable knowledge. LDA is a probabilistic model that represents documents as mixtures of latent topics. Its variants are in widespread use for a variety of purposes, including social media analysis, digital humanities, and systematic reviews of the scientific literature [9].
NLP and text mining are increasingly used in psychiatric and biomedical research to analyze large volumes of textual data and to support clinical decision-making [10], literature reviews, and sentiment analysis [11]. Current applications of text mining in this domain include healthcare and bioinformatics, where it extracts meaningful insights from large datasets [12]. However, despite their growing use, little work has systematically evaluated NLP-based classification against human reviewers, particularly in clinical psychiatry, and methodological guidance for applying and evaluating topic modeling in applied research contexts remains limited [13]. Few studies have systematically evaluated the capacity of NLP tools to classify research articles solely from abstract content and compared their performance with that of traditional manual review.
We were particularly interested in articles on Autism Spectrum Disorder (ASD). ASD is a neurodevelopmental condition with a steadily increasing prevalence [14] and significant implications for public health [15]. The body of literature addressing ASD treatments continues to expand [16]. As of this writing, a PubMed (https://pubmed.ncbi.nlm.nih.gov/, accessed on 20 August 2022) search for “autism spectrum disorder” yields over 30,000 results. Given the growing population of individuals with ASD requiring clinical care, it is increasingly important for clinicians to have efficient tools to navigate the vast and complex literature to identify clinically relevant information.
To date, the application of machine learning within ASD research has focused predominantly on genetic analysis and diagnostic classification, while comparatively few studies have addressed its potential for evaluating new treatment approaches, such as transcranial magnetic stimulation [17]. This paper therefore aims to address that gap by exploring the use of machine-learning and text-mining techniques to identify ASD-related publications specifically relevant to treatment and clinical practice, distinguishing them from those centered on genetics, diagnosis, or underlying physiology.
In the context of systematic reviews, several studies have explored the use of NLP techniques to support literature screening and classification. These approaches aim to reduce the time and effort required for manual review while improving consistency. However, most existing work focuses on supervised machine-learning models that require labeled datasets, such as Support Vector Machines (SVMs) [18], logistic regression, and neural network-based approaches.
Comparatively fewer studies have examined unsupervised approaches, such as topic modeling, for classification tasks in systematic reviews, particularly when labeled data are unavailable. Moreover, limited research has systematically compared machine-based classification with human reviewers across different backgrounds, experiences, and expertise levels.
Another important research direction involves probabilistic and uncertainty-estimation approaches in text classification, which aim to quantify prediction confidence and improve interpretability [19,20,21]. These approaches highlight the importance of reliability and variability in automated systems, especially in high-stakes domains such as clinical research.
Despite these advances, there remains a lack of studies that jointly examine (1) human variability, (2) machine consistency, and (3) the interaction between reviewer expertise and classification outcomes. This study addresses these gaps by providing a comparative analysis of human and NLP-based classification approaches in the context of clinical psychiatry.
3. Methods and Results
PubMed was selected as the data source due to its comprehensive coverage of the biomedical and psychiatric literature, ensuring the relevance and quality of retrieved articles. The search terms used were “transcranial magnetic stimulation” and “autism,” which yielded 170 articles. These were retrieved with the RISmed package of R (https://www.r-project.org/, accessed on 20 August 2022) to create a data frame containing the PMID (PubMed Identifier, the unique identifier assigned to each article indexed in PubMed) and abstract. Of these, 9 records had a PMID but no abstract, and 1 article was a correction of an existing article, leaving 160 abstracts for analysis. Each PMID and corresponding abstract were then randomized and classified into “Treatment” versus “Non-Treatment” categories by a computer algorithm (topic modeling) and by 4 human reviewers, consisting of 2 psychiatrists (MW and DW), a biostatistician (MP), and a medical student (AH), who were blind to the computer algorithm’s output and to each other’s classifications. Text mining and the details of the study were overseen by a computer scientist (CK).
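The retrieval step can be sketched in R with RISmed; this is a minimal example, and the exact query string and retmax setting are assumptions rather than the study’s recorded settings:

```r
# Minimal sketch of the PubMed retrieval step using the RISmed package.
# The query string and retmax value are assumptions, not the study's exact settings.
library(RISmed)

search  <- EUtilsSummary("transcranial magnetic stimulation AND autism",
                         type = "esearch", db = "pubmed", retmax = 200)
records <- EUtilsGet(search)

# Assemble a data frame of PMIDs and abstracts.
articles <- data.frame(PMID     = PMID(records),
                       Abstract = AbstractText(records),
                       stringsAsFactors = FALSE)

# Drop records that have a PMID but no abstract; corrections are removed manually.
articles <- articles[nchar(trimws(articles$Abstract)) > 0, ]
```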
All documents included in this study were processed and analyzed using SAS® Enterprise Miner [22] and Text Miner [23] version 15.2 to extract underlying topics and themes from the textual data. Topic modeling was implemented using SAS Text Miner, which employs an LDA-based approach to identify latent thematic structures. This unsupervised method was selected to enable classification without requiring labeled training data. The software clusters similar documents based on term frequency across and within documents, allowing for meaningful grouping of content [24]. Term frequency was used during preprocessing to capture the relative importance of terms within and across documents. The workflow included several preprocessing steps: text parsing, text filtering, and topic extraction. During parsing, parts of speech, noun phrases, and multi-word expressions were identified, and tokenization was performed to break the text into individual units. Standard NLP techniques, such as stemming, lemmatization, synonym normalization, and stop-word removal, were applied to reduce noise and redundancy, and term weighting and frequency adjustments were configured to streamline the feature set. Finally, topic modeling was applied to identify prominent topics within the unstructured text corpus.
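Although the analysis itself was carried out in SAS Text Miner, an analogous preprocessing pipeline can be approximated in open-source R with the tm package; the sketch below is illustrative only, and choices such as the English stop-word list and the sparsity threshold are assumptions rather than the SAS settings:

```r
# Illustrative R approximation of the preprocessing steps (not the SAS Text Miner workflow).
# Continues from the 'articles' data frame built in the retrieval sketch above.
library(tm)
library(SnowballC)  # required for stemming via stemDocument()

corpus <- VCorpus(VectorSource(articles$Abstract))

# Cleaning steps: lower-casing, punctuation/number removal, stop words, stemming.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix of raw term frequencies; very sparse terms are trimmed
# to streamline the feature set (the 0.99 threshold is an assumption).
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, sparse = 0.99)
```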
A previous study [25] has shown that an abstract can sufficiently represent the content of the full article. Based on this finding, we used the 160 abstracts as input to the text-mining software to generate the corpus topics for analysis. Default parameters were used as a baseline configuration. By default, the software produced 25 topics; however, many of these topics contained overlapping terms. We therefore refined the model iteratively, re-running SAS Text Miner and adjusting parameters (the top keyword terms according to term weights) until we arrived at a reduced set of four unique topics (Table 1), each with a distinct set of keywords and no overlapping terms, to improve interpretability and enable a clear binary classification. Themes were then assigned to these topics based on the semantic interpretation of their keywords. Topics containing terms related to therapeutic interventions, clinical procedures, or treatment outcomes (e.g., rTMS, session, patient) were categorized as “Treatment”. In contrast, topics focusing on mechanisms, neurophysiology, or theoretical constructs (e.g., mirror, system, plasticity) were categorized as “Non-Treatment.”
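For readers without SAS, the topic-extraction step can be sketched with the topicmodels package in R; this is a hedged approximation of the final four-topic configuration, and the random seed and the choice of ten displayed keywords are assumptions:

```r
# Illustrative four-topic LDA fit in R (topicmodels), approximating the final SAS configuration.
# Builds on the 'dtm' object from the preprocessing sketch above.
library(topicmodels)

dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]     # LDA requires non-empty documents

lda_fit <- LDA(dtm, k = 4, control = list(seed = 1234))

# Top keyword terms per topic, used for the manual "Treatment" vs. "Non-Treatment" theme assignment.
terms(lda_fit, 10)

# Per-document topic proportions (weights), used below for assignment and tie-breaking.
doc_topics <- posterior(lda_fit)$topics       # one row per abstract, one column per topic
```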
Once the four topics were assigned to either the “Treatment” or “Non-Treatment” theme, we categorized each abstract according to the topics it was associated with. If an abstract was classified under Topics 1 or 2, it was categorized as “Treatment.” Conversely, if it was associated with Topics 3 or 4, it was labeled as “Non-Treatment.” When an abstract was assigned to multiple topics from both themes, we followed a majority rule: it was assigned to the theme with the greater number of associated topics. For instance, if an abstract was linked to Topics 1, 3, and 4, it was classified as “Non-Treatment,” since two of the three topics fall under that category.
In the event of a tie, such as an abstract with equal numbers of “Treatment” and “Non-Treatment” topics, we used the topic weights generated by SAS Text Miner to resolve the classification. The abstract was assigned to the theme with the higher cumulative topic weight. For example, if the combined weights of Topics 1 and 2 equaled 0.5, and the combined weights of Topics 3 and 4 equaled 0.3, the abstract would be classified as “Treatment.” Topic classification was also performed by the four human reviewers: two psychiatrists (DW and MW), a research biostatistician (MP), and a medical student (AH). Based on the topic keywords generated by SAS Text Miner, reviewers were asked to classify each abstract as “Treatment” or “Non-Treatment,” as shown in Table 1.
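The assignment and tie-breaking rules can be expressed compactly in R; the sketch below follows the majority rule and weight-based tie-break described above, but the 0.25 membership cutoff used to decide which topics an abstract is “associated with” is an assumption, since SAS Text Miner applies its own membership criterion:

```r
# Sketch of the abstract-level assignment rule: majority of associated topics,
# with cumulative topic weight as the tie-breaker. Uses 'doc_topics' from the LDA sketch.
treatment_topics     <- c(1, 2)
non_treatment_topics <- c(3, 4)

classify_abstract <- function(weights, cutoff = 0.25) {  # cutoff is an assumption
  member <- which(weights >= cutoff)                     # topics the abstract is associated with
  n_trt  <- sum(member %in% treatment_topics)
  n_non  <- sum(member %in% non_treatment_topics)
  if (n_trt > n_non) return("Treatment")
  if (n_non > n_trt) return("Non-Treatment")
  # Tie: compare cumulative topic weights across the two themes.
  if (sum(weights[treatment_topics]) >= sum(weights[non_treatment_topics])) {
    "Treatment"
  } else {
    "Non-Treatment"
  }
}

machine_labels <- apply(doc_topics, 1, classify_abstract)
```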
To assess the agreement between the algorithm and the human expert classifications, we calculated standard evaluation metrics, including accuracy, precision (positive predictive value), sensitivity, and specificity (Table 2). The rationale for treating the computer algorithm as a reference model lies in its deterministic and reproducible classification process, which eliminates intra-rater variability and provides a consistent baseline for comparison across human reviewers. Cohen’s Kappa statistics were used to measure inter-rater reliability between the computer-generated and human-labeled classifications (Table 3).
Figure 1 provides a visual comparison of agreement levels between human reviewers and the reference model, illustrating variability across reviewers while remaining consistently above chance. The Kappa statistic, also known as Cohen’s Kappa, is a commonly used metric for assessing inter-rater agreement between two evaluators (pairwise comparisons), in this case the computer algorithm and a human expert, while adjusting for agreement that could occur by chance [26]. Kappa values range from −1 to 1, where 1 represents perfect agreement, 0 indicates agreement equivalent to chance, and negative values reflect worse-than-chance agreement. Generally, values above 0.60 are considered substantial, and values above 0.80 suggest near-perfect agreement.
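A compact R sketch of these agreement calculations is given below; it treats the machine labels as the reference and uses a randomly generated placeholder vector in place of a reviewer’s actual labels, so the numbers it produces are not the study’s results:

```r
# Agreement between one human reviewer and the reference model (illustrative).
# 'human_labels' is a random placeholder standing in for a reviewer's blinded classifications.
lvls <- c("Treatment", "Non-Treatment")
human_labels <- factor(sample(lvls, length(machine_labels), replace = TRUE), levels = lvls)
machine_ref  <- factor(machine_labels, levels = lvls)

tab <- table(Human = human_labels, Machine = machine_ref)
tp <- tab["Treatment", "Treatment"];     fp <- tab["Treatment", "Non-Treatment"]
fn <- tab["Non-Treatment", "Treatment"]; tn <- tab["Non-Treatment", "Non-Treatment"]

accuracy    <- (tp + tn) / sum(tab)
precision   <- tp / (tp + fp)   # positive predictive value
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Cohen's Kappa between the reviewer and the reference model.
library(irr)
kappa2(data.frame(human = human_labels, machine = machine_ref))
```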
Approximately 3 months after the initial classification of the 160 abstracts, we randomly selected 25 abstracts from the original set for reassessment. This subset of 25 abstracts was chosen as a practical and exploratory sample to assess intra-rater reliability over time while minimizing reviewer burden and ensuring the feasibility of reclassification. Each human reviewer was asked to reclassify these abstracts as either “Treatment” or “Non-Treatment.” We then used Cohen’s Kappa statistics to evaluate intra-rater reliability, measuring the consistency of each reviewer’s classifications over time. In essence, this analysis examines how consistently each reviewer classified the same abstracts after a 3-month interval (see the second row of Table 4, intra-rater correlation). Table 4 also highlights, alongside the intra-rater correlations, each reviewer’s inter-rater agreement with the other reviewers, drawn from Table 3.
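The intra-rater computation mirrors the inter-rater one, only comparing a reviewer’s two rounds of labels; the vectors in the sketch below are hypothetical placeholders, not the study data:

```r
# Intra-rater reliability sketch: Cohen's Kappa between one reviewer's initial labels
# and their labels for the same abstracts 3 months later. Placeholder data only.
library(irr)

labels_round1 <- c("Treatment", "Non-Treatment", "Treatment", "Treatment", "Non-Treatment")
labels_round2 <- c("Treatment", "Non-Treatment", "Non-Treatment", "Treatment", "Non-Treatment")

kappa2(data.frame(time1 = labels_round1, time2 = labels_round2))
```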
The Kappa scores for agreement between the computer and the human reviewers indicate varying levels of agreement (Table 3): AH (0.64) and DW (0.69) demonstrate substantial agreement, while MP (0.44) and MW (0.40) exhibit moderate agreement. The results in Table 4 show that AH (the medical student) had the highest intra-rater reliability, with a Kappa score of 0.82. DW (psychiatrist) had a moderate level of agreement over time, with a Kappa of 0.43, while MP (biostatistician) and MW (child psychiatrist) had slightly lower consistency, with Kappa scores of 0.40 and 0.38, respectively.
Between reviewers, the Kappa statistics indicate that AH and DW have the highest inter-rater agreement (0.73), reflecting substantial agreement. In contrast, MP and especially MW show lower agreement with the other human raters and with the computer, suggesting more variability in their classifications. Although all Kappa values were positive, indicating better-than-chance consistency, AH demonstrated the highest reliability.
In addition to reporting Cohen’s Kappa statistics, we computed approximate 95% confidence intervals (CIs) for the agreement between each human reviewer and the reference model to assess the precision of the estimated agreement. Because the contingency tables were not fully retained for exact variance estimation, we employed a Wald-type approximation based on the observed agreement ($P_o$) and the inferred expected agreement ($P_e$). Specifically, $P_e$ was derived from the reported Kappa values using the relationship $\kappa = (P_o - P_e)/(1 - P_e)$, and the standard error of Kappa was approximated as $SE(\kappa) \approx \sqrt{P_o(1 - P_o)/\left[N(1 - P_e)^2\right]}$, where $N$ denotes the number of abstracts. The 95% confidence intervals were then computed as $\hat{\kappa} \pm 1.96 \times SE(\kappa)$. These intervals should be interpreted as approximate estimates.
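This approximation can be reproduced with a small R helper; the observed-agreement value in the example call below is a hypothetical illustration, not a value reported in the study:

```r
# Wald-type approximate 95% CI for Cohen's Kappa, following the approximation described above.
# P_e is inferred from kappa and the observed agreement P_o.
kappa_wald_ci <- function(kappa, p_o, n, conf = 0.95) {
  p_e <- (p_o - kappa) / (1 - kappa)                 # from kappa = (P_o - P_e) / (1 - P_e)
  se  <- sqrt(p_o * (1 - p_o) / (n * (1 - p_e)^2))   # Wald-type standard error
  z   <- qnorm(1 - (1 - conf) / 2)
  c(lower = kappa - z * se, upper = kappa + z * se)
}

# Hypothetical example: kappa = 0.64 with an assumed observed agreement of 0.82 over 160 abstracts.
kappa_wald_ci(kappa = 0.64, p_o = 0.82, n = 160)
```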
The approximate 95% confidence intervals for the reviewer-versus-reference-model Kappa values were as follows: AH, 0.517–0.763; DW, 0.576–0.804; MP, 0.297–0.583; and MW, 0.264–0.536, indicating that agreement ranged from moderate to substantial across reviewers while remaining consistently above chance levels.
4. Interpretations of Results
This study evaluates the consistency and variability of human reviewers compared with an NLP-based classification approach for systematic review tasks in clinical psychiatry. Overall, the results indicate that while human reviewers can achieve moderate to substantial agreement with the reference model, there is notable variability both across reviewers (inter-rater) and within reviewers over time (intra-rater).
The performance metrics presented in Table 2 demonstrate that all human reviewers achieved accuracy levels above chance, with values ranging from 0.68 to 0.85. Similarly, Cohen’s Kappa values indicate moderate to substantial agreement with the reference model, suggesting that human classification performance is meaningful and not random. However, the observed variability across reviewers highlights differences in classification behavior, which may be influenced by factors such as training, experience, and individual interpretation strategies.
The inclusion of approximate 95% confidence intervals for the Kappa statistics provides additional insight into the precision and stability of agreement estimates. The intervals for AH and DW fall predominantly within the moderate-to-substantial agreement range, suggesting relatively stable classification performance. In contrast, the intervals for MP and MW extend toward lower agreement ranges, indicating greater variability. Importantly, all confidence intervals lie above zero, confirming that agreement between human reviewers and the reference model is consistently better than chance.
Inter-rater reliability analysis further demonstrates that agreement between reviewers varies across pairs. Higher agreement between certain reviewers (e.g., AH and DW) suggests greater alignment in classification criteria or interpretation, whereas lower agreement among other pairs reflects differences in judgment or classification approach. These findings reinforce the importance of considering variability in human decision-making when conducting systematic reviews.
The intra-rater reliability results reveal additional variability in classification consistency over time. Notably, AH exhibited higher intra-rater reliability compared to other reviewers, while DW, MP, and MW showed lower consistency levels. This pattern suggests that classification decisions may not be entirely stable, even for the same individual, particularly when based on limited abstract-level information.
One possible explanation for these findings is that reviewers may apply different decision-making strategies when interpreting abstracts. Less experienced reviewers may rely on simpler and more consistent heuristics, while more experienced reviewers may incorporate a broader range of contextual knowledge, potentially introducing additional variability. However, it is important to emphasize that these interpretations are hypothesis-generating and should not be considered definitive causal explanations. The study design does not directly measure cognitive processes, and further research is required to systematically examine the role of expertise, bias, and decision-making strategies in classification performance.
Overall, the findings suggest that while human reviewers can achieve meaningful classification performance, their decisions are subject to variability. In contrast, the NLP-based approach provides deterministic and reproducible classifications, offering advantages in consistency and scalability. These results support the use of computational methods as complementary tools in systematic review workflows, particularly in contexts involving large volumes of literature.
At the same time, it is important to recognize that the reference model is not a ground truth and does not guarantee correctness. Rather, it serves as a consistent baseline for comparison. Therefore, the results should be interpreted in terms of relative agreement and consistency, rather than absolute classification accuracy.
5. Limitations
This study has several strengths, including the use of widely accessible analytical tools, a multidisciplinary research team, and a clearly defined and practically relevant classification task. These characteristics enhance the generalizability of the study design and suggest that the overall framework may be applicable across a range of research settings.
At the same time, several methodological limitations should be acknowledged. First, the analysis was based on article abstracts rather than full-text content. While abstracts provide a concise summary, they may omit important contextual details that could influence classification decisions. This limitation may have contributed to variability in both human and machine classifications. Future research should incorporate full-text analysis to provide a more comprehensive basis for evaluation and to better assess the capabilities of automated approaches.
Second, the absence of an externally validated ground truth limits the interpretation of classification performance. In this study, the computational model serves as a reference model, rather than an absolute benchmark of correctness. Accordingly, the reported metrics should be interpreted in terms of relative agreement and consistency, rather than definitive accuracy.
Third, the intra-rater reliability analysis was conducted on a relatively small subset of abstracts, which limits the statistical robustness and generalizability of those findings. In addition, confidence intervals reported for agreement measures are approximate, reflecting the absence of complete contingency table data. These results should therefore be considered exploratory. Future studies should employ larger reclassification samples and retain full contingency data to enable more precise statistical estimation.
Fourth, interpretations related to differences in reviewer performance, such as potential influences of expertise, cognitive bias, or heuristic decision-making, are inferred from observed patterns rather than directly measured variables. As such, these interpretations should be viewed as hypothesis-generating and interpreted with caution. Future work should incorporate controlled experimental designs or cognitive assessment methods to more rigorously examine these factors.
Finally, while computational approaches offer advantages in terms of consistency and reproducibility, they are not without limitations. Models trained on incomplete or biased data may replicate or amplify existing biases. Therefore, ongoing evaluation, validation, and refinement are essential to ensure that automated systems maintain their benefits without introducing unintended sources of error.
6. Discussion, Conclusions, and Future Research
This study demonstrates the potential of NLP and text mining techniques to support systematic review processes in clinical psychiatry. Rather than focusing solely on classification accuracy, this work emphasizes the evaluation of consistency and variability in classification decisions, providing a complementary perspective to traditional performance-based assessments.
The results show that human reviewers exhibit measurable inter-rater and intra-rater variability, even among trained professionals. While agreement levels were generally above chance and, in some cases, substantial, the observed variability highlights the influence of individual interpretation and decision-making strategies in classification tasks. In contrast, the NLP-based approach produced deterministic and reproducible classifications, offering advantages in terms of consistency and scalability.
Importantly, the computational model in this study is interpreted as a reference model rather than a ground truth, as no externally validated benchmark was available. Consequently, the reported performance metrics should be understood in terms of relative agreement and consistency, rather than absolute classification accuracy. This distinction is essential for appropriately contextualizing the results and avoiding overinterpretation.
A key contribution of this study lies in demonstrating that variability among human reviewers is not negligible and may have practical implications for the reliability of systematic reviews. The findings suggest that differences in reviewer background and experience can influence classification outcomes, reinforcing the importance of structured and reproducible methodologies in evidence synthesis. At the same time, the results highlight the potential for NLP-based tools to serve as supportive mechanisms that enhance consistency and reduce variability in large-scale literature screening.
The interpretation of differences in performance across reviewers, including potential influences of expertise, cognitive bias, and heuristic decision-making, should be approached with caution. These factors were not directly measured in this study and are inferred from observed patterns in the data. Therefore, such interpretations should be considered hypothesis-generating, rather than definitive explanations. Future research should employ controlled experimental designs or cognitive assessment frameworks to more rigorously investigate these effects.
Several limitations of this study should be acknowledged. First, the use of abstracts rather than full-text articles may limit contextual depth and contribute to classification variability for both human and machine reviewers. Second, the absence of an externally validated ground truth prevents definitive conclusions about classification correctness. Third, the relatively small sample size used for intra-rater reliability analysis limits the statistical robustness of those estimates. Additionally, the confidence intervals reported for Kappa statistics are approximate, reflecting the absence of full contingency table data. These limitations suggest that the findings should be interpreted as exploratory and provide a foundation for future investigation.
Future research can extend this work in several directions. First, incorporating full-text analysis may improve classification performance and reduce ambiguity. Second, exploring multiple topic modeling configurations and systematic hyperparameter tuning could enhance model robustness and interpretability. Third, the inclusion of additional agreement metrics, such as Fleiss’ Kappa or Krippendorff’s Alpha, would provide a more comprehensive evaluation of multi-rater agreement. Fourth, integrating supervised learning approaches and comparing them with unsupervised methods may offer further insights into classification performance. Finally, the incorporation of visualizations, such as workflow diagrams of the NLP pipeline, could enhance both the interpretability and transparency of the analytical process.
Emerging approaches, including uncertainty-aware models and large language models (LLMs), such as ChatGPT [27], present promising opportunities to enhance automated literature review processes. These methods may provide improved contextual understanding and probabilistic confidence estimates, addressing some of the limitations of traditional topic modeling techniques.
Overall, this study supports the role of NLP as a complementary tool in systematic review workflows. Rather than replacing human expertise, automated methods can enhance efficiency, consistency, and scalability. The integration of computational approaches with expert judgment represents a promising direction for improving the reliability and reproducibility of systematic reviews in clinical psychiatry and related fields.
By systematically examining both human variability and machine consistency, this study provides a practical and scalable framework for integrating computational methods into systematic review workflows, contributing to more reproducible and reliable evidence synthesis in data-intensive research domains.