1. Introduction
Review articles serve as essential resources for practicing psychiatrists by synthesizing the published literature to build conceptual models and identify prevailing trends, risk factors, disease mechanisms, and strategies for effective clinical care for mental illness. Their importance has grown significantly in the past two decades, coinciding with the exponential increase in scientific publications driven by internet proliferation. A persistent limitation of traditional review articles, however, is the potential for authorial bias—authors may selectively cite studies that align with their preconceived hypotheses or conceptual frameworks. To mitigate this, systematic reviews impose predefined protocols for querying bibliographic databases and use manual screening to determine article relevance. Despite these efforts, subjectivity in study inclusion/exclusion decisions remains a concern, potentially introducing implicit bias into the review.
Natural Language Processing (NLP) and text mining have emerged as powerful tools for analyzing large-scale unstructured textual data [1,2]. One such text mining technique, topic modeling [3], notably Latent Dirichlet Allocation (LDA) [4], enables the automatic discovery of latent thematic structures from large text corpora, facilitating the exploration, classification, and summarization of vast amounts of textual data.
Despite these advances, a key research gap remains. Existing studies primarily focus on applying NLP techniques to analyze the psychiatric literature, but relatively few have systematically evaluated the ability of NLP-based approaches to classify research articles based solely on abstract-level information or directly compared their performance with that of human reviewers of varying expertise levels. Furthermore, the variability among human reviewers and its implications for the reliability of systematic reviews have not been sufficiently explored.
To address this gap, this study investigates the use of topic modeling in text mining to classify psychiatric research articles related to transcranial magnetic stimulation and autism into treatment and non-treatment categories. Specifically, this study aims to answer the following research questions (RQs):
RQ-1: How consistent are human reviewers in classifying psychiatric research articles?
RQ-2: How does an NLP-based classification approach compare to human reviewers in terms of consistency?
RQ-3: How does the reviewer’s background influence classification performance?
We compare the performance of an automated classification approach with that of human reviewers from diverse professional backgrounds and experiences. By examining both inter-rater and intra-rater reliability, variability, and consistency across reviewers, we contribute new insights into the role of NLP in supporting and partially automating systematic review processes.
The remainder of this paper is organized as follows:
Section 2 presents a review of related work;
Section 3 describes the methodology and results;
Section 4 presents interpretations of the results;
Section 5 describes the limitations of the study; and
Section 6 presents the conclusions and future research directions.
2. Background and Literature Review
NLP has seen widespread adoption across a range of disciplines. Notable applications include sentiment analysis of social media content [5,6], analysis of the medical literature [7], and evaluation of customer feedback in business contexts [8]. These diverse uses underscore NLP’s versatility in transforming complex textual datasets into actionable knowledge. LDA is a probabilistic model that represents documents as mixtures of latent topics. Its variants are in widespread use for a variety of purposes, including social media analysis, digital humanities, and systematic reviews of the scientific literature [9].
NLP and text mining are increasingly used in psychiatric and biomedical research to analyze large volumes of textual data and to support clinical decision-making [10], literature reviews, and sentiment analysis [11]. Current applications of text mining in this domain include healthcare and bioinformatics, where it extracts meaningful insights from large datasets [12]. However, despite their growing use, little work has systematically evaluated NLP-based classification against human reviewers, particularly in clinical psychiatry, and methodological guidance for applying and evaluating topic modeling in applied research contexts remains limited [13]. Few studies have systematically evaluated the capacity of NLP tools to classify research articles solely from abstract content and compared their performance with that of traditional manual review.
We were particularly interested in articles on Autism Spectrum Disorder (ASD). ASD is a neurodevelopmental condition with a steadily increasing prevalence [14] and significant implications for public health [15]. The body of literature addressing ASD treatments continues to expand [16]. As of this writing, a PubMed (https://pubmed.ncbi.nlm.nih.gov/, accessed on 20 August 2022) search for “autism spectrum disorder” yields over 30,000 results. Given the growing population of individuals with ASD requiring clinical care, it is increasingly important for clinicians to have efficient tools to navigate the vast and complex literature to identify clinically relevant information.
To date, the application of machine learning within ASD research has focused predominantly on genetic analysis and diagnostic classification, while comparatively few studies have addressed its potential for evaluating new treatment approaches, such as transcranial magnetic stimulation [17]. This paper therefore aims to address that gap by exploring the use of machine-learning and text-mining techniques to identify ASD-related publications specifically relevant to treatment and clinical practice, distinguishing them from those centered on genetics, diagnosis, or underlying physiology.
In the context of systematic reviews, several studies have explored the use of NLP techniques to support literature screening and classification. These approaches aim to reduce the time and effort required for manual review while improving consistency. However, most existing work focuses on supervised machine-learning models that require labeled datasets, such as Support Vector Machines (SVMs) [18], logistic regression, and neural network-based approaches.
Comparatively fewer studies have examined unsupervised approaches, such as topic modeling, for classification tasks in systematic reviews, particularly when labeled data are unavailable. Moreover, limited research has systematically compared machine-based classification with human reviewers across different backgrounds, experiences, and expertise levels.
Another important research direction involves probabilistic and uncertainty-estimation approaches in text classification, which aim to quantify prediction confidence and improve interpretability [19,20,21]. These approaches highlight the importance of reliability and variability in automated systems, especially in high-stakes domains such as clinical research.
Despite these advances, there remains a lack of studies that jointly examine (1) human variability, (2) machine consistency, and (3) the interaction between reviewer expertise and classification outcomes. This study addresses these gaps by providing a comparative analysis of human and NLP-based classification approaches in the context of clinical psychiatry.
3. Methods and Results
PubMed was selected as the data source due to its comprehensive coverage of the biomedical and psychiatric literature, ensuring the relevance and quality of retrieved articles. The search terms used were “transcranial magnetic stimulation” and “autism,” which yielded 170 articles. These were retrieved with the RISmed package of R (https://www.r-project.org/, accessed on 20 August 2022) to create a data frame containing the PMID (PubMed Identifier, the unique identifier assigned to each article indexed in PubMed) and abstract. Of these, 9 records had a PMID but no abstract, and 1 article was a correction of an existing article, leaving 160 abstracts for analysis. Each PMID and corresponding abstract were then randomized and classified into “Treatment” versus “Non-Treatment” categories by a computer algorithm (topic modeling) and by 4 human reviewers, consisting of 2 psychiatrists (MW and DW), a biostatistician (MP), and a medical student (AH), who were blind to the computer algorithm’s output and to each other’s classifications. Text mining and the details of the study were overseen by a computer scientist (CK).
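The retrieval step can be sketched in R with RISmed; this is a minimal example, and the exact query string and retmax setting are assumptions rather than the study’s recorded settings:

```r
# Minimal sketch of the PubMed retrieval step using the RISmed package.
# The query string and retmax value are assumptions, not the study's exact settings.
library(RISmed)

search  <- EUtilsSummary("transcranial magnetic stimulation AND autism",
                         type = "esearch", db = "pubmed", retmax = 200)
records <- EUtilsGet(search)

# Assemble a data frame of PMIDs and abstracts.
articles <- data.frame(PMID     = PMID(records),
                       Abstract = AbstractText(records),
                       stringsAsFactors = FALSE)

# Drop records that have a PMID but no abstract; corrections are removed manually.
articles <- articles[nchar(trimws(articles$Abstract)) > 0, ]
```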
All documents included in this study were processed and analyzed using SAS® Enterprise Miner [22] and Text Miner [23] version 15.2 to extract underlying topics and themes from the textual data. Topic modeling was implemented using SAS Text Miner, which employs an LDA-based approach to identify latent thematic structures. This unsupervised method was selected to enable classification without requiring labeled training data. The software clusters similar documents based on term frequency across and within documents, allowing for meaningful grouping of content [24]. Term frequency was used during preprocessing to capture the relative importance of terms within and across documents. The workflow included several preprocessing steps: text parsing, text filtering, and topic extraction. During parsing, parts of speech, noun phrases, and multi-word expressions were identified, and tokenization was performed to break the text into individual units. Standard NLP techniques, such as stemming, lemmatization, synonym normalization, and stop-word removal, were applied to reduce noise and redundancy, and term weighting and frequency adjustments were configured to streamline the feature set. Finally, topic modeling was applied to identify prominent topics within the unstructured text corpus.
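Although the analysis itself was carried out in SAS Text Miner, an analogous preprocessing pipeline can be approximated in open-source R with the tm package; the sketch below is illustrative only, and choices such as the English stop-word list and the sparsity threshold are assumptions rather than the SAS settings:

```r
# Illustrative R approximation of the preprocessing steps (not the SAS Text Miner workflow).
# Continues from the 'articles' data frame built in the retrieval sketch above.
library(tm)
library(SnowballC)  # required for stemming via stemDocument()

corpus <- VCorpus(VectorSource(articles$Abstract))

# Cleaning steps: lower-casing, punctuation/number removal, stop words, stemming.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix of raw term frequencies; very sparse terms are trimmed
# to streamline the feature set (the 0.99 threshold is an assumption).
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, sparse = 0.99)
```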
A previous study [25] has shown that an abstract can sufficiently represent the content of the full article. Based on this finding, we used the 160 abstracts as input to the text-mining software to generate the corpus topics for analysis. Default parameters were used as a baseline configuration. By default, the software produced 25 topics; however, many of these topics contained overlapping terms. We therefore refined the model iteratively, re-running SAS Text Miner and adjusting parameters (the top keyword terms according to term weights) until we arrived at a reduced set of four unique topics (Table 1), each with a distinct set of keywords and no overlapping terms, to improve interpretability and enable a clear binary classification. Themes were then assigned to these topics based on the semantic interpretation of their keywords. Topics containing terms related to therapeutic interventions, clinical procedures, or treatment outcomes (e.g., rTMS, session, patient) were categorized as “Treatment”. In contrast, topics focusing on mechanisms, neurophysiology, or theoretical constructs (e.g., mirror, system, plasticity) were categorized as “Non-Treatment.”
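For readers without SAS, the topic-extraction step can be sketched with the topicmodels package in R; this is a hedged approximation of the final four-topic configuration, and the random seed and the choice of ten displayed keywords are assumptions:

```r
# Illustrative four-topic LDA fit in R (topicmodels), approximating the final SAS configuration.
# Builds on the 'dtm' object from the preprocessing sketch above.
library(topicmodels)

dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]     # LDA requires non-empty documents

lda_fit <- LDA(dtm, k = 4, control = list(seed = 1234))

# Top keyword terms per topic, used for the manual "Treatment" vs. "Non-Treatment" theme assignment.
terms(lda_fit, 10)

# Per-document topic proportions (weights), used below for assignment and tie-breaking.
doc_topics <- posterior(lda_fit)$topics       # one row per abstract, one column per topic
```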
Once the four topics were assigned to either the “Treatment” or “Non-Treatment” theme, we categorized each abstract according to the topics it was associated with. If an abstract was classified under Topics 1 or 2, it was categorized as “Treatment.” Conversely, if it was associated with Topics 3 or 4, it was labeled as “Non-Treatment.” When an abstract was assigned to multiple topics from both themes, we followed a majority rule: it was assigned to the theme with the greater number of associated topics. For instance, if an abstract was linked to Topics 1, 3, and 4, it was classified as “Non-Treatment,” since two of the three topics fall under that category.
In the event of a tie, such as an abstract with equal numbers of “Treatment” and “Non-Treatment” topics, we used the topic weights generated by SAS Text Miner to resolve the classification. The abstract was assigned to the theme with the higher cumulative topic weight. For example, if the combined weights of Topics 1 and 2 equaled 0.5, and the combined weights of Topics 3 and 4 equaled 0.3, the abstract would be classified as “Treatment.” Topic classification was also performed by the four human reviewers: two psychiatrists (DW and MW), a research biostatistician (MP), and a medical student (AH). Based on the topic keywords generated by SAS Text Miner, reviewers were asked to classify each abstract as “Treatment” or “Non-Treatment,” as shown in Table 1.
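The assignment and tie-breaking rules can be expressed compactly in R; the sketch below follows the majority rule and weight-based tie-break described above, but the 0.25 membership cutoff used to decide which topics an abstract is “associated with” is an assumption, since SAS Text Miner applies its own membership criterion:

```r
# Sketch of the abstract-level assignment rule: majority of associated topics,
# with cumulative topic weight as the tie-breaker. Uses 'doc_topics' from the LDA sketch.
treatment_topics     <- c(1, 2)
non_treatment_topics <- c(3, 4)

classify_abstract <- function(weights, cutoff = 0.25) {  # cutoff is an assumption
  member <- which(weights >= cutoff)                     # topics the abstract is associated with
  n_trt  <- sum(member %in% treatment_topics)
  n_non  <- sum(member %in% non_treatment_topics)
  if (n_trt > n_non) return("Treatment")
  if (n_non > n_trt) return("Non-Treatment")
  # Tie: compare cumulative topic weights across the two themes.
  if (sum(weights[treatment_topics]) >= sum(weights[non_treatment_topics])) {
    "Treatment"
  } else {
    "Non-Treatment"
  }
}

machine_labels <- apply(doc_topics, 1, classify_abstract)
```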
To assess the agreement between the algorithm and the human expert classifications, we calculated standard evaluation metrics, including accuracy, precision (positive predictive value), sensitivity, and specificity (Table 2). The rationale for treating the computer algorithm as a reference model lies in its deterministic and reproducible classification process, which eliminates intra-rater variability and provides a consistent baseline for comparison across human reviewers. Cohen’s Kappa statistics were used to measure inter-rater reliability between the computer-generated and human-labeled classifications (Table 3).
Figure 1 provides a visual comparison of agreement levels between human reviewers and the reference model, illustrating variability across reviewers while remaining consistently above chance. The Kappa statistic, also known as Cohen’s Kappa, is a commonly used metric for assessing inter-rater agreement between two evaluators (pairwise comparisons), in this case the computer algorithm and a human expert, while adjusting for agreement that could occur by chance [26]. Kappa values range from −1 to 1, where 1 represents perfect agreement, 0 indicates agreement equivalent to chance, and negative values reflect worse-than-chance agreement. Generally, values above 0.60 are considered substantial, and values above 0.80 suggest near-perfect agreement.
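A compact R sketch of these agreement calculations is given below; it treats the machine labels as the reference and uses a randomly generated placeholder vector in place of a reviewer’s actual labels, so the numbers it produces are not the study’s results:

```r
# Agreement between one human reviewer and the reference model (illustrative).
# 'human_labels' is a random placeholder standing in for a reviewer's blinded classifications.
lvls <- c("Treatment", "Non-Treatment")
human_labels <- factor(sample(lvls, length(machine_labels), replace = TRUE), levels = lvls)
machine_ref  <- factor(machine_labels, levels = lvls)

tab <- table(Human = human_labels, Machine = machine_ref)
tp <- tab["Treatment", "Treatment"];     fp <- tab["Treatment", "Non-Treatment"]
fn <- tab["Non-Treatment", "Treatment"]; tn <- tab["Non-Treatment", "Non-Treatment"]

accuracy    <- (tp + tn) / sum(tab)
precision   <- tp / (tp + fp)   # positive predictive value
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Cohen's Kappa between the reviewer and the reference model.
library(irr)
kappa2(data.frame(human = human_labels, machine = machine_ref))
```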
Approximately 3 months after the initial classification of the 160 abstracts, we randomly selected 25 abstracts from the original set for reassessment. This subset of 25 abstracts was chosen as a practical and exploratory sample to assess intra-rater reliability over time while minimizing reviewer burden and ensuring the feasibility of reclassification. Each human reviewer was asked to reclassify these abstracts as either “Treatment” or “Non-Treatment.” We then used Cohen’s Kappa statistics to evaluate intra-rater reliability, measuring the consistency of each reviewer’s classifications over time. In essence, this analysis examines how consistently each reviewer classified the same abstracts after a 3-month interval (see the second row of Table 4, intra-rater correlation). Table 4 also highlights, alongside the intra-rater correlations, each reviewer’s inter-rater agreement with the other reviewers, drawn from Table 3.
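The intra-rater computation mirrors the inter-rater one, only comparing a reviewer’s two rounds of labels; the vectors in the sketch below are hypothetical placeholders, not the study data:

```r
# Intra-rater reliability sketch: Cohen's Kappa between one reviewer's initial labels
# and their labels for the same abstracts 3 months later. Placeholder data only.
library(irr)

labels_round1 <- c("Treatment", "Non-Treatment", "Treatment", "Treatment", "Non-Treatment")
labels_round2 <- c("Treatment", "Non-Treatment", "Non-Treatment", "Treatment", "Non-Treatment")

kappa2(data.frame(time1 = labels_round1, time2 = labels_round2))
```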
The Kappa scores for agreement between the computer and the human reviewers indicate varying levels of agreement (Table 3): AH (0.64) and DW (0.69) demonstrate substantial agreement, while MP (0.44) and MW (0.40) exhibit moderate agreement. The results in Table 4 show that AH (the medical student) had the highest intra-rater reliability, with a Kappa score of 0.82. DW (psychiatrist) had a moderate level of agreement over time, with a Kappa of 0.43, while MP (biostatistician) and MW (child psychiatrist) had slightly lower consistency, with Kappa scores of 0.40 and 0.38, respectively.
Between reviewers, the Kappa statistics indicate that AH and DW have the highest inter-rater agreement (0.73), reflecting substantial agreement. In contrast, MP and especially MW show lower agreement with the other human raters and with the computer, suggesting more variability in their classifications. Although all Kappa values were positive, indicating better-than-chance consistency, AH demonstrated the highest reliability.
In addition to reporting Cohen’s Kappa statistics, we computed approximate 95% confidence intervals (CIs) for the agreement between each human reviewer and the reference model to assess the precision of the estimated agreement. Because the contingency tables were not fully retained for exact variance estimation, we employed a Wald-type approximation based on the observed agreement ($P_o$) and the inferred expected agreement ($P_e$). Specifically, $P_e$ was derived from the reported Kappa values using the relationship $\kappa = (P_o - P_e)/(1 - P_e)$, and the standard error of Kappa was approximated as $SE(\kappa) \approx \sqrt{P_o(1 - P_o)/\left[N(1 - P_e)^2\right]}$, where $N$ denotes the number of abstracts. The 95% confidence intervals were then computed as $\hat{\kappa} \pm 1.96 \times SE(\kappa)$. These intervals should be interpreted as approximate estimates.
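This approximation can be reproduced with a small R helper; the observed-agreement value in the example call below is a hypothetical illustration, not a value reported in the study:

```r
# Wald-type approximate 95% CI for Cohen's Kappa, following the approximation described above.
# P_e is inferred from kappa and the observed agreement P_o.
kappa_wald_ci <- function(kappa, p_o, n, conf = 0.95) {
  p_e <- (p_o - kappa) / (1 - kappa)                 # from kappa = (P_o - P_e) / (1 - P_e)
  se  <- sqrt(p_o * (1 - p_o) / (n * (1 - p_e)^2))   # Wald-type standard error
  z   <- qnorm(1 - (1 - conf) / 2)
  c(lower = kappa - z * se, upper = kappa + z * se)
}

# Hypothetical example: kappa = 0.64 with an assumed observed agreement of 0.82 over 160 abstracts.
kappa_wald_ci(kappa = 0.64, p_o = 0.82, n = 160)
```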
The approximate 95% confidence intervals for the reviewer-versus-reference-model Kappa values were as follows: AH, 0.517–0.763; DW, 0.576–0.804; MP, 0.297–0.583; and MW, 0.264–0.536, indicating that agreement ranged from moderate to substantial across reviewers while remaining consistently above chance levels.
4. Interpretations of Results
This study evaluates the consistency and variability of human reviewers compared with an NLP-based classification approach for systematic review tasks in clinical psychiatry. Overall, the results indicate that while human reviewers can achieve moderate to substantial agreement with the reference model, there is notable variability both across reviewers (inter-rater) and within reviewers over time (intra-rater).
The performance metrics presented in Table 2 demonstrate that all human reviewers achieved accuracy levels above chance, with values ranging from 0.68 to 0.85. Similarly, Cohen’s Kappa values indicate moderate to substantial agreement with the reference model, suggesting that human classification performance is meaningful and not random. However, the observed variability across reviewers highlights differences in classification behavior, which may be influenced by factors such as training, experience, and individual interpretation strategies.
The inclusion of approximate 95% confidence intervals for the Kappa statistics provides additional insight into the precision and stability of agreement estimates. The intervals for AH and DW fall predominantly within the moderate-to-substantial agreement range, suggesting relatively stable classification performance. In contrast, the intervals for MP and MW extend toward lower agreement ranges, indicating greater variability. Importantly, all confidence intervals lie above zero, confirming that agreement between human reviewers and the reference model is consistently better than chance.
Inter-rater reliability analysis further demonstrates that agreement between reviewers varies across pairs. Higher agreement between certain reviewers (e.g., AH and DW) suggests greater alignment in classification criteria or interpretation, whereas lower agreement among other pairs reflects differences in judgment or classification approach. These findings reinforce the importance of considering variability in human decision-making when conducting systematic reviews.
The intra-rater reliability results reveal additional variability in classification consistency over time. Notably, AH exhibited higher intra-rater reliability compared to other reviewers, while DW, MP, and MW showed lower consistency levels. This pattern suggests that classification decisions may not be entirely stable, even for the same individual, particularly when based on limited abstract-level information.
One possible explanation for these findings is that reviewers may apply different decision-making strategies when interpreting abstracts. Less experienced reviewers may rely on simpler and more consistent heuristics, while more experienced reviewers may incorporate a broader range of contextual knowledge, potentially introducing additional variability. However, it is important to emphasize that these interpretations are hypothesis-generating and should not be considered definitive causal explanations. The study design does not directly measure cognitive processes, and further research is required to systematically examine the role of expertise, bias, and decision-making strategies in classification performance.
Overall, the findings suggest that while human reviewers can achieve meaningful classification performance, their decisions are subject to variability. In contrast, the NLP-based approach provides deterministic and reproducible classifications, offering advantages in consistency and scalability. These results support the use of computational methods as complementary tools in systematic review workflows, particularly in contexts involving large volumes of literature.
At the same time, it is important to recognize that the reference model is not a ground truth and does not guarantee correctness. Rather, it serves as a consistent baseline for comparison. Therefore, the results should be interpreted in terms of relative agreement and consistency, rather than absolute classification accuracy.
5. Limitations
This study has several strengths, including the use of widely accessible analytical tools, a multidisciplinary research team, and a clearly defined and practically relevant classification task. These characteristics enhance the generalizability of the study design and suggest that the overall framework may be applicable across a range of research settings.
At the same time, several methodological limitations should be acknowledged. First, the analysis was based on article abstracts rather than full-text content. While abstracts provide a concise summary, they may omit important contextual details that could influence classification decisions. This limitation may have contributed to variability in both human and machine classifications. Future research should incorporate full-text analysis to provide a more comprehensive basis for evaluation and to better assess the capabilities of automated approaches.
Second, the absence of an externally validated ground truth limits the interpretation of classification performance. In this study, the computational model serves as a reference model, rather than an absolute benchmark of correctness. Accordingly, the reported metrics should be interpreted in terms of relative agreement and consistency, rather than definitive accuracy.
Third, the intra-rater reliability analysis was conducted on a relatively small subset of abstracts, which limits the statistical robustness and generalizability of those findings. In addition, confidence intervals reported for agreement measures are approximate, reflecting the absence of complete contingency table data. These results should therefore be considered exploratory. Future studies should employ larger reclassification samples and retain full contingency data to enable more precise statistical estimation.
Fourth, interpretations related to differences in reviewer performance, such as potential influences of expertise, cognitive bias, or heuristic decision-making, are inferred from observed patterns rather than directly measured variables. As such, these interpretations should be viewed as hypothesis-generating and interpreted with caution. Future work should incorporate controlled experimental designs or cognitive assessment methods to more rigorously examine these factors.
Finally, while computational approaches offer advantages in terms of consistency and reproducibility, they are not without limitations. Models trained on incomplete or biased data may replicate or amplify existing biases. Therefore, ongoing evaluation, validation, and refinement are essential to ensure that automated systems maintain their benefits without introducing unintended sources of error.
6. Discussion, Conclusions, and Future Research
This study demonstrates the potential of NLP and text mining techniques to support systematic review processes in clinical psychiatry. Rather than focusing solely on classification accuracy, this work emphasizes the evaluation of consistency and variability in classification decisions, providing a complementary perspective to traditional performance-based assessments.
The results show that human reviewers exhibit measurable inter-rater and intra-rater variability, even among trained professionals. While agreement levels were generally above chance and, in some cases, substantial, the observed variability highlights the influence of individual interpretation and decision-making strategies in classification tasks. In contrast, the NLP-based approach produced deterministic and reproducible classifications, offering advantages in terms of consistency and scalability.
Importantly, the computational model in this study is interpreted as a reference model rather than a ground truth, as no externally validated benchmark was available. Consequently, the reported performance metrics should be understood in terms of relative agreement and consistency, rather than absolute classification accuracy. This distinction is essential for appropriately contextualizing the results and avoiding overinterpretation.
A key contribution of this study lies in demonstrating that variability among human reviewers is not negligible and may have practical implications for the reliability of systematic reviews. The findings suggest that differences in reviewer background and experience can influence classification outcomes, reinforcing the importance of structured and reproducible methodologies in evidence synthesis. At the same time, the results highlight the potential for NLP-based tools to serve as supportive mechanisms that enhance consistency and reduce variability in large-scale literature screening.
The interpretation of differences in performance across reviewers, including potential influences of expertise, cognitive bias, and heuristic decision-making, should be approached with caution. These factors were not directly measured in this study and are inferred from observed patterns in the data. Therefore, such interpretations should be considered hypothesis-generating, rather than definitive explanations. Future research should employ controlled experimental designs or cognitive assessment frameworks to more rigorously investigate these effects.
Several limitations of this study should be acknowledged. First, the use of abstracts rather than full-text articles may limit contextual depth and contribute to classification variability for both human and machine reviewers. Second, the absence of an externally validated ground truth prevents definitive conclusions about classification correctness. Third, the relatively small sample size used for intra-rater reliability analysis limits the statistical robustness of those estimates. Additionally, the confidence intervals reported for Kappa statistics are approximate, reflecting the absence of full contingency table data. These limitations suggest that the findings should be interpreted as exploratory and provide a foundation for future investigation.
Future research can extend this work in several directions. First, incorporating full-text analysis may improve classification performance and reduce ambiguity. Second, exploring multiple topic modeling configurations and systematic hyperparameter tuning could enhance model robustness and interpretability. Third, the inclusion of additional agreement metrics, such as Fleiss’ Kappa or Krippendorff’s Alpha, would provide a more comprehensive evaluation of multi-rater agreement. Fourth, integrating supervised learning approaches and comparing them with unsupervised methods may offer further insights into classification performance. Finally, the incorporation of visualizations, such as workflow diagrams of the NLP pipeline, could enhance both the interpretability and transparency of the analytical process.
Emerging approaches, including uncertainty-aware models and large language models (LLMs), such as ChatGPT [27], present promising opportunities to enhance automated literature review processes. These methods may provide improved contextual understanding and probabilistic confidence estimates, addressing some of the limitations of traditional topic modeling techniques.
Overall, this study supports the role of NLP as a complementary tool in systematic review workflows. Rather than replacing human expertise, automated methods can enhance efficiency, consistency, and scalability. The integration of computational approaches with expert judgment represents a promising direction for improving the reliability and reproducibility of systematic reviews in clinical psychiatry and related fields.
By systematically examining both human variability and machine consistency, this study provides a practical and scalable framework for integrating computational methods into systematic review workflows, contributing to more reproducible and reliable evidence synthesis in data-intensive research domains.