Article

Evaluating Psychological Competency via Chinese Q&A in Large Language Models

1 Lab of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
2 Shanghai Institute for AI Education, East China Normal University, Shanghai 200062, China
3 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
4 School of Psychology, Shanghai Jiao Tong University, Shanghai 200030, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9089; https://doi.org/10.3390/app15169089
Submission received: 28 July 2025 / Revised: 14 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025

Abstract

Recently, the application of large language models (LLMs) in psychology has gained increasing attention. However, their psychological competence still requires further investigation. This study explores this issue through the lens of Chinese psychological knowledge question answering (QA). Specifically, we constructed a dedicated dataset based on Chinese qualification examinations for psychological counselors and psychotherapists. Subsequently, we evaluated dense, Mixture-of-Experts, and reasoning LLMs with varying parameter sizes and evaluation modes in the Chinese context, measuring answer accuracy in both closed-ended and open-ended settings. The experimental results showed that larger and more recent LLMs achieved higher accuracy in psychological QA. While few-shot learning led to improvements in accuracy, Chain-of-Thought prompting and reasoning LLMs provided only limited gains. Notably, LLMs achieved higher accuracy in closed-ended settings than in open-ended ones. Furthermore, error analysis indicated that LLMs can produce incorrect or hallucinated responses, primarily due to insufficient psychological knowledge and conceptual confusion. Although current LLMs show promise in psychological QA tasks, users should remain cautious about over-reliance on their responses. A complementary, human-AI collaborative approach is recommended for practical use.

1. Introduction

Recent advances in large language models (LLMs), such as ChatGPT [1], GPT-4o [2], and Qwen [3,4], have shown notable capabilities in programming [5,6,7], question answering [8,9,10], and psychology-related tasks [11,12], serving as helpful assistants for humans. After training on large-scale corpora, LLMs can recall knowledge [13] and generate human-like, conversational responses [14]. They have shown potential in specialized fields and professional qualification exams, including those in law [15], neurosurgery [16], and medicine [17]. Consequently, LLMs are increasingly reshaping how learners access and engage with knowledge in educational contexts [18,19]. While LLMs offer promising support, concerns remain regarding their risk of hallucinated content [20,21], outdated or inaccurate information [22,23], and inconsistent performance across model updates [24]. Moreover, LLMs rely on probabilistic next-token prediction, leading to plausible but incorrect answers [13]. Unwitting reliance on flawed outputs may mislead users, impairing their judgment and potentially resulting in poor decision-making [20,25,26]. Therefore, it is essential to understand the reliability of LLMs prior to their adoption in sensitive contexts such as psychological applications.
With the rapid development of LLMs, their implementations span diverse psychological areas [14], including psychological healthcare chatbots [11,27], sentiment analysis [28,29], and mental health assessments [12,30]. While these applications demonstrate great potential, the capabilities of LLMs for psychological applications raise critical concerns about their practical competence and limitations in this domain. In response to these concerns, researchers have begun evaluating LLMs using GPT-4-generated psychological questions [31], after-school exercises extracted from psychology textbooks [32], and questions from psychology and teacher qualification exams [33]. These studies provide important guidance for the development of LLMs in the psychological domain. Nonetheless, they still leave room for further exploration. For example, comparative analyses across LLM architectures, such as dense models, mixture-of-experts (MoE), and reasoning models, are still relatively limited. Additionally, aspects such as model scale, version updates, inference strategies, and open-ended QA settings have not yet been systematically examined. Investigating these questions is key to enabling users to identify reliable models and to implement effective inference strategies in real-world settings. This underscores the need for deeper exploration of the psychological capabilities of language models.
To address this critical gap, we present a large-scale, comprehensive evaluation to assess LLMs’ psychological competence within the Chinese context. This evaluation is based on a dedicated QA dataset comprising over 6,000 high-quality question–choice–answer pairs, derived from Chinese professional examinations for psychological counselors and psychotherapists. We evaluate a diverse range of LLMs in the Chinese context, including dense models (e.g., Qwen3, GLM-4), MoE models (e.g., Qwen3-235B-A22B, DeepSeek-V3), and reasoning models (e.g., DeepSeek-R1, o4-mini), across various parameter scales (e.g., 7B, 14B, 32B, 72B) and under both closed-ended and open-ended settings. Furthermore, we examine the impact of few-shot learning, Chain-of-Thought (CoT) prompting, and temperature settings on psychological QA performance by comparing win rates across mode combinations. Finally, we conduct an in-depth error analysis of LLMs based on question-level pass rates to identify common error types and their potential causes in the psychological domain. In summary, our main contributions are as follows:
  • We constructed a large-scale Chinese psychological QA dataset comprising over 6000 high-quality psychological QA items from professional practice exams for psychological counselors and psychotherapists. This dataset covers a broad spectrum of knowledge relevant to clinical psychological practice and mental disorder diagnosis.
  • We evaluated diverse LLMs, including dense, MoE, and reasoning models across various parameter scales (7B to 235B) and both closed-ended and open-ended settings. Experimental results suggest that larger and newer models generally achieve higher accuracy, but performance gains are not always consistent across open-ended settings. MoE models outperform dense models when active parameters are comparable. Reasoning models provide limited accuracy gains but incur significantly higher inference latency.
  • We explored the impact of inference strategies on psychological QA using a win-rate comparison. Experimental results showed that few-shot learning consistently improved performance, while CoT offered limited benefits and increased latency. Lower temperatures improved accuracy in closed-ended QA but had negligible effects in open-ended settings.
  • We conducted a fine-grained error analysis by ranking question-level pass rates. We identified three major error types: factual errors, reasoning errors, and decision errors. Our findings highlight that LLMs still struggle with some basic psychological factual knowledge, psychometric reasoning, and interpreting nuanced behavioral scenarios, reflecting limitations in psychological knowledge coverage and contextual understanding.

2. Related Work

2.1. LLMs

LLMs have rapidly demonstrated remarkable abilities in language understanding, generation, and multi-modal tasks [34,35]. Since the introduction of GPT [36] and its successors [1,2], LLMs have been widely deployed in applications, including dialogue systems [37], programming [7], education [38], and healthcare [39]. Their generalization ability and versatility position them to play a central role in modern AI development [40]. With each new generation, LLMs continue to evolve and improve. Researchers have significantly scaled up pre-training corpora and model sizes, while also diversifying data sources. Qwen3 increased its training data from 3 trillion tokens to 36 trillion, incorporating plain text, PDF-extracted text, and synthetic data [4]. DeepSeek-V3 was trained on 14.8 trillion high-quality tokens [41], and GLM-4-0414 on 15 trillion [42]. They attempt to improve LLMs’ reasoning, robustness, generalization, and domain-specific performance by leveraging larger-scale, higher-quality, and more diverse training corpora, ranging from common-sense and expert knowledge to multilingual and multimodal data. Meanwhile, the LLM landscape has expanded to a variety of model architectures and usage paradigms. Early models were predominantly dense [43,44,45], but the rise of MoE models brought new efficiency-performance tradeoffs by activating only part of the model during inference [4,41]. More recently, specialized reasoning language models such as Qwen3-Think [4], DeepSeek-R1 [46], and o4-mini [47] have been introduced, explicitly optimized for tasks involving multi-step reasoning and planning. In parallel, new inference strategies like CoT [48], few-shot learning [49], and temperature scaling [50] have gained attention for their impact on performance and stability. As the number of available LLMs and reasoning strategies continues to grow, it has become increasingly complex to identify suitable models or inference strategies for a given task. This motivates systematic evaluations that compare LLMs across diverse dimensions such as model architecture, parameter scale, and inference strategies. Our work follows this direction by conducting a comprehensive evaluation of LLMs on a psychological QA dataset, offering insights into model performance and error patterns across diverse settings.

2.2. Evaluations of LLMs

Given the broad applicability of LLMs, comprehensive evaluation is essential to gain insights into their potential limitations and future advancements [51,52]. LLM evaluation is a complex task that can be approached from multiple dimensions, such as natural language processing, understanding, generation, and factuality [53]. We focus on the dimension of factuality, in line with researchers assessing the reliability and accuracy of LLMs in QA tasks from diverse disciplines [38,54,55]. For example, C-Eval evaluated LLMs on QA tasks drawn from STEM, humanities, social science, and additional subject areas [54], revealing that the evaluated LLMs still exhibited notable limitations in these fields. Moreover, researchers have found that some advanced LLMs show the potential to outperform human test-takers on multiple-choice examinations, such as the Multistate Bar Exam [15] and the neurosurgery written board exams [16]. However, only a few studies have examined how well LLMs understand psychological knowledge and answer questions within this domain. For example, ConceptPsy [31] summarized psychological concepts from the Chinese national post-graduate entrance examination and prompted GPT-4 to generate psychological questions for evaluation. PsycoLLM [32] constructed a high-quality multi-source psychological dataset to assess LLMs’ capability in mental health dialog and knowledge-based QA. Similarly, CPsyExam [33] collected psychological questions from Chinese examinations, including the psychology subject. Unlike previous research, we assess the psychological competence of LLMs using two Chinese psychological qualification practice exams from the perspectives of a psychological counselor and a psychotherapist. We systematically evaluate a wider spectrum of LLMs and inference strategies to assess their effectiveness. Moreover, we adopt open-ended settings aligned with scenarios where users inquire using only question stems without any answer options.

3. PsyFactQA: A Psychological QA Benchmark Dataset

To investigate the potential of LLMs in answering psychological questions, we constructed a psychological QA dataset derived from two Chinese qualification practice exams for psychological counselors and psychotherapists. The psychological counselor examination is a national vocational qualification test for individuals engaged in or preparing to enter the field of psychological counseling [56]. The psychotherapist examination is a qualification exam aimed at medical professionals with at least five years of experience in mental health institutions [57]. Psychological counselors typically work with clients from the general population, addressing interpersonal, family, and occupational stress and adjustment issues. In contrast, psychotherapists primarily treat individuals with mental disorders and provide medical care [58]. Both examinations contain a substantial number of factual, knowledge-based questions related to psychological consultation and clinical experience. These items offer a reliable source for evaluating language models’ capabilities in psychological knowledge understanding and application.
We collected, cleaned, and organized question–answer pairs from Psychological Counselor and Psychotherapist practice examination materials publicly available online. The resulting dataset, named PsyFactQA, comprises a total of 6005 items, with descriptive statistics summarized in Table 1. Based on source distribution, the dataset includes 3373 items from the Psychological Counselor exams, comprising 2001 single-choice and 1372 multiple-choice questions. This portion forms the Counselor subset and accounts for 56.17% of the total. The remaining 2632 items, all single-choice questions, are drawn from the Psychotherapist exam, constituting the Therapist subset and representing 43.83% of the dataset. In terms of question types, single-choice questions make up 66.68% of the dataset, while multiple-choice questions account for 33.32%. We further extracted questions answerable without options to support open-ended evaluation, resulting in a total of 3432 items. The Counselor subset contains 2044 items, accounting for 59.56%, while the Therapist subset contains 1388 items, accounting for 40.44%.
To further illustrate the characteristics of the constructed dataset, Figure 1 presents representative examples randomly sampled from both exam categories. Each item from the Psychological Counselor set provides four candidate options and primarily assesses foundational knowledge related to applied psychological practices, such as client case analysis and counseling principles. Items from the Psychotherapist set include five candidate options and focus on medical psychological knowledge, examining the model’s understanding of mental disorders, symptomatology, and pharmacological attributes.
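To make the item format concrete, the sketch below shows how a single PsyFactQA entry could be represented in code. The field names and option texts are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout of one PsyFactQA item; field names and option texts are
# illustrative assumptions rather than the dataset's actual schema.
counselor_item = {
    "source": "counselor",             # "counselor" (4 options) or "therapist" (5 options)
    "question_type": "single_choice",  # counselor items may also be "multiple_choice"
    "stem": "Which event marked the birth of scientific psychology?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": ["B"],                   # a list accommodates multiple-choice items
    "open_ended_eligible": True,       # answerable from the stem alone (3432 such items)
}
```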

4. Evaluation

4.1. Evaluated LLMs

We divide the evaluated LLMs into two groups: those without explicit reasoning capabilities (non-reasoning models, e.g., GPT-4o, DeepSeek-V3) and those optimized for reasoning (reasoning models, e.g., o4-mini, DeepSeek-R1). The LLMs included in our evaluation are listed below; each model’s open-source or closed-source status is reported as of 1 May 2025.
For open-sourced LLMs, we deployed their official release checkpoints on a single machine equipped with eight NVIDIA GeForce RTX 4090 GPUs, using the vLLM framework. For closed-sourced LLMs, we utilized their official APIs to perform the question answering tasks. Except for the temperature parameter, which was adjusted as required, all other hyperparameters were kept at their default values according to the model providers’ guidance. To ensure fairness in latency measurement, we used a batch size of 1, submitting questions sequentially and recording each corresponding response and its latency.
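As an illustration of this setup, the sketch below serves an open-source checkpoint locally with vLLM and submits questions one at a time while recording per-response latency. The model name, prompt, and decoding settings are placeholders; the paper's exact configuration may differ.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder open-source checkpoint; the evaluation covers many models (Qwen, GLM, etc.).
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=512)  # other settings left at defaults

def answer_with_latency(prompt: str):
    """Submit a single question (batch size 1) and record its response latency."""
    start = time.perf_counter()
    output = llm.generate([prompt], params)[0]
    return output.outputs[0].text, time.perf_counter() - start

# Sequential submission keeps the latency measurement fair across models.
for question in ["Which event marked the birth of scientific psychology? A. ... B. ... C. ... D. ..."]:
    text, latency = answer_with_latency(question)
    print(f"{latency:.2f}s\t{text[:80]}")
```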
(1)
Non-Reasoning Models:
  • Qwen3 [4] (2025): The latest series of models from the Qwen family (in this paper, “latest” refers to versions released up to 1 May 2025) is intended to advance performance, efficiency, and multilingual capabilities. Qwen3 was pre-trained on 36 trillion tokens drawn from a diverse corpus including plain text, text extracted from PDF documents, and synthetic data. We included two types of Qwen3 models: dense models (8B, 14B, and 32B) and MoE models (235B-A22B, and 30B-A3B).
  • DeepSeek-V3 [41] (2025): This MoE model has 671B total parameters, with 37B activated during inference. It was pre-trained on 14.8 trillion high-quality and diverse tokens, and optimized through supervised fine-tuning and reinforcement learning.
  • Qwen2.5 [44] (2025): This series was trained on 18 trillion high-quality tokens, incorporating more common sense, expert knowledge, and reasoning information. More than 1 million samples were used during post-training to enhance instruction-following capabilities. We included four dense models: 7B, 14B, 32B, and 72B.
  • GLM-4-0414 [42] (2025): This latest GLM dense model series was pre-trained on 15 trillion high-quality tokens and further optimized via supervised fine-tuning and human preference alignment. Two model sizes were included: 9B, and 32B.
  • GLM-4-9B-Chat [59] (2024): This fourth-generation GLM model was pre-trained on 10 trillion multilingual tokens and fine-tuned with supervised learning and human feedback, showing strong performance in semantics, knowledge, and reasoning tasks.
  • ChatGLM3-6B [60] (2023): As the third-generation model in the GLM family, this dialog-optimized version shows competitive performance compared to other models with fewer than 10 billion parameters, especially in knowledge and reasoning tasks.
  • ChatGLM2-6B [61] (2023): As the second-generation GLM model, it was pre-trained on 1.4 trillion bilingual tokens and fine-tuned with supervised learning and human preference alignment, outperforming similarly sized models at the time of release.
  • ChatGLM-6B [62] (2023): As the first-generation GLM model, it was pre-trained on 1 trillion tokens and fine-tuned with supervised learning and reinforcement learning.
  • Qwen [43] (2023): As the first-generation models in the Qwen family, they were pre-trained on 3 trillion tokens and fine-tuned with supervised learning and reinforcement learning from human feedback. They are reported to perform well in arithmetic and logical reasoning. The evaluation includes three dense models: 7B, 14B, and 72B.
  • Baichuan2 [45] (2023): The latest series of models from the Baichuan family, which was pre-trained on 2.6 trillion high-quality tokens, showed significantly improved performance across benchmarks. We included two dense models: 7B and 13B.
  • ERNIE-4.5 [63] (2024): This is the latest closed-source LLM from the ERNIE family, jointly trained on text, images, and videos. It achieves significant improvements in cross-modal learning efficiency and multimodal understanding.
  • GPT-4o [2] (2024): One of the leading LLMs for dialogue, it achieves strong results across a wide range of benchmarks, including conversation, knowledge reasoning, and code generation, in both text and multimodal settings.
  • ERNIE-4 [64] (2024): A widely adopted LLM particularly for Chinese language, it demonstrates improved capabilities in semantic understanding, text generation, and reasoning over its predecessor, making it suitable for complex tasks.
  • ChatGPT [1] (2022): Recognized as a milestone in conversational artificial intelligence, it delivers strong performance in language understanding and generation.
(2)
Reasoning Models:
  • Qwen3-Think [4] (2025): This series of models introduces a naturally integrated thinking mode. We included five models from both dense and MoE architectures: 32B, 14B, 8B, 235B-A22B, and 30B-A3B, and we activated their thinking mode.
  • QwQ-32B [65] (2025): Based on Qwen2.5, this model adopts multi-stage reinforcement learning, starting with accuracy and execution signals for math and coding, and then using general rewards and rule-based validators to enhance its performance.
  • DeepSeek-R1 [66] (2025): This series of models utilizes group relative policy optimization [67]. We included the full-parameter DeepSeek-R1 and two distilled models: DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-32B.
  • GLM-Z1-0414 [42] (2025): Built on GLM-4-0414, this model series enhances deep reasoning through cold-start and extended reinforcement learning, improving performance on math and complex logic tasks. We included its two models: 9B and 32B.
  • ERNIE-X1 [68] (2025): The latest reasoning model from the ERNIE family, featuring enhanced comprehension, planning, and reflection. It excels in Chinese knowledge QA, literary generation, logical reasoning, and complex calculations.
  • o3-mini [69] (2025): A reasoning model focused on STEM domains, it improves efficiency while matching or surpassing its predecessor in multiple benchmarks.
  • o4-mini [47] (2025): The latest in OpenAI’s reasoning model family, it builds on o3-mini with stronger tool use and problem-solving capabilities, outperforming its predecessor in expert evaluations and setting a new standard for reasoning.

4.2. Evaluation Mode

To assess LLMs’ knowledge capability in psychological QA and gain insight into mainstream inference methods, this study explores five representative evaluation paradigms.
  • Chain-of-Thought (CoT) [48]: LLMs reason step by step before producing an answer, which has shown performance gains in math QA [70,71] and code generation [72,73]. In our experiments, LLMs were required to perform zero-shot CoT reasoning.
  • Few-Shot (FS) [49]: LLMs are provided with K in-domain examples with answers before generating a response to a new query, improving performance across various NLP tasks [74,75,76]. In our setting, LLMs were prompted with 5 random QA pairs.
  • Direct Answering: LLMs are required to directly produce the answer option or content based on the given question, without step-by-step reasoning or any examples.
  • Temperature: LLMs generate responses with different sampling temperatures τ, which control output randomness. Prior work has found limited impact on task performance [77,78] but noted a weak effect on creativity [50]. We tested τ ∈ {0.0, 1.0} in our experiments.
  • Closed/Open-Ended: In the closed-ended setting, LLMs receive both the question stem and answer options, reflecting standardized exam formats, and their responses are extracted using regular expressions. In contrast, in the open-ended setting, only the question stem is provided, and the generated answers are judged by Qwen3-32B on whether they at least contain the correct answers. A schematic sketch of how these modes can be combined into prompts is given below.
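The following is a minimal sketch of how these modes could be combined when constructing prompts and parsing closed-ended responses. The prompt wording (the paper uses Chinese prompts) and the extraction regex are illustrative assumptions, not the exact templates used in the study.

```python
import random
import re

def build_prompt(stem, options=None, cot=False, few_shot_pool=None, k=5):
    """Assemble a query under the CoT / few-shot / closed- vs. open-ended modes."""
    parts = []
    if few_shot_pool:  # FS mode: prepend K random in-domain QA pairs with their answers
        for ex in random.sample(few_shot_pool, k):
            parts.append(f"Question: {ex['stem']}\nOptions: {ex['options']}\nAnswer: {ex['answer']}")
    body = f"Question: {stem}"
    if options is not None:  # closed-ended: stem plus lettered options; open-ended: stem only
        body += "\nOptions: " + " ".join(f"{key}. {text}" for key, text in options.items())
    parts.append(body)
    # Zero-shot CoT asks for step-by-step reasoning; direct answering skips it.
    parts.append("Think step by step, then give the final answer." if cot
                 else "Give the answer directly.")
    return "\n\n".join(parts)

def extract_choice(response: str):
    """Pull the selected option letters out of a closed-ended response (illustrative regex)."""
    letters = re.findall(r"[A-E]", response)
    return sorted(set(letters)) or None
```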

4.3. Research Questions

In this paper, we explore four research questions based on the PsyFactQA dataset.
  • RQ1: How well do popular LLMs perform on the psychological QA dataset?
  • RQ2: How do reasoning and non-reasoning LLMs differ on psychological QA?
  • RQ3: How do different evaluation modes affect the accuracy of psychological QA?
  • RQ4: What kinds of errors do LLMs make in psychological QA?

4.4. RQ1: Comparative Psychological QA Accuracy of Popular LLMs

As shown in Table 2, we report on the accuracy of the evaluated LLMs on the psychological QA dataset. To facilitate comparison between closed- and open-ended question answering, we depict their accuracy and response latency in Figure 2. We first analyze the accuracy in the closed-ended QA setting. (1) Models with larger parameter sizes tended to yield a higher accuracy. Among open-source models, Qwen2.5-72B (85.66%) and Qwen3-235B-A22B (84.03%) achieved the highest and second-highest accuracies, respectively, while Baichuan2-7B (42.98%) and Baichuan2-13B (44.45%) achieved the lowest and second-lowest. ERNIE-4.5 (89.84%) achieved the best performance in the closed-source models, making a 12.02% improvement over its predecessor, ERNIE-4 (77.82%). It also outperformed the best open-source model, Qwen2.5-72B (85.66%), by 4.18%. These superior models have larger parameters compared to the other baselines. Similarly, within the same model family, models with larger parameter sizes tended to achieve a higher accuracy. For instance, within the newly released Qwen3 series: Qwen3-32B (82.08%) > Qwen3-14B (77.72%) > Qwen3-8B (72.97%). A similar trend is seen in the GLM-4-0414 series: GLM-4-0414-32B (77.85%) > GLM-4-0414-9B (69.56%). Although the Qwen3 series was trained on a larger corpus than Qwen2.5, Qwen2.5-72B (85.66%) still outperformed Qwen3-32B (82.08%) by 3.58%, emphasizing the advantage of larger parameter scales. (2) Model performance improved across generations at comparable parameter scales. As model families evolved, improvements in the quality and diversity of training corpora enhanced their performance in psychological QA. Within the 32B-scale models, Qwen3-32B (82.08%) outperformed Qwen2.5-32B (80.05%). Among the 14B-scale models, Qwen3-14B (77.72%) performed better than Qwen2.5-14B (74.92%) and Qwen-14B (60.33%). A similar trend was observed in the GLM series, where GLM-4-0414-9B (69.56%) outperformed GLM-4-9B (67.29%). For the 7–8B scale, Qwen3-8B (72.97%) slightly surpassed Qwen2.5-7B (72.49%), both of which significantly outperformed Qwen-7B (49.41%). (3) MoE models outperformed dense models with similar active parameter sizes but showed lower performance when total parameter sizes were comparable. For example, the MoE model Qwen3-30B-A3B (77.60%), with 30B total parameters, scored 4.48% lower than the dense Qwen3-32B (82.08%) and 2.45% lower than Qwen2.5-32B (80.05%). In contrast, when the number of active parameters was comparable, MoE models often outperformed dense models by leveraging their larger total parameter scales. For instance, DeepSeek-V3 (83.31%), with a total of 671B parameters but only 37B active during inference, outperformed 32B dense models, such as Qwen3-32B (82.08%) and GLM-4-0414-32B (77.85%). Similarly, Qwen3-235B-A22B (84.03%), with 22B active parameters, surpassed both Qwen3-14B (77.72%) by 6.31% and Qwen3-32B (82.08%) by 1.95%.
In the open-ended QA setting, where no answer options were provided, similar performance trends were observed. (1) Larger parameter models tended to yield a higher accuracy. The top-performing open-source models were DeepSeek-V3 (65.36%) and Qwen2.5-72B (64.80%), while the lowest performers were ChatGLM3-6B (17.45%) and ChatGLM-6B (24.65%). Compared to the closed-ended setting, the open-ended setting placed greater emphasis on text comprehension and knowledge representation. LLM responses tended to be more comprehensive compared to merely selecting from predefined options. A response is deemed correct if it includes and is consistent with the correct answer. Nevertheless, some generated outputs were still incorrect in response to the psychological questions. The top three LLMs remained consistent across the closed- and open-ended settings: DeepSeek-V3, Qwen2.5-72B, and Qwen3-235B-A22B. The best-performing closed-sourced model was ERNIE-4.5 (75.99%), which improved by 10.72% over ERNIE-4 (65.27%) and outperformed DeepSeek-V3 (65.36%) by 10.63%. Within the same model series, larger parameter models still showed a higher accuracy. In the Qwen3 series, Qwen3-32B (62.65%) outperformed Qwen3-14B (58.36%) and Qwen3-8B (50.90%). This trend also held across earlier Qwen series: Qwen2.5-72B (64.80%) > Qwen2.5-32B (63.61%) > Qwen2.5-14B (62.15%) > Qwen2.5-7B (55.10%); Qwen-72B (45.92%) > Qwen-14B (42.95%) > Qwen-7B (37.27%). The GLM series exhibited a similar pattern: GLM-4-0414-32B (56.06%) > GLM-4-0414-9B (49.21%). (2) Most models’ performance improved across generations at comparable parameter scales. While subsequent versions within the same model family tended to improve accuracy in closed-ended settings, this trend did not always extend to open-ended QA. For example, within the Qwen family, Qwen2.5-32B (63.61%) outperformed Qwen3-32B (62.65%); Qwen2.5-14B (62.15%) surpassed Qwen3-14B (58.36%) and Qwen-14B (42.95%); and Qwen2.5-7B (55.10%) exceeded both Qwen3-8B (50.90%) and Qwen-7B (37.27%). In contrast, the GLM and ERNIE series exhibited steady improvements across model generations. These findings emphasize the necessity of task-specific evaluation rather than assuming consistent performance improvements across model evolutions. (3) MoE models outperformed dense models when their scales of active parameters were comparable but underperformed when total parameter scales were similar. When the active parameter scales of MoE models were comparable to the total parameter scales of dense models, MoE models tended to achieve a higher accuracy: DeepSeek-V3 (65.36%) outperformed Qwen2.5-32B (63.61%) and surpassed Qwen3-32B (62.65%) and GLM-4-0414-32B (56.06%). Similarly, Qwen3-235B-A22B (63.96%) exceeded Qwen3-32B (62.65%) and Qwen3-14B (58.36%). When total parameter sizes were similar, MoE models generally performed worse: Qwen2.5-32B (63.61%) > Qwen3-32B (62.65%) > Qwen3-30B-A3B (56.12%).

4.5. RQ2: Reasoning vs. Non-Reasoning LLMs in Psychological QA Accuracy

Reasoning models have previously been shown to perform reliably in tasks requiring structured thinking, including mathematics and programming. Accordingly, we investigated whether the reasoning models could contribute to improved accuracy in the psychological QA dataset. To assess differences in accuracy and response latency between reasoning and non-reasoning models from the same family, we analyzed their performance and time consumption using ratio-based comparisons. For example, Qwen3-32B-Think / Qwen3-32B represents the performance ratio between a reasoning model (Qwen3-32B-Think) and its corresponding non-reasoning baseline (Qwen3-32B). A ratio > 1 indicates a higher accuracy or response latency for the reasoning model. In this context, higher values are desirable for accuracy, while lower values are preferred for latency.
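As a small illustration of this ratio-based comparison (the numbers below are made up, not results from Table 3):

```python
# Illustrative accuracy/latency ratios between a reasoning model and its
# non-reasoning counterpart; the values are invented for demonstration only.
reasoning = {"accuracy": 0.83, "latency_s": 42.0}      # e.g., a Qwen3-32B-Think-style model
non_reasoning = {"accuracy": 0.82, "latency_s": 1.5}   # e.g., its Qwen3-32B baseline

acc_ratio = reasoning["accuracy"] / non_reasoning["accuracy"]    # >1 favors the reasoning model
lat_ratio = reasoning["latency_s"] / non_reasoning["latency_s"]  # >1 means slower inference
print(f"accuracy ratio: {acc_ratio:.2f}, latency ratio: {lat_ratio:.2f}")
```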
As shown in Table 3, we present the comparison results between reasoning and non-reasoning models. From the perspective of the full dataset, reasoning models did not consistently outperform their non-reasoning counterparts. In the closed-ended QA setting, only three comparisons, DeepSeek-R1 / DeepSeek-V3, DeepSeek-R1-32B / Qwen2.5-32B, and o4-mini / GPT-4o, showed improvements in accuracy. In the open-ended setting, performance gains in accuracy were observed in four comparisons: DeepSeek-R1 / DeepSeek-V3, DeepSeek-R1-32B / Qwen2.5-32B, o4-mini / GPT-4o, and o3-mini / GPT-4o, although the observed gains in accuracy were limited, falling within a modest range of [1.01, 1.10]. However, all reasoning models incurred a significantly higher inference latency across all comparisons. Notably, the latency ratios reached 156.50, 149.43, and 140.50 for QwQ-32B / Qwen2.5-32B, DeepSeek-R1-32B / Qwen2.5-32B, and DeepSeek-R1-7B / Qwen2.5-7B, respectively. In summary, reasoning models showed limited accuracy benefits but considerably increased computational and time costs. Our error analysis suggested that when the psychological questions involve simple computation or reasoning, both reasoning and non-reasoning models can answer them comparably. Therefore, non-reasoning models are preferable for fact-based psychological QA tasks due to their lower computational and time costs.

4.6. RQ3: Psychological QA Performance Across Evaluation Modes

We explore how different inference strategies affect QA accuracy on the PsyFactQA dataset. For this investigation, we select two series of LLMs, Qwen3 and Qwen2.5, and ensure consistency in model sizes: Qwen3-32B vs. Qwen2.5-32B, Qwen3-14B vs. Qwen2.5-14B, and Qwen3-8B vs. Qwen2.5-7B. We examine the effect of each independent evaluation mode by testing its two distinct values, including {False, True} or {0.00, 1.00}, across all combinations of the remaining modes. For example, to evaluate the effect of the CoT mode, we compare performance under CoT=False and CoT=True across all combinations of the remaining modes. Given three independent evaluation modes (CoT, FS, and temperature), each with two possible values, we evaluate each LLM under 2^3 = 8 distinct configurations.
As shown in Table 4, the results are organized into eight rows per model, which can be separated into closed-ended and open-ended QA accuracy categories. We first report on the evaluation modes that achieve the top two performance levels. In the closed-ended setting, Qwen3-32B achieved the overall highest and second-highest total accuracies of 84.48% and 84.36% under FS=True and CoT=False, with τ = 1.0 and τ = 0.0, respectively. Within the Qwen2.5 series, Qwen2.5-32B showed the highest and second-highest total accuracies of 82.03% and 81.73% with the same configurations. By contrast, the best-performing setup changed slightly in the open-ended setting. Qwen2.5-32B achieved the highest and second-highest overall accuracies of 64.77% and 64.63% under FS=False and CoT=False, when τ = 0.0 and τ = 1.0, respectively. Tested under the same setups, Qwen3-32B showed only slight drops of 0.29% and 1.98% in accuracy compared to Qwen2.5-32B, while still ranking among the top two results within the Qwen3 series.
To investigate the impact of the three modes on QA accuracy, we conducted a comparative analysis by isolating each mode under a set of controlled configurations. Let $X_i$ denote the isolated mode (e.g., CoT) and $X_{\neg i}$ the remaining modes held constant. For each fixed configuration $x^{(j)} \in \{x^{(1)}, \ldots, x^{(N)}\}$ of the non-isolated modes, and for two distinct values $x_a, x_b$ of the isolated mode, we define the accuracy difference as follows:

$$\Delta^{(j)} = \mathrm{Acc}\big(X_i = x_a \mid X_{\neg i} = x^{(j)}, L\big) - \mathrm{Acc}\big(X_i = x_b \mid X_{\neg i} = x^{(j)}, L\big),$$

where $L$ denotes the evaluated LLM and $\mathrm{Acc}(\cdot)$ represents the accuracy score.
Based on the difference, we define the WinRate and DrawRate of $x_a$ versus $x_b$ as follows:

$$\mathrm{WinRate}(x_a > x_b) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\big[\Delta^{(j)} > 0\big],$$

$$\mathrm{DrawRate}(x_a = x_b) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\big[\Delta^{(j)} = 0\big],$$

where $\mathbb{I}[\cdot]$ denotes the indicator function. Then, the WinRate of $x_b$ over $x_a$ is computed as

$$\mathrm{WinRate}(x_b > x_a) = 1 - \mathrm{WinRate}(x_a > x_b) - \mathrm{DrawRate}(x_a = x_b).$$
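A minimal sketch of this win-rate computation, assuming per-configuration accuracies are stored in a dictionary keyed by the full mode assignment (all numbers below are hypothetical):

```python
from itertools import product

def win_draw_rates(acc, isolated, value_a, value_b, other_modes):
    """WinRate/DrawRate of value_a vs. value_b for one isolated mode.

    `acc` maps a configuration (frozenset of (mode, value) pairs) to accuracy;
    `other_modes` maps each non-isolated mode to its candidate values.
    """
    configs = [dict(zip(other_modes, vals)) for vals in product(*other_modes.values())]
    wins = draws = 0
    for cfg in configs:  # the N fixed configurations of the non-isolated modes
        delta = (acc[frozenset({**cfg, isolated: value_a}.items())]
                 - acc[frozenset({**cfg, isolated: value_b}.items())])
        wins += delta > 0
        draws += delta == 0
    n = len(configs)
    return wins / n, draws / n  # WinRate(a > b), DrawRate(a = b)

# Hypothetical accuracies for the 2^3 configurations of one model.
acc = {frozenset({"CoT": c, "FS": f, "temp": t}.items()): 0.80 + 0.02 * f - 0.03 * c
       for c, f in product([False, True], repeat=2) for t in (0.0, 1.0)}

win, draw = win_draw_rates(acc, isolated="CoT", value_a=False, value_b=True,
                           other_modes={"FS": [False, True], "temp": [0.0, 1.0]})
print(f"WinRate(CoT off > CoT on) = {win:.2f}, DrawRate = {draw:.2f}")
# WinRate(CoT on > CoT off) = 1 - win - draw
```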
As depicted in Figure 3, we present the accuracy differences across the three modes. For the total group in the closed-ended setting, disabling CoT resulted in a 100% WinRate, with an absolute accuracy difference $|\Delta| \in [0.65, 10.51]$, indicating a substantial advantage. In contrast, under the open-ended setting, disabling CoT led to a WinRate of 87.5%, accompanied by a more pronounced accuracy difference of $|\Delta| \in [0.58, 17.77]$. We found that CoT reasoning was less effective for factual psychological questions, where performance relied more on LLMs’ internal knowledge than on step-by-step reasoning. In such cases, enabling CoT offered little benefit and instead incurred substantial overhead in inference time and computational cost. In contrast, the few-shot mode generally improved psychological QA performance, particularly in the closed-ended setting where it achieved a 100% WinRate despite only modest gains in absolute accuracy difference of $|\Delta| \in [0.08, 2.63]$. Its effect was less pronounced in the open-ended setting, with WinRate dropping to 58.3%. Nevertheless, few-shot was still beneficial, even with randomly sampled examples. By contrast, the effect of temperature varied between the closed- and open-ended settings. In the closed-ended setting, a lower temperature (0.00) consistently yielded higher WinRates across all groups, with a small accuracy gap $|\Delta| \in [0.02, 3.07]$ in the total group, indicating a modest but consistent advantage for deterministic decoding in closed-ended factual questions. In the open-ended setting, results were more mixed, with WinRate often near 50% and minimal accuracy differences $|\Delta| \in [0.00, 3.27]$ in the total group.
To sum up, our findings highlight several practical implications for optimizing evaluation strategies when employing LLMs in psychological factual QA. We recommend minimizing CoT use to avoid unnecessary time and cost. Conversely, incorporating the FS mode, even with randomly selected examples, consistently enhances performance, particularly in the closed-ended setting. Regarding temperature, deterministic decoding provides modest but consistently superior performance in closed-ended questions. However, temperature has negligible and variable effects in open-ended QA scenarios, indicating less sensitivity to decoding strategy. Overall, selecting the FS mode and avoiding CoT generally yields optimal performance for psychological factual QA questions on the evaluated LLMs.

4.7. RQ4: Error Analysis of LLM Performance in Psychological QA

To further investigate LLM errors in psychological QA, we compared their predictions with reference answers and calculated the pass rate of each question across all baseline models. In this article, the pass rate is formally defined as follows:
$$\mathrm{PassRate}(q) = \frac{1}{|L|} \sum_{i=1}^{|L|} \mathbb{I}\big[\hat{y}_q^{(i)} = y_q\big],$$

where $L$ is the set of LLMs, $q$ denotes a question, $y_q$ is its reference answer, $\hat{y}_q^{(i)}$ is the $i$-th LLM’s prediction, and $\mathbb{I}[\cdot]$ is the indicator function. Thus, $\mathrm{PassRate}(q)$ quantifies the proportion of LLMs that produce the correct prediction for question $q$.
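A minimal sketch of this pass-rate computation and the subsequent ranking (the predictions and reference answers below are placeholders):

```python
def pass_rate(predictions, reference):
    """Proportion of evaluated LLMs whose prediction matches the reference answer."""
    return sum(pred == reference for pred in predictions) / len(predictions)

# Placeholder results: question id -> (per-LLM predictions, reference answer).
results = {
    "q001": (["B", "B", "C", "B"], "B"),   # pass rate 0.75
    "q002": (["A", "C", "C", "D"], "B"),   # pass rate 0.00
}
rates = {qid: pass_rate(preds, ref) for qid, (preds, ref) in results.items()}

# Rank questions by pass rate; the lowest-scoring ones feed the manual error analysis.
hardest = sorted(rates, key=rates.get)[:100]
```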
We ranked all questions by pass rate and selected the 100 lowest-scoring questions from each of the three subsets: Counselor (Single Choice), Counselor (Multiple Choice), and Therapist. This resulted in 600 low-pass-rate questions evenly split between the open- and closed-ended QA settings, with 300 from each. These questions were then manually categorized into three error types. Figure 4 presents the distribution of error types in both closed- and open-ended QA tasks. Each error type is defined and exemplified below.
  • Factual Error: This type includes questions related to established psychological knowledge, such as concepts from the history of psychology, psychotherapy, cognitive psychology, psychometrics, expert consensus, and diagnostic criteria for mental disorders. These questions have clear, fixed answers and usually require no reasoning. They assess the model’s ability to recall factual knowledge about psychological concepts, diagnostic standards, and medication recommendations. Examples: “Which event marked the birth of scientific psychology?”, “Which medications are used to treat bulimia nervosa?”.
  • Reasoning Error: These questions involve basic psychometric computations and typically require recognizing variable relationships or performing simple calculations. This category evaluates the model’s abilities in logic, causality, and basic numerical reasoning. Examples: “A child aged 8 has a mental age of 10. What is their ratio IQ?”, “A test item has a full score of 15 and an average score of 9.6. What is its difficulty index approximately?” (worked solutions for these two examples are sketched after this list).
  • Decision Error: This question type involves interpreting specific behavioral scenarios or case-based descriptions to identify the underlying psychological concept or state. Such questions provide detailed context or observable behaviors and evaluate the model’s ability to apply theoretical knowledge to diagnostic or interpretive judgments. Examples: “If a client constantly discusses others to prove they have no issues, this pattern may indicate:”, “When the counselor and client have inconsistent goals that are difficult to align, how should the goals be determined?”, “How should a counselor respond to a client’s conflicted silence?”
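For reference, the worked solutions to the two reasoning-error example items quoted above, using the standard formulas for ratio IQ and item difficulty (these calculations are ours, added for illustration):

```latex
% Ratio IQ: mental age over chronological age, multiplied by 100.
\mathrm{IQ}_{\mathrm{ratio}} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100 = \frac{10}{8} \times 100 = 125
% Item difficulty for a polytomously scored item: mean score over full score.
P = \frac{\bar{X}}{X_{\max}} = \frac{9.6}{15} = 0.64
```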
Figure 4. Distribution of error types in closed-ended and open-ended QA tasks. Factual errors dominate in both settings, while reasoning and decision errors account for a smaller proportion.
In factual error cases, models commonly exhibited memory distortions and information confusion. For instance, in the question “Which of the following are encoding methods of sensory memory?”, where the correct answer is “acoustic encoding, visual imagery”, models frequently selected answers such as “afterimages, visual imagery”, which only reflect visual encoding and neglect the role of acoustic encoding, such as phonological input, a core auditory component of sensory memory encoding [79]. This indicates incomplete knowledge retention or option-induced bias. “Acoustic encoding” was more frequently omitted when candidate options were presented. In open-ended QA, models produced incorrect responses and hallucinated content. For example, when asked about “methods to assess content validity”, models produced unrelated methods such as “self-reporting” and fabricated non-standard ones such as “variance edition method” and “coefficient edition method”.
In reasoning error cases, models often lacked the required theoretical knowledge or conversion rules, and their predictions were frequently incorrect. For instance, in the question “In the SCL-90 using a 0–4 scale, a total score above ( ) indicates a positive screening result”, the original scale does not specify a cutoff score. According to [80], based on Chinese normative data [81], a total score exceeding 160 on the 1–5 scale suggests a potential positive screening. However, when using the 0–4 scale, the score must first be adjusted by subtracting 90, resulting in a valid cutoff of 70 [82]. Most models failed to distinguish between these two scoring systems and directly output “160” regardless of scale, overlooking the difference in measurement ranges and producing erroneous answers in both closed- and open-ended QA settings. The step-by-step generation process used in open-ended QA tended to yield more accurate outcomes than the direct selection of options in closed-ended QA. For example, in the question “In a normative group of 1000 people ranked by test scores, what is the percentile rank of someone ranked 100th?”, both reasoning and non-reasoning models tended to choose “90%” when given the options “10%, 90%, 10, 90”. This contradicts standard statistical conventions, where percentile ranks are written as plain integers without the percent sign [83]. While reasoning models computed correctly, they still selected the percentage-form answer. Without options, however, both non-reasoning and reasoning models were more likely to output “90” through step-by-step reasoning. These findings suggest that in reasoning questions requiring precise computation, closed-ended formats may cause models to skip reasoning steps and rely on pattern matching or superficial linguistic associations, leading to incorrect answers.
In decision errors, common mistakes included a lack of conceptual knowledge and misinterpretation of contextual information. These questions typically presented case descriptions, behavioral scenarios, or event narratives and required the model to apply abstract psychological knowledge to specific real-world situations. For example, in the question “During the process of adapting to stress, an individual becomes sensitive and vulnerable, with even minor daily hassles triggering intense emotional responses. This indicates that the person is in which stage of the general adaptation syndrome? A. Alarm B. Resistance C. Exhaustion D. Termination”, most models incorrectly selected “C. Exhaustion”. According to the definitions of the three stages of the general adaptation syndrome, the symptoms described align more closely with “B. Resistance”, during which the hypothalamic–pituitary–adrenal axis remains activated in response to ongoing stress, leading to increased sensitivity and emotional reactivity [84,85]. The “C. Exhaustion” stage, by contrast, refers to the collapse of adaptive capacity due to prolonged stress, typically characterized by helplessness, depressive mood, and emotional depletion [84]. In the open-ended setting, model performance did not improve: many responses still incorrectly produced “Alarm” or “Exhaustion”, and some hallucinated non-existent stages such as “Irritability phase” or “Cold wave stage”, reflecting a lack of knowledge regarding established diagnostic constructs. Errors due to contextual misunderstanding refer to incorrect interpretation of situational cues in the question. For instance, in “A therapist organizes an emotional management group therapy program for adolescents. The goal is to increase emotional awareness, understand the impact of irrational thoughts, learn to restructure those thoughts, and manage emotions. The group includes 10 participants, meeting once a week for 8 sessions (60 min each), with topics including: Meeting by Fate, Emotion Report, Seven Emotions, Tracing the Roots, Emotion Code, Emotion Decoding, Colorful Emotions, and Emotional Mastery. This therapy is best categorized as: A. Non-structured Psychotherapy B. Structured Psychotherapy C. Individual Psychotherapy D. Group Workshop E. Group Lecture”, most models incorrectly selected “B. Structured Psychotherapy”. While their reasoning cited the fixed duration, frequency, and thematic structure of the sessions as indicators of formality, they failed to recognize that the term “structured psychotherapy” has a specific definition in psychology, encompassing structured-oriented therapies [86], structural family therapy [87], and standardized clinical trial protocols [88], which differ from the described intervention. The correct answer is “D. Group Workshop”, as the question explicitly refers to an adolescent group-based emotional regulation program, aligning more closely with that category.
In summary, the observed errors in LLM responses appear to stem from two primary limitations. The first concerns incomplete or imprecise knowledge of psychological concepts, which may result in factual inaccuracies, terminological confusion, or a misinterpretation of theoretical content. The second relates to the restricted capacity of models to perform domain-specific reasoning and situational analysis, particularly when required to interpret nuanced psychological or behavioral contexts. Both issues are related to gaps in the scope and psychological specificity of the pretraining data. For the first limitation, incorporating retrieval-augmented generation or fine-tuning on a high-quality psychology-specific corpus including authoritative textbooks and academic articles could substantially improve domain coverage and knowledge completeness. Regarding the second limitation, a practical approach is to compile expert-annotated data on psychological reasoning and analysis and incorporate it into supervised fine-tuning or reinforcement learning from human feedback to improve the LLMs’ application of these skills.

5. Limitations

In this study, we evaluated the psychological knowledge competence of LLMs using a dedicated psychological question-answering dataset, examining both closed- and open-ended QA settings. In the open-ended setting, we employed a state-of-the-art model (Qwen3-32B) to determine whether the responses of the evaluated LLMs contained the correct answers. While the use of Qwen3-32B may influence the absolute accuracy scores, it was applied consistently across all determinations in this setting, ensuring the comparability of results among the evaluated LLMs. Moreover, this study still focuses on evaluation from an exam-oriented perspective, and there may be a gap between such testing and real-world application scenarios, which can be more complex. In future work, we will extend our dataset by constructing a knowledge graph based on the current data and using random sampling to generate multi-hop and more complex psychological questions with constraints. By inviting clinical experts to review the quality and effectiveness of these questions, we aim to develop a more challenging psychological question answering dataset. Furthermore, the PsyFactQA dataset and evaluation results were derived from the Chinese linguistic and cultural context, and we intend to construct multilingual psychological QA datasets to assess the cross-cultural capabilities of LLMs.

6. Conclusions

In this study, we conducted a large-scale, systematic evaluation of LLMs’ psychological competence from the perspective of psychological counselor and psychotherapist question answering tasks. To provide a more comprehensive understanding of how well LLMs comprehend and apply psychological knowledge, we evaluated diverse model architectures, parameter scales, and evaluation paradigms. The experimental results reveal that larger and more recent LLMs generally achieve a higher accuracy in both closed- and open-ended evaluation settings. When active parameter scales are comparable, MoE models outperform dense models, although they rely on larger total parameter counts. Regarding the inference strategies, few-shot learning consistently enhances the psychological QA performance, particularly in closed-ended settings, whereas CoT prompting and temperature tuning provide limited gains. Error analysis suggests that current LLMs still have limitations in mastering foundational psychological concepts, conducting psychometric reasoning, and interpreting behaviorally nuanced scenarios. Many of these shortcomings stem from incomplete knowledge coverage and a lack of the contextual understanding rooted in psychological expertise. Although advanced LLMs show promise in psychological QA tasks, our results caution against their unguided use in high-stakes, knowledge-intensive settings, such as clinical decision-making. It is recommended to adopt a human-AI collaborative approach in practical scenarios. In future work, we suggest curating more authoritative psychology corpora to enhance LLMs’ psychological competence through fine-tuning or retrieval-augmented generation. Additionally, integrating a knowledge graph into LLMs for psychological QA tasks may be a feasible approach, as it can improve the interpretability and reliability of responses by providing traceable sources.

Author Contributions

Methodology, F.G.; software, F.G. and Y.H.; validation, F.G. and Y.H.; formal analysis, F.G.; investigation, F.G.; resources, Q.C.; data curation, F.G. and Y.H.; writing—original draft preparation, F.G., Q.C., and F.L.; writing—review and editing, F.L. and Q.C.; visualization, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMs	Large Language Models
QA	Question Answering
CoT	Chain-of-Thought
FS	Few-Shot
MoE	Mixture-of-Experts

References

  1. OpenAI. Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt (accessed on 27 July 2025).
  2. OpenAI. Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o (accessed on 27 July 2025).
  3. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2025. [Google Scholar] [CrossRef]
  4. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025. [Google Scholar] [CrossRef]
  5. Hou, W.; Ji, Z. Comparing Large Language Models and Human Programmers for Generating Programming Code. Adv. Sci. 2025, 12, 2412279. [Google Scholar] [CrossRef] [PubMed]
  6. Lyu, M.R.; Ray, B.; Roychoudhury, A.; Tan, S.H.; Thongtanunam, P. Automatic Programming: Large Language Models and Beyond. ACM Trans. Softw. Eng. Methodol. 2024, 34, 1–33. [Google Scholar] [CrossRef]
  7. Husein, R.A.; Aburajouh, H.; Catal, C. Large language models for code completion: A systematic literature review. Comput. Stand. Interfaces 2025, 92, 103917. [Google Scholar] [CrossRef]
  8. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward Expert-Level Medical Question Answering with Large Language Models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef]
  9. Louis, A.; van Dijck, G.; Spanakis, G. Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 22266–22275. [Google Scholar]
  10. Li, X.; Zhou, Y.; Dou, Z. UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 8688–8696. [Google Scholar]
  11. Chen, Y.; Xing, X.; Lin, J.; Zheng, H.; Wang, Z.; Liu, Q.; Xu, X. SoulChat: Improving LLMs’ Empathy, Listening, and Comfort Abilities through Fine-Tuning with Multi-Turn Empathy Conversations. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 1170–1183. [Google Scholar]
  12. Blanco-Cuaresma, S. Psychological Assessments with Large Language Models: A Privacy-Focused and Cost-Effective Approach. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology, St. Julians, Malta, 21 March 2024; pp. 203–210. [Google Scholar]
  13. Zheng, D.; Lapata, M.; Pan, J.Z. How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency. arXiv 2024. [Google Scholar] [CrossRef]
  14. Demszky, D.; Yang, D.; Yeager, D.S.; Bryan, C.J.; Clapper, M.; Chandhok, S.; Eichstaedt, J.C.; Hecht, C.; Jamieson, J.; Johnson, M.; et al. Using Large Language Models in Psychology. Nat. Rev. Psychol. 2023, 2, 688–701. [Google Scholar] [CrossRef]
  15. Katz, D.M.; Bommarito, M.J.; Gao, S.; Arredondo, P. GPT-4 passes the bar exam. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2024, 382, 20230254. [Google Scholar] [CrossRef]
  16. Ali, R.; Tang, O.Y.; Connolly, I.D.; Zadnik Sullivan, P.L.; Shin, J.H.; Fridley, J.S.; Asaad, W.F.; Cielo, D.; Oyelese, A.A.; Doberstein, C.E.; et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery 2023, 93, 1353–1365. [Google Scholar] [CrossRef]
  17. Zong, H.; Wu, R.; Cha, J.; Wang, J.; Wu, E.; Li, J.; Zhou, Y.; Zhang, C.; Feng, W.; Shen, B. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. J. Med. Internet Res. 2024, 26, e66114. [Google Scholar] [CrossRef]
  18. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  19. Xing, W.; Nixon, N.; Crossley, S.; Denny, P.; Lan, A.; Stamper, J.; Yu, Z. The Use of Large Language Models in Education. Int. J. Artif. Intell. Educ. 2025, 35, 439–443. [Google Scholar] [CrossRef]
  20. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  21. Hao, G.; Wu, J.; Pan, Q.; Morello, R. Quantifying the Uncertainty of LLM Hallucination Spreading in Complex Adaptive Social Networks. Sci. Rep. 2024, 14, 16375. [Google Scholar] [CrossRef]
  22. Tang, W.; Cao, Y.; Deng, Y.; Ying, J.; Wang, B.; Yang, Y.; Zhao, Y.; Zhang, Q.; Huang, X.; Jiang, Y.; et al. EvoWiki: Evaluating LLMs on Evolving Knowledge. arXiv 2024. [Google Scholar] [CrossRef]
  23. Wang, W.; Shi, J.; Tu, Z.; Yuan, Y.; Huang, J.; Jiao, W.; Lyu, M.R. The Earth is Flat? Unveiling Factual Errors in Large Language Models. arXiv 2024. [Google Scholar] [CrossRef]
  24. Yoshida, L. The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky, Proceedings of the 25th International Conference, AIED 2024, Recife, Brazil, 8–12 July 2024; Springer: Cham, Switzerland, 2024; pp. 61–73. [Google Scholar]
  25. Steyvers, M.; Tejeda, H.; Kumar, A.; Belem, C.; Karny, S.; Hu, X.; Mayer, L.W.; Smyth, P. What Large Language Models Know and What People Think They Know. Nat. Mach. Intell. 2025, 7, 221–231. [Google Scholar] [CrossRef]
  26. Echterhoff, J.M.; Liu, Y.; Alessa, A.; McAuley, J.; He, Z. Cognitive Bias in Decision-Making with LLMs. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 12640–12653. [Google Scholar]
  27. Xie, H.; Chen, Y.; Xing, X.; Lin, J.; Xu, X. PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling. arXiv 2024. [Google Scholar] [CrossRef]
  28. Lu, H.; Liu, T.; Cong, R.; Yang, J.; Gan, Q.; Fang, W.; Wu, X. QAIE: LLM-based Quantity Augmentation and Information Enhancement for few-shot Aspect-Based Sentiment Analysis. Inf. Process. Manag. 2025, 62, 103917. [Google Scholar] [CrossRef]
  29. Hellwig, N.C.; Fehle, J.; Wolff, C. Exploring large language models for the generation of synthetic training samples for aspect-based sentiment analysis in low resource settings. Expert Syst. Appl. 2025, 261, 125514. [Google Scholar] [CrossRef]
  30. Xu, X.; Yao, B.; Dong, Y.; Gabriel, S.; Yu, H.; Hendler, J.; Ghassemi, M.; Dey, A.K.; Wang, D. Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2024, 8, 1–32. [Google Scholar] [CrossRef]
31. Zhang, J.; He, H.; Song, N.; Zhou, Z.; He, S.; Zhang, S.; Qiu, H.; Li, A.; Dai, Y.; Ma, L.; et al. ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology. arXiv 2024. [Google Scholar] [CrossRef]
  32. Hu, J.; Dong, T.; Luo, G.; Ma, H.; Zou, P.; Sun, X.; Guo, D.; Yang, X.; Wang, M. PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation. IEEE Trans. Comput. Soc. Syst. 2025, 12, 539–551. [Google Scholar] [CrossRef]
  33. Zhao, J.; Zhu, J.; Tan, M.; Yang, M.; Li, R.; Di, Y.; Zhang, C.; Ye, G.; Li, C.; Hu, X.; et al. CPsyExam: A Chinese Benchmark for Evaluating Psychology using Examinations. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Eds.; pp. 11248–11260. [Google Scholar]
  34. Wang, C.; Luo, W.; Dong, S.; Xuan, X.; Li, Z.; Ma, L.; Gao, S. MLLM-Tool: A Multimodal Large Language Model for Tool Agent Learning. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 6678–6687. [Google Scholar]
  35. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025. [Google Scholar] [CrossRef]
  36. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 27 July 2025).
  37. Algherairy, A.; Ahmed, M. Prompting large language models for user simulation in task-oriented dialogue systems. Comput. Speech Lang. 2025, 89, 101697. [Google Scholar] [CrossRef]
  38. Seo, H.; Hwang, T.; Jung, J.; Kang, H.; Namgoong, H.; Lee, Y.; Jung, S. Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Appl. Sci. 2025, 15, 671. [Google Scholar] [CrossRef]
  39. Duong, B.T.; Le, T.H. MedQAS: A Medical Question Answering System Based on Finetuning Large Language Models. In Future Data and Security Engineering, Big Data, Security and Privacy, Smart City and Industry 4.0 Applications, Proceedings of the 11th International Conference on Future Data and Security Engineering, FDSE 2024, Binh Duong, Vietnam, 27–29 November 2024; Dang, T.K., Küng, J., Chung, T.M., Eds.; Springer: Singapore, 2024; pp. 297–307. [Google Scholar]
  40. Polignano, M.; Musto, C.; Pellungrini, R.; Purificato, E.; Semeraro, G.; Setzu, M. XAI.it 2024: An Overview on the Future of AI in the era of Large Language Models. In Proceedings of the 5th Italian Workshop on Explainable Artificial Intelligence, Co-Located with the 23rd International Conference of the Italian Association for Artificial Intelligence, Bolzano, Italy, 25–28 November 2024; Volume 3839, pp. 1–10. [Google Scholar]
  41. DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. DeepSeek-V3 Technical Report. arXiv 2025, arXiv:2412.19437. [Google Scholar] [CrossRef]
  42. THUDM. GLM-4-0414—A THUDM Collection. Available online: https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e (accessed on 27 July 2025).
  43. Qwen. Qwen—A Qwen Collection. Available online: https://huggingface.co/collections/Qwen/qwen-65c0e50c3f1ab89cb8704144 (accessed on 27 July 2025).
  44. Qwen. Qwen2.5—A Qwen Collection. Available online: https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e (accessed on 27 July 2025).
  45. Yang, A.; Xiao, B.; Wang, B.; Zhang, B.; Bian, C.; Yin, C.; Lv, C.; Pan, D.; Wang, D.; Yan, D.; et al. Baichuan 2: Open Large-scale Language Models. arXiv 2025. [Google Scholar] [CrossRef]
  46. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025. [Google Scholar] [CrossRef]
  47. OpenAI. Introducing OpenAI o3 and o4-mini. Available online: https://openai.com/index/introducing-o3-and-o4-mini (accessed on 27 July 2025).
  48. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24824–24837. [Google Scholar]
  49. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  50. Peeperkorn, M.; Kouwenhoven, T.; Brown, D.; Jordanous, A. Is Temperature the Creativity Parameter of Large Language Models? In Proceedings of the 15th International Conference on Computational Creativity, ICCC 2024, Jönköping, Sweden, 17–21 June 2024; Association for Computational Creativity: Jönköping, Sweden, 2024; pp. 226–235. [Google Scholar]
  51. Xiao, Z.; Deng, W.H.; Lam, M.S.; Eslami, M.; Kim, J.; Lee, M.; Liao, Q.V. Human-Centered Evaluation and Auditing of Language Models. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024. CHI EA ’24. [Google Scholar]
  52. Zheng, S.; Zhang, Y.; Zhu, Y.; Xi, C.; Gao, P.; Xun, Z.; Chang, K. GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; pp. 1363–1382. [Google Scholar]
  53. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
  54. Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Lei, J.; et al. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 62991–63010. [Google Scholar]
  55. Hu, Y.; Goktas, Y.; Yellamati, D.D.; De Tassigny, C. The Use and Misuse of Pre-Trained Generative Large Language Models in Reliability Engineering. In Proceedings of the 2024 Annual Reliability and Maintainability Symposium (RAMS), Albuquerque, NM, USA, 22–25 January 2024; pp. 1–7. [Google Scholar]
  56. Fu, X.; Wu, G.; Chen, X.; Liu, Y.; Wang, Y.; Lin, C. Professional Qualification Certification for Psychological Counselors. In Proceedings of the 21st National Conference on Psychology, Beijing, China, 2 November 2018; p. 149. [Google Scholar]
  57. Wang, M.; Jiang, G.; Yan, Y.; Zhou, Z. Certification Methods for Professional Qualification of Psychological Counselors and Therapists in China. Chin. J. Ment. Health 2015, 29, 503–509. [Google Scholar]
  58. Shen, J.; Wu, Z. A Review of Competency Assessment Research for Psychotherapists in China. Med. Philos. 2021, 42, 50–53. [Google Scholar]
  59. THUDM. GLM-4-9B-Chat. Available online: https://huggingface.co/THUDM/glm-4-9b-chat (accessed on 27 July 2025).
  60. THUDM. ChatGLM3-6B. Available online: https://huggingface.co/THUDM/chatglm3-6b (accessed on 27 July 2025).
  61. THUDM. ChatGLM2-6B. Available online: https://huggingface.co/THUDM/chatglm2-6b (accessed on 27 July 2025).
  62. THUDM. ChatGLM-6B. Available online: https://huggingface.co/THUDM/chatglm-6b (accessed on 27 July 2025).
  63. Baidu. ERNIE-4.5. Available online: https://yiyan.baidu.com (accessed on 27 July 2025).
  64. Baidu. ERNIE-4. Available online: https://yiyan.baidu.com (accessed on 27 July 2025).
  65. Qwen Team. QwQ-32B: Embracing the Power of Reinforcement Learning. Available online: https://qwenlm.github.io/blog/qwq-32b/ (accessed on 27 July 2025).
  66. deepseek-ai. DeepSeek-R1. Available online: https://huggingface.co/deepseek-ai/DeepSeek-R1 (accessed on 27 July 2025).
  67. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.K.; Wu, Y.; et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv 2024, arXiv:2402.03300. [Google Scholar] [CrossRef]
  68. Baidu. ERNIE-X1. Available online: https://yiyan.baidu.com (accessed on 27 July 2025).
  69. OpenAI. OpenAI o3-mini. Available online: https://openai.com/index/openai-o3-mini (accessed on 27 July 2025).
  70. Fung, S.C.E.; Wong, M.F.; Tan, C.W. Chain-of-Thoughts Prompting with Language Models for Accurate Math Problem-Solving. In Proceedings of the 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 6–8 October 2023; pp. 1–5. [Google Scholar]
  71. Henkel, O.; Horne-Robinson, H.; Dyshel, M.; Thompson, G.; Abboud, R.; Ch, N.A.N.; Moreau-Pernet, B.; Vanacore, K. Learning to Love LLMs for Answer Interpretation: Chain-of-Thought Prompting and the AMMORE Dataset. J. Learn. Anal. 2025, 12, 50–64. [Google Scholar] [CrossRef]
  72. Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-Thought in Neural Code Generation: From and for Lightweight Language Models. IEEE Trans. Softw. Eng. 2024, 50, 2437–2457. [Google Scholar] [CrossRef]
  73. Tian, Z.; Chen, J.; Zhang, X. Fixing Large Language Models’ Specification Misunderstanding for Better Code Generation. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; p. 645. [Google Scholar]
  74. Parikh, S.; Tiwari, M.; Tumbade, P.; Vohra, Q. Exploring Zero and Few-shot Techniques for Intent Classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), Toronto, ON, Canada, 9–14 July 2023; pp. 744–751. [Google Scholar]
  75. Li, Q.; Chen, Z.; Ji, C.; Jiang, S.; Li, J. LLM-based Multi-Level Knowledge Generation for Few-shot Knowledge Graph Completion. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Jeju, Republic of Korea, 3–9 August 2024; pp. 2135–2143, Main Track. [Google Scholar]
  76. Bi, J.; Zhu, W.; He, J.; Zhang, X.; Xian, C. Large Model Fine-tuning for Suicide Risk Detection Using Iterative Dual-LLM Few-Shot Learning with Adaptive Prompt Refinement for Dataset Expansion. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 8520–8526. [Google Scholar]
  77. Renze, M. The Effect of Sampling Temperature on Problem Solving in Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 7346–7356. [Google Scholar]
  78. Patel, D.; Timsina, P.; Raut, G.; Freeman, R.; Levin, M.A.; Nadkarni, G.N.; Glicksberg, B.S.; Klang, E. Exploring Temperature Effects on Large Language Models Across Various Clinical Tasks. medRxiv 2024. [Google Scholar] [CrossRef]
  79. Anderson, J. Cognitive Psychology and Its Implications; Post & Telecom Press: Beijing, China, 2012. [Google Scholar]
  80. Dai, X. Handbook of Common Psychological Assessment Scales; People’s Military Medical Press: Beijing, China, 2010. [Google Scholar]
  81. Jin, H.; Wu, W.; Zhang, M. A Preliminary Analysis of SCL-90 Scores in the Chinese Normal Population. Chin. J. Neuropsychiatr. Dis. 1986, 12, 260–263. [Google Scholar]
  82. Wang, X.; Wang, X.; Ma, H. Manual of Mental Health Rating Scales; Chinese Journal of Mental Health Press: Beijing, China, 1999. [Google Scholar]
  83. Shao, Z. Psychological Statistics; China Light Industry Press: Beijing, China, 2024. [Google Scholar]
  84. McCarty, R. The Alarm Phase and the General Adaptation Syndrome: Two Aspects of Selye’s Inconsistent Legacy. In Stress: Concepts, Cognition, Emotion, and Behavior; Academic Press: Cambridge, MA, USA, 2016; pp. 13–19. [Google Scholar]
  85. Loizzo, A.; Campana, G.; Loizzo, S.; Spampinato, S. Postnatal Stress Procedures Induce Long-Term Endocrine and Metabolic Alterations Involving Different Proopiomelanocortin-Derived Peptides. In Neuropeptides in Neuroprotection and Neuroregeneration; CRC Press: Boca Raton, FL, USA, 2012; pp. 107–126. [Google Scholar]
  86. Rudolf, G. Structure-oriented psychotherapy; [Strukturbezogene Psychotherapie]. Psychotherapeut 2012, 57, 357–372. [Google Scholar] [CrossRef]
  87. Corey, G. Theory and Practice of Counseling and Psychotherapy, 10th ed.; Cengage Learning: Boston, MA, USA, 2016. [Google Scholar]
  88. Amianto, F.; Ferrero, A.; Pierò, A.; Cairo, E.; Rocca, G.; Simonelli, B.; Fassina, S.; Abbate-Daga, G.; Fassino, S. Supervised Team Management, with or without Structured Psychotherapy, in Heavy Users of a Mental Health Service with Borderline Personality Disorder: A Two-Year Follow-up Preliminary Randomized Study. BMC Psychiatry 2011, 11, 181. [Google Scholar] [CrossRef]
Figure 1. Example question–answer items from the PsyFactQA dataset.
Figure 2. Total QA accuracy of LLMs on the PsyFactQA dataset. LLMs with black names are non-reasoning models; those with blue names are reasoning models. The bar chart shows the accuracy for closed-ended (light bars) and open-ended (dark bars) questions. Overlaid lines represent response latency.
Figure 3. WinRate and DrawRate (%) under isolated evaluation modes. Each subplot compares outcomes between two values of a variable: left = False/0.00, right = True/1.00. Deep blue indicates the proportion of wins for the left value, medium blue for the right value, and soft blue for draws. The top row shows results for the closed-ended QA setting and the bottom row for the open-ended setting.
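For readers who want to reproduce the style of comparison shown in Figure 3, the snippet below sketches one plausible way to tally WinRate and DrawRate between the two values of a single evaluation-mode variable. It is a minimal sketch only: it assumes the unit of comparison is a per-model pair of accuracy scores, and the function name win_draw_rates and the tie tolerance eps are illustrative choices rather than the study's actual procedure; the example accuracies are taken from Table 4.

```python
# Minimal sketch of a WinRate/DrawRate tally in the spirit of Figure 3.
# Assumption: each entry pairs one LLM's accuracy with a variable off (False/0.00)
# versus on (True/1.00). The per-model unit of comparison and the tie tolerance
# `eps` are illustrative, not the authors' exact procedure.

def win_draw_rates(paired_scores, eps=1e-9):
    """paired_scores: list of (acc_left, acc_right) tuples, one per LLM."""
    left_wins = right_wins = draws = 0
    for acc_left, acc_right in paired_scores:
        if abs(acc_left - acc_right) <= eps:
            draws += 1
        elif acc_left > acc_right:
            left_wins += 1
        else:
            right_wins += 1
    n = len(paired_scores)
    return {
        "WinRate(left)": 100.0 * left_wins / n,
        "WinRate(right)": 100.0 * right_wins / n,
        "DrawRate": 100.0 * draws / n,
    }

# Example: total closed-ended accuracy with temperature 0.00 vs. 1.00
# for Qwen3-32B, Qwen3-14B, and Qwen3-8B (values from Table 4).
print(win_draw_rates([(82.66, 82.08), (77.60, 77.72), (73.04, 72.97)]))
```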
Table 1. Statistics of the PsyFactQA dataset. The left three columns report the closed-ended setting and the right three the open-ended setting.
Question Type | Counselor (Closed) | Therapist (Closed) | Total (Closed) | Counselor (Open) | Therapist (Open) | Total (Open)
Single-Choice | 2001 | 2632 | 4633 | 1679 | 1388 | 3067
Multiple-Choice | 1372 | 0 | 1372 | 365 | 0 | 365
Total | 3373 | 2632 | 6005 | 2044 | 1388 | 3432
Table 2. Performance of the LLMs on PsyFactQA. RL indicates the average response latency (in seconds). The left five metric columns report the closed-ended QA setting and the right five the open-ended QA setting. Bold and underlined scores are the best and second-best performances in each category, respectively. Top-2 results among non-reasoning and reasoning models are highlighted by box and dashed box, respectively.
LLMs | Closed: Counselor Single (%) | Closed: Multiple (%) | Closed: Therapist Single (%) | Closed: Total (%) | Closed: RL (s) | Open: Counselor Single (%) | Open: Multiple (%) | Open: Therapist Single (%) | Open: Total (%) | Open: RL (s)
Qwen3-32B | 88.96 | 69.75 | 83.28 | 82.08 | 0.63 | 63.61 | 57.26 | 62.90 | 62.65 | 2.49
Qwen3-14B | 84.36 | 64.21 | 79.71 | 77.72 | 0.25 | 58.43 | 51.78 | 60.01 | 58.36 | 0.80
Qwen3-8B | 79.66 | 57.00 | 76.22 | 72.97 | 0.21 | 49.26 | 41.64 | 55.33 | 50.90 | 0.75
Qwen3-235B-A22B | 91.20 | 67.93 | 86.97 | 84.03 | 0.75 | 63.19 | 53.97 | 67.51 | 63.96 | 0.33
Qwen3-30B-A3B | 84.56 | 60.42 | 81.27 | 77.60 | 0.17 | 56.22 | 47.12 | 58.36 | 56.12 | 0.60
DeepSeek-V3 | 88.41 | 72.38 | 85.14 | 83.31 | 1.29 | 67.18 | 57.81 | 65.13 | 65.36 | 5.23
Qwen2.5-72B | 92.75 | 73.11 | 86.82 | 85.66 | 0.32 | 63.19 | 59.73 | 68.08 | 64.80 | 1.70
Qwen2.5-32B | 94.40 | 80.39 | 68.96 | 80.05 | 0.14 | 64.38 | 55.89 | 64.70 | 63.61 | 2.11
Qwen2.5-14B | 89.71 | 71.79 | 65.31 | 74.92 | 0.13 | 62.54 | 58.36 | 62.68 | 62.15 | 1.63
Qwen2.5-7B | 87.06 | 70.34 | 62.54 | 72.49 | 0.08 | 55.45 | 51.23 | 55.69 | 55.10 | 0.64
GLM-4-0414-32B | 83.71 | 62.61 | 81.35 | 77.85 | 0.15 | 56.76 | 43.29 | 58.57 | 56.06 | 0.73
GLM-4-0414-9B | 77.01 | 51.09 | 73.52 | 69.56 | 0.13 | 48.00 | 39.73 | 53.17 | 49.21 | 1.41
GLM-4-9B | 72.16 | 49.71 | 72.76 | 67.29 | 0.19 | 43.90 | 34.79 | 52.45 | 46.39 | 0.71
ChatGLM3-6B | 44.28 | 28.50 | 42.02 | 39.68 | 0.09 | 13.28 | 11.23 | 24.14 | 17.45 | 0.86
ChatGLM2-6B | 50.17 | 29.52 | 48.82 | 44.86 | 0.18 | 28.65 | 24.11 | 34.29 | 30.45 | 0.82
ChatGLM-6B | 38.33 | 11.22 | 35.56 | 30.92 | 1.85 | 24.18 | 18.08 | 26.95 | 24.65 | 1.79
Qwen-72B | 86.51 | 65.38 | 81.31 | 79.40 | 0.24 | 41.33 | 30.14 | 55.62 | 45.92 | 1.01
Qwen-14B | 65.22 | 40.45 | 66.98 | 60.33 | 0.20 | 40.62 | 36.16 | 47.55 | 42.95 | 0.95
Qwen-7B | 56.02 | 27.11 | 56.00 | 49.41 | 0.33 | 34.78 | 32.88 | 41.43 | 37.27 | 1.12
Baichuan2-13B | 51.17 | 23.69 | 50.15 | 44.45 | 0.12 | 30.61 | 26.30 | 38.62 | 33.39 | 0.99
Baichuan2-7B | 54.57 | 7.65 | 52.58 | 42.98 | 0.07 | 34.01 | 26.03 | 39.12 | 35.23 | 0.63
ERNIE-4.5 | 95.85 | 82.73 | 88.98 | 89.84 | 0.61 | 77.49 | 73.70 | 74.78 | 75.99 | 2.05
ERNIE-4 | 85.21 | 62.32 | 80.28 | 77.82 | 1.04 | 67.30 | 60.27 | 64.12 | 65.27 | 2.29
GPT-4o | 76.66 | 53.06 | 80.24 | 72.84 | 0.75 | 53.60 | 40.82 | 61.38 | 55.39 | 2.10
ChatGPT | 53.92 | 29.01 | 58.70 | 50.32 | 0.72 | 37.05 | 24.93 | 45.53 | 39.19 | 1.63
Qwen3-32B-Think | 87.61 | 38.78 | 82.86 | 74.37 | 15.07 | 57.18 | 54.79 | 57.71 | 57.14 | 26.72
Qwen3-14B-Think | 85.21 | 38.78 | 79.37 | 72.04 | 11.72 | 52.65 | 46.03 | 56.12 | 53.35 | 18.52
Qwen3-8B-Think | 80.56 | 35.50 | 77.32 | 68.84 | 14.84 | 47.41 | 36.44 | 51.37 | 47.84 | 20.65
Qwen3-235B-A22B-Think | 85.91 | 68.73 | 83.28 | 80.83 | 14.03 | 55.15 | 42.47 | 56.20 | 54.22 | 31.48
Qwen3-30B-A3B-Think | 84.91 | 47.67 | 81.91 | 75.09 | 7.98 | 51.10 | 40.55 | 54.90 | 51.52 | 14.40
QwQ-32B | 89.46 | 46.94 | 84.65 | 77.64 | 21.91 | 52.95 | 39.45 | 52.88 | 51.49 | 36.08
DeepSeek-R1 | 91.35 | 76.46 | 81.91 | 83.81 | 20.92 | 70.34 | 64.11 | 71.47 | 70.13 | 35.69
DeepSeek-R1-Distill-Qwen-32B | 83.96 | 42.06 | 81.00 | 73.09 | 8.74 | 45.15 | 31.78 | 47.69 | 44.76 | 17.30
DeepSeek-R1-Distill-Qwen-7B | 51.37 | 18.08 | 46.50 | 41.63 | 11.24 | 15.55 | 12.05 | 19.74 | 16.87 | 13.64
GLM-Z1-0414-32B | 84.41 | 43.95 | 80.78 | 73.57 | 16.27 | 48.12 | 35.89 | 51.30 | 48.11 | 29.84
GLM-Z1-0414-9B | 69.82 | 26.60 | 69.41 | 59.77 | 18.14 | 28.83 | 20.55 | 37.32 | 31.38 | 32.29
ERNIE-X1 | 78.51 | 63.41 | 74.66 | 73.37 | 44.20 | 64.92 | 58.36 | 65.71 | 64.54 | 38.61
o4-mini | 78.71 | 55.25 | 79.67 | 73.77 | 7.73 | 57.95 | 46.03 | 67.29 | 60.46 | 2.49
o3-mini | 74.11 | 52.48 | 77.01 | 70.44 | 5.14 | 57.36 | 47.95 | 64.77 | 59.35 | 14.81
Table 3. Performance comparison between reasoning and non-reasoning LLMs on PsyFactQA, reported as the ratio of each reasoning model's score to that of its non-reasoning counterpart. RL: average response latency (seconds). The left five ratio columns correspond to the closed-ended QA setting and the right five to the open-ended setting. Bold: ratios > 1.
Reasoning LLM | Non-Reasoning LLM | Closed: Counselor Single | Closed: Multiple | Closed: Therapist Single | Closed: Total | Closed: RL | Open: Counselor Single | Open: Multiple | Open: Therapist Single | Open: Total | Open: RL
Qwen3-32B-Think | Qwen3-32B | 0.98 | 0.56 | 0.99 | 0.91 | 23.92 | 0.90 | 0.96 | 0.92 | 0.91 | 10.73
Qwen3-14B-Think | Qwen3-14B | 1.01 | 0.60 | 1.00 | 0.93 | 46.88 | 0.90 | 0.89 | 0.94 | 0.91 | 23.15
Qwen3-8B-Think | Qwen3-8B | 1.01 | 0.62 | 1.01 | 0.94 | 70.67 | 0.96 | 0.88 | 0.93 | 0.94 | 27.53
Qwen3-235B-A22B-Think | Qwen3-235B-A22B | 0.94 | 1.01 | 0.96 | 0.96 | 18.71 | 0.87 | 0.79 | 0.83 | 0.85 | 95.39
Qwen3-30B-A3B-Think | Qwen3-30B-A3B | 1.00 | 0.79 | 1.01 | 0.97 | 46.94 | 0.91 | 0.86 | 0.94 | 0.92 | 24.00
DeepSeek-R1 | DeepSeek-V3 | 1.03 | 1.06 | 0.96 | 1.01 | 16.22 | 1.05 | 1.11 | 1.10 | 1.07 | 6.82
DeepSeek-R1-32B | Qwen2.5-32B | 0.97 | 0.95 | 1.19 | 1.05 | 149.43 | 1.09 | 1.15 | 1.10 | 1.10 | 16.91
DeepSeek-R1-7B | Qwen2.5-7B | 0.59 | 0.26 | 0.74 | 0.57 | 140.50 | 0.28 | 0.24 | 0.35 | 0.31 | 21.31
QwQ-32B | Qwen2.5-32B | 0.95 | 0.58 | 1.23 | 0.97 | 156.50 | 0.82 | 0.71 | 0.82 | 0.81 | 17.10
GLM-Z1-32B-0414 | GLM-4-32B-0414 | 1.01 | 0.70 | 0.99 | 0.95 | 108.47 | 0.85 | 0.83 | 0.88 | 0.86 | 40.88
GLM-Z1-9B-0414 | GLM-4-9B-0414 | 0.91 | 0.52 | 0.94 | 0.86 | 139.54 | 0.60 | 0.52 | 0.70 | 0.64 | 22.90
ERNIE-X1 | ERNIE-4.5 | 0.82 | 0.77 | 0.84 | 0.82 | 72.46 | 0.84 | 0.79 | 0.88 | 0.85 | 18.83
o4-mini | GPT-4o | 1.03 | 1.04 | 0.99 | 1.01 | 10.31 | 1.08 | 1.13 | 1.10 | 1.09 | 1.19
o3-mini | GPT-4o | 0.97 | 0.99 | 0.96 | 0.97 | 6.85 | 1.07 | 1.17 | 1.06 | 1.07 | 7.05
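The ratios in Table 3 can be read directly as each reasoning model's Table 2 score divided by the score of its non-reasoning counterpart (e.g., 87.61/88.96 ≈ 0.98 for the first closed-ended cell). The short sketch below recomputes the first row as a sanity check; the table2 dictionary and the ratios helper are ad hoc names introduced here for illustration, not code from the study.

```python
# Illustrative recomputation of the Table 3 ratios from the Table 2 scores.
# The dictionary and helper below are ad hoc for this sketch, not the paper's code.

table2 = {
    # model: (counselor_single, counselor_multiple, therapist_single, total, rl_seconds)
    "Qwen3-32B-Think": (87.61, 38.78, 82.86, 74.37, 15.07),
    "Qwen3-32B":       (88.96, 69.75, 83.28, 82.08, 0.63),
}

def ratios(reasoning, baseline):
    """Element-wise ratio of a reasoning model's metrics to its non-reasoning counterpart."""
    return tuple(round(r / b, 2) for r, b in zip(table2[reasoning], table2[baseline]))

# Matches the first closed-ended row of Table 3: (0.98, 0.56, 0.99, 0.91, 23.92)
print(ratios("Qwen3-32B-Think", "Qwen3-32B"))
```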
Table 4. Performance of Qwen2.5 and Qwen3 on PsyFactQA across different evaluation modes. CoT and FS refer to the chain-of-thought and few-shot evaluation modes, respectively. τ denotes the temperature hyperparameter used during evaluation. RL indicates the average request-to-output elapsed time in seconds. The left five metric columns report the closed-ended QA setting and the right five the open-ended QA setting. Bold and underlined accuracies indicate the best and second-best performance, both within each model's evaluation modes and across all evaluations.
LLMs | CoT | FS | τ | Closed: Counselor Single (%) | Closed: Multiple (%) | Closed: Therapist Single (%) | Closed: Total (%) | Closed: RL (s) | Open: Counselor Single (%) | Open: Multiple (%) | Open: Therapist Single (%) | Open: Total (%) | Open: RL (s)
Qwen3-32B | No | No | 0.00 | 89.11 | 71.21 | 83.74 | 82.66 | 0.63 | 64.92 | 55.34 | 66.35 | 64.48 | 2.37
Qwen3-32B | No | No | 1.00 | 88.96 | 69.75 | 83.28 | 82.08 | 0.63 | 63.61 | 57.26 | 62.90 | 62.65 | 2.49
Qwen3-32B | Yes | No | 0.00 | 86.66 | 63.05 | 82.29 | 79.35 | 5.17 | 46.93 | 32.60 | 50.14 | 46.71 | 9.45
Qwen3-32B | Yes | No | 1.00 | 85.16 | 60.86 | 80.40 | 77.52 | 5.33 | 48.48 | 30.96 | 50.07 | 47.26 | 10.05
Qwen3-32B | No | Yes | 0.00 | 90.70 | 75.73 | 84.04 | 84.36 | 0.48 | 62.42 | 44.11 | 58.93 | 59.06 | 2.05
Qwen3-32B | No | Yes | 1.00 | 90.90 | 75.22 | 84.42 | 84.48 | 0.49 | 61.11 | 44.38 | 60.66 | 59.15 | 2.08
Qwen3-32B | Yes | Yes | 0.00 | 88.31 | 64.21 | 83.47 | 80.68 | 4.92 | 53.72 | 35.62 | 53.67 | 51.78 | 8.52
Qwen3-32B | Yes | Yes | 1.00 | 87.86 | 64.65 | 82.37 | 80.15 | 4.99 | 52.17 | 37.81 | 53.89 | 51.34 | 9.14
Qwen3-14B | No | No | 0.00 | 84.11 | 63.99 | 79.75 | 77.60 | 0.25 | 58.96 | 51.78 | 60.95 | 59.00 | 0.97
Qwen3-14B | No | No | 1.00 | 84.36 | 64.21 | 79.71 | 77.72 | 0.25 | 58.43 | 51.78 | 60.01 | 58.36 | 0.80
Qwen3-14B | Yes | No | 0.00 | 83.21 | 55.83 | 78.34 | 74.82 | 3.11 | 42.88 | 31.23 | 47.77 | 43.62 | 4.05
Qwen3-14B | Yes | No | 1.00 | 82.51 | 55.47 | 78.61 | 74.62 | 3.15 | 43.72 | 35.07 | 47.91 | 44.49 | 4.04
Qwen3-14B | No | Yes | 0.00 | 85.61 | 63.41 | 80.40 | 78.25 | 0.21 | 58.49 | 42.19 | 59.01 | 56.96 | 0.91
Qwen3-14B | No | Yes | 1.00 | 85.26 | 62.97 | 80.05 | 77.89 | 0.21 | 58.25 | 45.75 | 58.36 | 56.96 | 0.89
Qwen3-14B | Yes | Yes | 0.00 | 82.51 | 63.05 | 78.91 | 76.49 | 3.15 | 46.99 | 32.88 | 48.63 | 46.15 | 3.86
Qwen3-14B | Yes | Yes | 1.00 | 84.81 | 58.67 | 78.76 | 76.19 | 3.11 | 46.75 | 34.79 | 49.93 | 46.77 | 3.89
Qwen3-8B | No | No | 0.00 | 79.71 | 57.58 | 76.03 | 73.04 | 0.21 | 49.08 | 41.92 | 55.76 | 51.02 | 0.87
Qwen3-8B | No | No | 1.00 | 79.66 | 57.00 | 76.22 | 72.97 | 0.21 | 49.26 | 41.64 | 55.33 | 50.90 | 0.75
Qwen3-8B | Yes | No | 0.00 | 79.66 | 50.87 | 75.53 | 71.27 | 3.29 | 37.46 | 25.48 | 42.36 | 38.17 | 3.41
Qwen3-8B | Yes | No | 1.00 | 79.51 | 45.34 | 75.23 | 69.83 | 3.19 | 39.31 | 25.75 | 42.72 | 39.25 | 3.47
Qwen3-8B | No | Yes | 0.00 | 79.66 | 58.67 | 76.37 | 73.42 | 0.18 | 53.72 | 42.47 | 50.43 | 51.19 | 0.58
Qwen3-8B | No | Yes | 1.00 | 79.66 | 59.40 | 76.41 | 73.61 | 0.18 | 53.78 | 41.92 | 50.58 | 51.22 | 0.70
Qwen3-8B | Yes | Yes | 0.00 | 80.71 | 54.81 | 76.10 | 72.77 | 2.38 | 40.02 | 24.38 | 40.71 | 38.64 | 3.33
Qwen3-8B | Yes | Yes | 1.00 | 80.16 | 52.55 | 75.46 | 71.79 | 2.47 | 40.98 | 28.77 | 41.14 | 39.74 | 3.37
Qwen2.5-32B | No | No | 0.00 | 94.45 | 80.32 | 68.58 | 79.88 | 0.14 | 63.73 | 66.85 | 65.49 | 64.77 | 1.90
Qwen2.5-32B | No | No | 1.00 | 94.40 | 80.39 | 68.96 | 80.05 | 0.14 | 63.43 | 67.67 | 65.27 | 64.63 | 2.11
Qwen2.5-32B | Yes | No | 0.00 | 85.56 | 65.31 | 66.07 | 72.39 | 4.41 | 56.22 | 45.48 | 55.04 | 54.60 | 9.80
Qwen2.5-32B | Yes | No | 1.00 | 84.91 | 59.11 | 64.17 | 69.93 | 5.02 | 54.62 | 47.40 | 56.41 | 54.57 | 10.13
Qwen2.5-32B | No | Yes | 0.00 | 94.85 | 83.82 | 70.67 | 81.73 | 0.28 | 59.20 | 58.63 | 62.39 | 60.43 | 2.41
Qwen2.5-32B | No | Yes | 1.00 | 94.75 | 84.33 | 71.16 | 82.03 | 0.31 | 59.86 | 56.44 | 61.82 | 60.29 | 2.74
Qwen2.5-32B | Yes | Yes | 0.00 | 88.61 | 67.86 | 65.01 | 73.52 | 3.13 | 61.41 | 52.33 | 58.50 | 59.27 | 6.95
Qwen2.5-32B | Yes | Yes | 1.00 | 86.66 | 61.95 | 65.05 | 71.54 | 3.78 | 60.21 | 49.86 | 57.78 | 58.13 | 8.48
Qwen2.5-14B | No | No | 0.00 | 90.00 | 72.67 | 65.31 | 75.22 | 0.11 | 59.98 | 65.75 | 62.32 | 61.54 | 1.44
Qwen2.5-14B | No | No | 1.00 | 89.71 | 71.79 | 65.31 | 74.92 | 0.13 | 59.80 | 64.66 | 61.10 | 60.84 | 1.63
Qwen2.5-14B | Yes | No | 0.00 | 83.86 | 61.08 | 61.78 | 68.98 | 3.63 | 52.47 | 47.67 | 50.58 | 51.19 | 8.61
Qwen2.5-14B | Yes | No | 1.00 | 81.76 | 50.95 | 61.66 | 65.91 | 4.32 | 53.01 | 51.78 | 52.02 | 52.48 | 9.18
Qwen2.5-14B | No | Yes | 0.00 | 89.66 | 76.31 | 63.87 | 75.30 | 0.28 | 51.22 | 50.14 | 64.19 | 56.35 | 3.04
Qwen2.5-14B | No | Yes | 1.00 | 89.76 | 76.38 | 63.91 | 75.37 | 0.32 | 53.13 | 46.58 | 64.84 | 57.17 | 3.51
Qwen2.5-14B | Yes | Yes | 0.00 | 84.91 | 63.48 | 62.31 | 70.11 | 3.22 | 57.59 | 53.15 | 57.13 | 56.93 | 6.53
Qwen2.5-14B | Yes | Yes | 1.00 | 84.41 | 57.14 | 61.17 | 67.99 | 3.87 | 56.58 | 51.51 | 57.35 | 56.35 | 8.17
Qwen2.5-7B | No | No | 0.00 | 86.71 | 70.77 | 62.73 | 72.56 | 0.08 | 52.83 | 56.71 | 52.88 | 53.26 | 0.64
Qwen2.5-7B | No | No | 1.00 | 87.06 | 70.34 | 62.54 | 72.49 | 0.08 | 50.21 | 54.25 | 50.86 | 50.90 | 0.64
Qwen2.5-7B | Yes | No | 0.00 | 79.21 | 52.04 | 57.83 | 63.63 | 2.75 | 47.41 | 40.82 | 48.49 | 47.14 | 5.75
Qwen2.5-7B | Yes | No | 1.00 | 79.21 | 45.26 | 57.60 | 61.98 | 2.82 | 44.55 | 42.19 | 45.75 | 44.78 | 6.11
Qwen2.5-7B | No | Yes | 0.00 | 88.11 | 73.62 | 62.84 | 73.72 | 0.10 | 40.98 | 34.52 | 46.11 | 42.37 | 1.01
Qwen2.5-7B | No | Yes | 1.00 | 87.91 | 73.25 | 63.22 | 73.74 | 0.10 | 39.31 | 34.52 | 44.09 | 40.73 | 0.76
Qwen2.5-7B | Yes | Yes | 0.00 | 82.16 | 57.22 | 57.94 | 65.85 | 2.10 | 50.86 | 46.03 | 51.87 | 50.76 | 3.74
Qwen2.5-7B | Yes | Yes | 1.00 | 80.71 | 51.02 | 57.33 | 63.68 | 2.33 | 47.35 | 40.00 | 49.64 | 47.49 | 4.56
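Table 4 sweeps three evaluation-mode variables per model, CoT on/off, FS on/off, and temperature τ ∈ {0.00, 1.00}, giving eight runs per model. The sketch below shows how such a grid might be enumerated; the EvalConfig structure and the evaluation_grid helper are illustrative stand-ins rather than the authors' evaluation harness.

```python
# Minimal sketch of the 2 x 2 x 2 evaluation grid behind Table 4
# (CoT on/off, few-shot on/off, temperature 0.00 or 1.00).
# The EvalConfig structure is an illustrative stand-in, not the paper's code.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class EvalConfig:
    model: str
    cot: bool           # chain-of-thought prompting
    few_shot: bool      # in-context examples in the prompt
    temperature: float  # sampling temperature tau

def evaluation_grid(models):
    return [
        EvalConfig(model, cot, few_shot, tau)
        for model, cot, few_shot, tau in product(models, (False, True), (False, True), (0.00, 1.00))
    ]

configs = evaluation_grid(["Qwen3-32B", "Qwen2.5-32B"])
assert len(configs) == 16  # eight evaluation modes per model
```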