CPG-EVAL: Evaluating the Readiness of Large Language Models as Assistants and Teammates in Language Teaching
Abstract
1. Introduction
2. Related Work
3. The Construction of CPG-EVAL
3.1. Problem Definition & Task Design
3.2. Grammar Item
- Comprehensive Coverage: The CGLM offers a grammar outline closely aligned with the Standards, covering all proficiency levels from beginner to advanced and mapping onto international examination systems such as HSK and YCT.
- Architectural Integrity: Grammar items in the CGLM are not listed in isolation; they are instead systematically categorized and integrated according to dimensions such as phonetics, vocabulary, semantics, and pragmatics, establishing explicit interconnections and a pedagogically sequenced progression of grammatical functions.
- Stage Appropriateness: The CGLM adjusts the cognitive load for learners at different proficiency levels, ensuring that grammar items at each level are both challenging and achievable.
- Teaching Effectiveness: Developed from over two decades of grammar teaching and assessment practice, the CGLM incorporates extensive empirical teaching data and classroom feedback, which verify the strong correlation between grammar points and teaching activities and demonstrate its effectiveness.
- Educational Compatibility: The CGLM aligns closely with the grammar distribution patterns in mainstream international Chinese textbooks, such as “Developing Chinese”, “New Practical Chinese Reader”, and “HSK Standard Course”, ensuring high compatibility between the grammar benchmark and textbook content.
3.3. Language Instance
3.4. Question Construction
- 1. SINGLE: The Single Instance Mapping Test assesses whether LLMs can accurately identify the correspondence between a single language instance and the indicated grammar item, following the basic format of P-GPR defined by [11]. SINGLE includes two subtasks: SINGLE-T (the answer is T) and SINGLE-F (the answer is F).
Prompt template: 请判断:在句子{Language Instance}中, 是否包含语法点{Grammar Item}? 如果包含, 请仅回答T, 如果不包含, 请仅回答F。(Please evaluate: Does {Language Instance} contain the grammar point {Grammar Item}? If yes, output only T; if no, output only F.)
Example: 请判断:在句子“我不会弹钢琴。”中, 是否包含语法点 [能愿动词:会]? 如果包含, 请仅回答T, 如果不包含, 请仅回答F。(Please evaluate: Does the sentence “wǒ bù huì tán gāng qín.” (I cannot play the piano.) contain the grammar point [néng yuàn dòng cí: huì] (modal verb huì: can, will)? If yes, output only T; if no, output only F.)
Answer: T
- 2. BATCH: The Batch Grammar-Instance Mapping Test assesses the ability of LLMs to determine the correspondence between multiple language instances and the indicated grammar item. It therefore probes whether an LLM can (i) maintain stable grammar–instance judgments under higher contextual load and (ii) reliably follow item-wise evaluation instructions by assigning a separate T/F label to each sentence, rather than collapsing the set into a single global impression.
Prompt template: 请对1.号到9.号每个句子依次进行判断:如果包含语法点{Grammar Item}, 回答T。如果不包含, 回答F。(Please evaluate sentences 1 through 9 one by one: If a sentence contains the grammar point {Grammar Item}, respond with T. If not, respond with F. Use only T or F for each sentence. No explanation is needed.) {Language Instance} × 9
Example: 请对1.号到9.号每个句子依次进行判断:如果包含语法点 [能愿动词:能], 回答T。如果不包含, 回答F。(Please evaluate sentences 1 through 9 one by one: If a sentence contains the grammar point [néng yuàn dòng cí: néng] (modal verb néng: be able to), respond with T. If not, respond with F. Use only T or F for each sentence. No explanation is needed.)
- 我不会弹钢琴。(I can’t play the piano.)
- 我妹妹会画画。(My younger sister can draw.)
- 他会参加下周的会议吗? (Will he attend next week’s meeting?)
- 你会游泳吗? (Can you swim?)
- 她会跳舞, 不会唱歌。(She can dance but can’t sing.)
- 这孩子会说话了。(This child can speak now.)
- 爷爷不会用电脑。(Grandpa can’t use a computer.)
- 你会不会做饭? (Can you cook or not?)
- 明天她不会来学校。(She won’t come to school tomorrow.)
Answer: FFFFFFFFF
- 3. SIM-GRA: The Similarity Grammar Discrimination Test evaluates whether LLMs can distinguish between grammatically similar grammar items. The four distractor grammar items are derived by computing semantic vector similarity with embedding models: OpenAI’s text-embedding-3-large model generates vector representations for each of the 739 grammar entries, similarity scores are computed between each entry and the remaining 738 entries, and the four most similar entries are selected as distractors (a sketch of this selection procedure appears after this list). This design explicitly tests whether LLMs can map observed linguistic forms back to the correct pedagogical abstraction, rather than merely verifying a proposed rule, thereby challenging their ability to perform fine-grained grammatical discrimination under conditions of high conceptual similarity.
Prompt template: 句子{Language Instance}最适合作为以下哪个语法点的例句? 请只使用选项字母回答。(Which grammar point does the sentence {Language Instance} best exemplify? Please answer using only the option letter.) {Grammar Item} × 5
Example: 句子“他没有哥哥。”最适合作为以下哪个语法点的例句? 请只使用选项字母回答。(Which grammar point does the sentence “tā méi yǒu gē gē.” (He doesn’t have an older brother.) best exemplify? Please answer using only the option letter.)
- A. “有”字句:表示附着:主语+动词+有+宾语 (“yǒu”-construction: indicating attachment: Subject + Verb + yǒu + Object)
- B. “有”字句:表示领有 (“yǒu”-construction: indicating possession)
- C. “有”字句:表示存在、具有:主语+有+着+宾语 (“yǒu”-construction: indicating existence/possession: Subject + yǒu + zhe + Object)
- D. “有”字句:表示比较 (“yǒu”-construction: indicating comparison)
- E. “有”字句:表示存在 (“yǒu”-construction: indicating existence)
Answer: B
- 4. CAT-GRA: The Category Grammar Selection Test evaluates the ability of LLMs to differentiate among grammar items within the same grammatical category. In CAT-GRA, all grammar items belonging to the same grammatical category as the correct answer are provided as options. Rather than merely increasing the number of options, this task is intentionally designed to challenge LLMs’ language processing abilities within a highly specialized and systematically organized pedagogical grammar knowledge context, where distinctions are subtle, formally constrained, and instructionally meaningful.
Prompt template: 句子{Language Instance}最适合作为以下哪个语法点的例句? (Which grammar point does the sentence {Language Instance} best exemplify?)
Example: 句子“苹果是红的。”最适合作为以下哪个语法点的例句? (Which grammar point does the sentence “píng guǒ shì hóng de.” (The apple is red.) best exemplify?)
- 1063.“是”字句:表示等同或类属 (“Shi”-construction: expressing equivalence or class membership)
- 1064.“是”字句:表示说明或特征 (“Shi”-construction: indicating description or characteristic)
- 1065.“是”字句:表示存在 (“Shi”-construction: expressing existence)
- 3113.用 “是”强调 (Using “Shi” for emphasis)
Answer: 1064
- 5. CON-INS: The Confusing Instance Discrimination Test examines whether LLMs can resist misleading linguistic cues and correctly reject language instances that appear similar to a target grammar item but do not actually instantiate it. CON-INS includes two subtasks: CON-INS-F10 (none of the instances contain the targeted grammar point) and CON-INS-T5F5 (half of the instances contain the targeted grammar point and half do not). CON-INS can be viewed as a variant of BATCH that increases the difficulty at the language-instance level by requiring LLMs to resist interference from linguistic similarities. Initially, 314 grammar items were selected following these two criteria, and DeepSeek-v3 was employed to generate sentences featuring similar linguistic forms that do not align with the targeted grammar items. Two native-speaking Chinese language teachers specializing in teaching Chinese as a second language then screened and revised the sentences. Confusing instances primarily involve three types:
- (a) Grammatical markers embedded in lexical words. For example, (1) 会 as a modal verb (“can, will”) vs. 会议 (“meeting”), where 会 is part of a noun; (2) 可 as a degree adverb (“can”) vs. 可持续 (“sustainable”), where 可 functions as part of an adjective.
- (b) Identical surface forms expressing different grammatical meanings. For example, (1) 躺着, where 着 marks the continuation of a state, vs. 跑着, where 着 marks the continuation of an action; (2) 当 in 当你去中国时 (“when you go to China”), where 当 functions as a preposition introducing a temporal clause, vs. 当 in 他当经理 (“he serves as a manager”), where 当 functions as a verb meaning “to assume a role.”
- (c) Different forms expressing similar grammatical meanings. For example, (1) 发挥着作用, where 着 marks an ongoing action, vs. 正在发挥作用, where 正在 marks an ongoing action; (2) 这 (this, colloquial/modern) vs. 此 (this, formal/archaic).
Prompt template: 请对1.号到10.号每个句子依次进行判断:如果包含语法点{Grammar Item}, 回答T。如果不包含, 回答F。(Please evaluate sentences 1 through 10 one by one: If a sentence contains the grammar point {Grammar Item}, respond with T. If not, respond with F.) {Language Instance} × 10
Example: 请对1.号到10.号每个句子依次进行判断:如果包含语法点 [经历态:用动态助词“过”表示], 回答T。如果不包含, 回答F。(Please evaluate sentences 1 through 10 one by one: If a sentence contains the grammar point [Experiential aspect: expressed by the dynamic particle “guo”], respond with T. If not, respond with F.)
- 我看过那本书。(I have read that book.)
- 我没吃过火锅。(I have never eaten hot pot.)
- 她过敏体质, 不能吃海鲜。(She has an allergic constitution and cannot eat seafood.)
- 他没学过法语。(He has never studied French.)
- 她当过学生会主席。(She once served as the student union president.)
- 这本书我已经过目了, 内容很不错。(I have already skimmed through this book; the content is very good.)
- 她过分谦虚, 反而让人觉得不真实。(She is overly modest, which makes her seem insincere.)
- 她过生日那天, 收到了很多礼物。(On her birthday, she received many gifts.)
- 他过马路时, 差点被车撞到。(He almost got hit by a car while crossing the street.)
- 她学过法语。(She has studied French.)
Answer: TTFTTFFFFT
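The distractor-selection step described for SIM-GRA (item 3 above) can be made concrete with a short sketch. It assumes the OpenAI embeddings endpoint with the text-embedding-3-large model named in the task description; the function names, the cosine-similarity implementation, and the single-request batching are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch of SIM-GRA distractor selection: embed all grammar
# entries with text-embedding-3-large, then keep the 4 most similar entries
# to each target as distractors. Names are hypothetical.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def top4_distractors(grammar_items: list[str]) -> dict[str, list[str]]:
    vecs = embed(grammar_items)
    # Normalize rows so the dot product gives cosine similarity.
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T
    np.fill_diagonal(sim, -np.inf)  # exclude the item itself
    distractors = {}
    for i, item in enumerate(grammar_items):
        nearest = np.argsort(sim[i])[::-1][:4]  # four most similar entries
        distractors[item] = [grammar_items[j] for j in nearest]
    return distractors
```

For the 739 CGLM entries, the similarity matrix is small enough to hold in memory, so the ranking can be done in a single pass as above.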
3.5. Sampling Procedure for CPG-EVAL: Core
- N is the total number of relevant items in the population (e.g., 2844 elementary-level SINGLE-T questions);
- z is the critical value corresponding to the 95% confidence level (z = 1.96);
- p is the assumed proportion (p = 0.5, used to yield the maximum required sample size);
- e is the margin of error.
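These definitions are consistent with the standard finite-population sample-size formula (Cochran's formula with a finite-population correction); a reconstruction under that assumption, with z = 1.96 and p = 0.5, is:

```latex
% Assumed finite-population sample-size formula matching the definitions above.
\[
  n \;=\; \frac{N \, z^{2} \, p\,(1-p)}{e^{2}\,(N-1) \;+\; z^{2} \, p\,(1-p)}
\]
```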
4. Evaluation
4.1. Setup and Models
4.2. Results
5. Analysis and Discussion
5.1. Results of SINGLE and BATCH
- (1) The performance of LLMs on SINGLE-T and BATCH-T tasks is significantly better than on SINGLE-F and BATCH-F tasks. This indicates that errors primarily arise from false positives, i.e., mistakenly identifying a language instance that does not contain a specific grammar item as one that does (see the scoring sketch after this list). It is particularly noteworthy that, except for Qwen2.5-7B, all other smaller models exhibit a serious false-positive problem.
- (2) Some smaller-scale LLMs demonstrate recognition abilities on the SINGLE-T task (such as Llama-3.1-8B) and the BATCH-T task (such as glm-4-9b) that are comparable to, or even exceed, those of certain larger models.
- (3) LLMs employ different strategies when determining the presence or absence of a grammar item within a language instance. Some models tend to be more assertive and more likely to provide affirmative responses, while others are relatively conservative and more inclined to give negative responses.
- (4) In the more complex batch task BATCH-F, larger models exhibit a clear advantage when handling negative instances (i.e., instances that do not contain the targeted grammar item). Under the combined difficulty of negative instances and the demand to follow batch judgment instructions, the accuracy of some smaller-scale models (such as internlm2_5-7b) even falls below the random baseline.
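To make the false-positive notion in point (1) concrete, the sketch below shows one way to score a T/F task and isolate false positives (gold label F, model answer T). It is an illustrative reconstruction under assumed input formats, not the paper's evaluation code, and the function name is hypothetical.

```python
# Illustrative scoring sketch: overall accuracy plus false-positive rate
# for T/F judgment tasks such as SINGLE-F and BATCH-F.

def score_tf_task(predictions, gold):
    """predictions/gold: lists of 'T'/'F' labels, one per language instance."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    # A false positive: the grammar point is absent (gold F),
    # but the model claims it is present (prediction T).
    false_pos = sum(p == "T" and g == "F" for p, g in zip(predictions, gold))
    n_negative = sum(g == "F" for g in gold)
    return {
        "accuracy": correct / len(gold),
        "false_positive_rate": false_pos / n_negative if n_negative else 0.0,
    }

# Example: a BATCH-F item where all nine gold labels are F
gold = ["F"] * 9
pred = list("TFFTFFFFF")           # the model wrongly answers T twice
print(score_tf_task(pred, gold))   # accuracy 7/9, false_positive_rate 2/9
```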
5.2. From the Results of SIM-GRA and CAT-GRA
- (1) Models with larger scales and more comprehensive training demonstrate greater robustness and generalization when confronted with fine-grained grammatical distinctions and specialized forms of interference.
- (2) Smaller-scale models, such as LLaMA-3.1-8B and InternLM2.5-7B, achieved relatively high scores on the SIM-GRA task (81.2% and 92.3%, respectively), but their performance dropped significantly on the CAT-GRA task (61.3% and 54.3%, respectively).
5.3. From the Results of CON-INS
- (1) As shown in Figure 2, among the five task types, CON-INS appears to exhibit the highest discriminative power, as the boxplots show the most pronounced separation among human participants and larger-scale, semi-larger-scale, and smaller-scale models.
- (2) As indicated by the comparison between F10 and T5F5 in Table 4, the greater the interference from confusing instances, the more likely models are to make errors.
- (3) Large-scale models perform relatively well, but a significant accuracy gap remains between them and humans.
- (4) For the majority of smaller-scale models, accuracy on F10 falls below the random baseline.
5.4. From Completely Failed Items
6. Conclusions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Karataş, F.; Abedi, F.Y.; Ozek Gunyel, F.; Karadeniz, D.; Kuzgun, Y. Incorporating AI in Foreign Language Education: An Investigation into ChatGPT’s Effect on Foreign Language Learners. Educ. Inf. Technol. 2024, 29, 19343–19366. [Google Scholar] [CrossRef]
- Li, B.; Lowell, V.L.; Wang, C.; Li, X. A Systematic Review of the First Year of Publications on ChatGPT and Language Education: Examining Research on ChatGPT’s Use in Language Learning and Teaching. Comput. Educ. Artif. Intell. 2024, 7, 100266. [Google Scholar] [CrossRef]
- Monk, D.H. Subject Area Preparation of Secondary Mathematics and Science Teachers and Student Achievement. Econ. Educ. Rev. 1994, 13, 125–145. [Google Scholar] [CrossRef]
- Medgyes, P. The Non-Native Teacher; Macmillan: London, UK, 1994. [Google Scholar]
- Shi, Y.; Yu, K.; Dong, Y.; Chen, F. Large Language Models in Education: A Systematic Review of Empirical Applications, Benefits, and Challenges. Comput. Educ. Artif. Intell. 2026, 10, 100529. [Google Scholar] [CrossRef]
- Giannakos, M.; Azevedo, R.; Brusilovsky, P.; Cukurova, M.; Dimitriadis, Y.; Hernandez-Leo, D.; Järvelä, S.; Mavrikis, M.; Rienties, B. The Promise and Challenges of Generative AI in Education. Behav. Inf. Technol. 2025, 44, 2518–2544. [Google Scholar] [CrossRef]
- O’Keeffe, A.; Mark, G. The English Grammar Profile of Learner Competence: Methodology and Key Findings. Int. J. Corpus Linguist. 2017, 22, 457–489. [Google Scholar] [CrossRef]
- Ying, C.; Wang, H.; Jin, H.; Li, Y.; Liu, Y. (Eds.) Chinese Proficiency Grading Standards for International Chinese Language Education Grammar Learning Manual (Elementary, Intermediate, Advanced); Beijing Language and Culture University Press: Beijing, China, 2022. [Google Scholar]
- Sunakawa, Y. A Handbook of Japanese Grammar Patterns for Teachers and Learners; Kurosio Publishers: Tokyo, Japan, 2015. [Google Scholar]
- Odlin, T. (Ed.) Perspectives on Pedagogical Grammar; Cambridge Applied Linguistics, Cambridge University Press: Cambridge, UK, 1994. [Google Scholar] [CrossRef]
- Wang, D. Evaluation of Large Language Models’ Foreign Language Teaching Ability: An Experimental Study Focusing on Pedagogical Grammar. In Proceedings of the 103rd Language and Speech Understanding and Dialogue Processing Study Group, Tokyo, Japan, 20–22 March 2025; Volume 103, pp. 80–85. [Google Scholar] [CrossRef]
- Andrews, S. Why do L2 teachers need to ‘know about language’? Teacher metalinguistic awareness and input for learning. Lang. Educ. 1999, 13, 161–177. [Google Scholar] [CrossRef]
- Andrews, S. Teacher Language Awareness, 1st ed.; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar] [CrossRef]
- Wang, W.; Yan, Y. A review of teacher language awareness (2015–2024): Current trends and future directions. J. Lang. Teach. 2024, 4, 10–20. [Google Scholar] [CrossRef]
- Andrews, S.; Svalberg, A.M.L. Teacher Language Awareness. In Language Awareness and Multilingualism; Cenoz, J., Gorter, D., May, S., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 219–231. [Google Scholar] [CrossRef]
- Andrews, S. The language awareness of the L2 teacher: Its impact upon pedagogical practice. Lang. Aware. 2001, 10, 75–90. [Google Scholar] [CrossRef]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; Linzen, T., Chrupała, G., Alishahi, A., Eds.; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 353–355. [Google Scholar] [CrossRef]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32, pp. 3266–3280. [Google Scholar] [CrossRef]
- Xu, L.; Hu, H.; Zhang, X.; Li, L.; Cao, C.; Li, Y.; Xu, Y.; Sun, K.; Yu, D.; Yu, C.; et al. CLUE: A Chinese Language Understanding Evaluation Benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020; pp. 4762–4772. [Google Scholar] [CrossRef]
- Xu, L.; Li, A.; Zhu, L.; Xue, H.; Zhu, C.; Zhao, K.; He, H.; Zhang, X.; Kang, Q.; Lan, Z. SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark. arXiv 2023, arXiv:2307.15020. [Google Scholar] [CrossRef]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 26 April–1 May 2020. [Google Scholar] [CrossRef]
- Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Trans. Mach. Learn. Res. 2023, 1–95. [Google Scholar] [CrossRef]
- Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 2567–2577. [Google Scholar] [CrossRef]
- Chen, Z.; Chen, W.; Smiley, C.; Shah, S.; Borova, I.; Langdon, D.; Moussa, R.; Beane, M.; Huang, T.H.; Routledge, B.; et al. FINQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3697–3711. [Google Scholar]
- Guha, N.; Nyarko, J.; Ho, D.; Ré, C.; Chilton, A.; Chohlas-Wood, A.; Peters, A.; Waldon, B.; Rockmore, D.; Zambrano, D.; et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 44123–44279. [Google Scholar] [CrossRef]
- Hou, J.; Ao, C.; Wu, H.; Kong, X.; Zheng, Z.; Tang, D.; Li, C.; Hu, X.; Xu, R.; Ni, S.; et al. E-EVAL: A Comprehensive Chinese k-12 Education Evaluation Benchmark for Large Language Models. arXiv 2024, arXiv:2401.15927. [Google Scholar] [CrossRef]
- Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Lei, J.; et al. C-EVAL: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 62991–63010. [Google Scholar]
- National Language Commission. Chinese Proficiency Grading Standards for International Chinese Language Education; Beijing Language and Culture University Press: Beijing, China, 2021. [Google Scholar]
- Team Qwen. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115. [Google Scholar] [CrossRef]
- Team GLM; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Zhao, H.; Lai, H.; et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv 2024, arXiv:2406.12793. [Google Scholar] [CrossRef]
- Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. InternLM2 Technical Report. arXiv 2024, arXiv:2403.17297. [Google Scholar] [CrossRef]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
- DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2025, arXiv:2412.19437. [Google Scholar] [CrossRef]
- OpenAI. GPT-4o-2024-08-06. 2024. Available online: https://platform.openai.com/docs/models/gpt-4o (accessed on 1 June 2025).
- OpenAI. GPT-4o-mini-2024-07-18. 2024. Available online: https://platform.openai.com/docs/models/gpt-4o-mini (accessed on 1 June 2025).
- Doubao Team. Doubao 1.5 Pro. 2025. Available online: https://team.doubao.com/en/special/doubao_1_5_pro (accessed on 1 June 2025).
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 9459–9474. [Google Scholar]


| Benchmark | Domain Focus | Language | Pedagogical Grammar |
|---|---|---|---|
| GLUE [17] | NLU | English | No |
| SuperGLUE [18] | NLU + Reasoning | English | No |
| CLUE [19] | NLU | Chinese | No |
| SuperCLUE [20] | NLU + Reasoning | Chinese | No |
| MMLU [21] | General Knowledge | Multilingual | No |
| BIG-bench [22] | Reasoning | Multilingual | No |
| PubMedQA [23] | Medicine | English | No |
| FINQA [24] | Finance | English | No |
| LegalBench [25] | Law | English | No |
| E-EVAL [26] | K–12 Education | Chinese | No |
| C-EVAL [27] | Multi-discipline knowledge | Chinese | Minimal |
| Question Bank | Number of Questions |
|---|---|
| SINGLE-T | 6651 |
| SINGLE-F | 11,968 |
| BATCH-T | 739 |
| BATCH-F | 11,968 |
| SIM-GRA | 6651 |
| CAT-GRA | 5283 |
| CON-INS-F10 | 314 |
| CON-INS-T5F5 | 314 |
| Model | #Parameters | Access |
|---|---|---|
| Qwen2.5-7B-Instruct | 7B | Weights |
| internlm2_5-7b-chat | 7B | Weights |
| Llama-3.1-8B-Instruct | 8B | Weights |
| GLM-4-9B-Chat | 9B | Weights |
| Qwen2.5-72B-Instruct | 72B | Weights |
| DeepSeek-V3-250324 | 660B MoE | API |
| GPT-4o-2024-08-06 | Undisclosed | API |
| GPT-4o-mini-2024-07-18 | Undisclosed | API |
| Doubao-1-5-pro-32k-250115 | Undisclosed | API |
| Doubao-1-5-lite-32k-250115 | Undisclosed | API |
| Qwen2.5-MAX-250409 | Undisclosed | API |
| Model | SINGLE-T | SINGLE-F | BATCH-T (×9) | BATCH-F (×9) | SIM-GRA | CAT-GRA | CON-INS-F10 | CON-INS-T5F5 | Average |
|---|---|---|---|---|---|---|---|---|---|
| RANDOM | 0.500 | 0.500 | 0.500 | 0.500 | 0.200 | 0.100 | 0.500 | 0.500 | 0.413 |
| Doubao-1-5-pro | 0.985 | 0.944 | 0.987 | 0.942 | 0.949 | 0.939 | 0.842 | 0.960 | 0.950 |
| GPT-4o | 0.977 | 0.939 | 0.926 | 0.934 | 0.938 | 0.906 | 0.823 | 0.935 | 0.933 |
| DeepSeek-v3 | 0.952 | 0.961 | 0.940 | 0.897 | 0.910 | 0.870 | 0.792 | 0.922 | 0.916 |
| Qwen2.5-Max | 0.989 | 0.905 | 0.849 | 0.918 | 0.903 | 0.879 | 0.796 | 0.945 | 0.910 |
| Doubao-1-5-lite | 0.952 | 0.938 | 0.921 | 0.889 | 0.908 | 0.859 | 0.716 | 0.902 | 0.903 |
| Qwen2.5-72B | 0.949 | 0.954 | 0.939 | 0.897 | 0.885 | 0.851 | 0.689 | 0.873 | 0.900 |
| GPT-4o-mini | 0.921 | 0.928 | 0.950 | 0.712 | 0.868 | 0.809 | 0.554 | 0.843 | 0.843 |
| Qwen2.5-7B | 0.906 | 0.930 | 0.717 | 0.779 | 0.838 | 0.608 | 0.600 | 0.732 | 0.797 |
| glm-4-9b | 0.953 | 0.794 | 0.930 | 0.522 | 0.765 | 0.727 | 0.428 | 0.767 | 0.750 |
| Llama-3.1-8B | 0.979 | 0.781 | 0.803 | 0.549 | 0.795 | 0.626 | 0.416 | 0.726 | 0.734 |
| internlm2_5-7b | 0.969 | 0.796 | 0.737 | 0.304 | 0.928 | 0.546 | 0.371 | 0.652 | 0.692 |
| Ave.LLM | 0.958 | 0.897 | 0.882 | 0.758 | 0.881 | 0.784 | 0.639 | 0.842 | 0.848 |
| Human-2 | 0.968 | 0.996 | 0.987 | 0.999 | 0.968 | 0.929 | 0.920 | 0.976 | 0.972 |
| Human-1 | 0.988 | 0.985 | 0.960 | 0.991 | 0.939 | 0.961 | 0.961 | 0.973 | 0.972 |
| Human-3 | 0.995 | 0.970 | 0.951 | 0.979 | 0.950 | 0.938 | 0.958 | 0.934 | 0.964 |
| Human-4 | 0.957 | 0.974 | 0.976 | 0.967 | 0.913 | 0.899 | 0.897 | 0.941 | 0.943 |
| Ave.Human | 0.977 | 0.981 | 0.969 | 0.984 | 0.943 | 0.932 | 0.934 | 0.956 | 0.963 |