Exploring GenAI-Powered Listening Test Development
Abstract
1. Introduction
2. Literature Review
2.1. CET
2.2. Approaches in Listening Assessment
2.3. Validity in Listening Assessment
2.4. Corpus-Based Approach in Listening Assessment
2.5. GenAI and Listening Test
3. Materials and Methods
3.1. Corpus Construction and Analysis
3.2. Test Development
3.3. Test Validity Measurement
3.3.1. Content Validity
3.3.2. Concurrent Validity
3.3.3. Face Validity
4. Results
4.1. Task Characteristics of CET-4 Listening Test Corpus
4.1.1. The Input
4.1.2. The Expected Response
4.2. Comparison of Task Characteristics of the Authentic Test and the GenAI Test
4.3. Students’ Performance on the Two Tests
4.4. Results of the Questionnaire
5. Discussion
6. Conclusions
Supplementary Materials
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Full Form |
|---|---|
| GenAI | Generative Artificial Intelligence |
| CET | College English Test |
| CET-4 | College English Test-Band 4 |
| IELTS | International English Language Testing System |
| TOEFL | Test of English as a Foreign Language |
| PTE | Pearson Test of English |
| AI | Artificial Intelligence |
| EFL | English as a Foreign Language |
| LLMs | Large Language Models |
| NLP | Natural Language Processing |
| TLU | Target Language Use |
| CET-SET | College English Test-Spoken English Test |
| CEFR | Common European Framework of Reference |
| AUA | Assessment Use Argument |
| GPA | Grade Point Average |
| DET | Duolingo English Test |
| BNC | British National Corpus |
| COCA | Corpus of Contemporary American English |
| MICASE | Michigan Corpus of Academic Spoken English |
| TPO | TOEFL Practice Online |
| MCQ | Multiple-Choice Questions |
| WPM | Words Per Minute |
| PHP | Progressive Hint Prompting |
| L1 | First Language |
| CSE | China’s Standards of English Language Ability |
| SWOT | Strengths, Weaknesses, Opportunities, and Threats |
| Test | Number of Tests | Section | Texts per Test | Questions per Test | Duration |
|---|---|---|---|---|---|
| CET-4 | 35 | 1. News | 3 | 7 | 25 min |
| | | 2. Conversation | 2 | 8 | |
| | | 3. Passage | 3 | 10 | |
| Total | | | 280 | 875 | 875 min |
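
As a quick check, the totals in the table above can be reproduced from the per-test figures. This is a worked sketch that assumes every one of the 35 tests has the same structure (3 news items, 2 conversations, 3 passages; 7 + 8 + 10 questions; 25 minutes) and that the last total is expressed in minutes:

```latex
\begin{aligned}
\text{texts}     &= 35 \times (3 + 2 + 3)  = 280,\\
\text{questions} &= 35 \times (7 + 8 + 10) = 875,\\
\text{duration}  &= 35 \times 25\ \text{min} = 875\ \text{min}.
\end{aligned}
```
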

| Aspect | Dimension | Category | Section | Coding |
|---|---|---|---|---|
| Input | Genre | narration, exposition, argumentation, description, and practical writing | 1, 3 | ChatGPT 4o, human |
| | Topic | life and emotions, society and current affairs, education and work, health and medicine, science and innovation, environment and ecology, economy and business, history and geography, culture and arts | 1, 2, 3 | ChatGPT 4o, human |
| | Vocabulary | token (the total number of words in a text), type (the number of unique words in a text), frequency (how often a specific word or lexical item appears in a text) | 1, 2, 3 | Vocabprofilers ¹ |
| | Difficulty | lexical, syntactical, and textual | 1, 2, 3 | Eng-Editor ² |
| | Conversational turns | number of turns in the dialogue | 2 | human |
| | Speech rate | words per minute | 1, 2, 3 | human |
| Expected response | Assessment skills | (1) understanding explicit information (comprehending the main idea, recognizing important or specific details, and identifying the speaker’s explicitly expressed viewpoints, attitude, etc.); (2) understanding implicit meaning (inferring implied meanings, identifying the communicative function of utterances, and inferring the speaker’s attitudes, viewpoints, etc.); (3) using linguistic features to comprehend listening materials (distinguishing phonological features and understanding inter-sentential relationships); (4) employing listening strategies ³ | 1, 2, 3 | ChatGPT 4o, human |
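
The vocabulary and speech-rate dimensions in the coding scheme above can be illustrated with a minimal Python sketch. It is not the procedure used in the study, which relied on an external profiler and human coding. The tokenization rule, the function names, and the sample text are all assumptions made for illustration.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract word tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lexical_counts(text: str) -> tuple[int, int]:
    """Return (tokens, types): total running words and unique word forms."""
    words = tokenize(text)
    return len(words), len(set(words))

def speech_rate_wpm(text: str, duration_seconds: float) -> float:
    """Words per minute for a recording of the given length in seconds."""
    tokens, _ = lexical_counts(text)
    return tokens * 60.0 / duration_seconds

# Hypothetical transcript and recording length, for illustration only.
sample = "The storm hit the coast on Monday. The storm closed the main road."
tokens, types = lexical_counts(sample)
print(tokens, types)                  # 13 tokens, 9 types
print(speech_rate_wpm(sample, 6.0))   # 130.0 words per minute
```

A real analysis would also need a decision on lemmatization, contractions, and numerals, since each of these changes the type count.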

| Tests | | CET-4 Listening Corpus | | |
|---|---|---|---|---|
| Section | | News | Conversation | Passage |
| Genre | | Narration (52.4%); Exposition (45.7%); Argumentation (1.9%) | / | Exposition (66.7%); Argumentation (18.1%); Narration (15.2%) |
| Topic | | Society and current affairs (33.3%); Life and emotions (18%) | Life and emotions (40%); Education and work (25.7%) | Life and emotions (25.7%); Education and work (15.2%) |
| Vocabulary | Token | 173 | 274 | 241 |
| | Type | 107 | 146 | 136 |
| | Frequency | The first 4000 words covered 98.5% | The first 3000 words covered 99.2% | The first 4000 words covered 98.6% |
| Difficulty ¹ | Lexical | 5.15 | 4.15 | 4.90 |
| | Syntactical | 4.02 | 3.59 | 4.14 |
| | Textual | 5.13 | 4.11 | 4.88 |
| Conversational turns | | / | 6 | / |
| Speech rate (words per minute) | | 135 | 155 | 134 |
| Assessment skills | | Understanding explicit information (98.8%); Understanding implicit meaning (1.2%) | Understanding explicit information (95.7%); Understanding implicit meaning (4.3%) | Understanding explicit information (98.8%); Understanding implicit meaning (1.2%) |
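
The "Frequency" row above reports cumulative lexical coverage: the share of running words in a text that fall within the first N words of a frequency-ranked list. The sketch below shows that calculation under stated assumptions. The ranked list is a tiny hypothetical stand-in rather than the word list used by the profiler in the study, and `coverage` is an illustrative name.

```python
import re

def coverage(text: str, ranked_words: list[str], top_n: int) -> float:
    """Percentage of tokens in `text` found among the first `top_n` ranked words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    band = set(ranked_words[:top_n])
    covered = sum(1 for token in tokens if token in band)
    return 100.0 * covered / len(tokens)

# Hypothetical frequency-ranked list (most frequent first), for illustration only.
ranked = ["the", "a", "on", "road", "storm", "closed", "coast"]
text = "The storm closed the road on the coast"

print(coverage(text, ranked, 4))  # 62.5: 5 of 8 tokens are in the first 4 words
print(coverage(text, ranked, 7))  # 100.0: all tokens are in the first 7 words
```

Read this way, "the first 4000 words covered 98.5%" means that about 1.5% of the running words in those texts lie beyond the 4000 most frequent words of the reference list.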

| Test | | The Authentic Test | | | The GenAI Test | | |
|---|---|---|---|---|---|---|---|
| Section | | News | Conversation | Passage | News | Conversation | Passage |
| Genre | | Narration | / | Exposition | Narration | / | Exposition |
| Topic | | Fire | Teach children how to save and spend their money | Personal Space | Storm | Teach children about healthy eating habits | Body Language |
| Vocabulary | Token | 156 | 283 | 240 | 171 | 278 | 235 |
| | Type | 92 | 165 | 134 | 130 | 158 | 147 |
| | Frequency | The first 5000 words covered 98.1% | The first 3000 words covered 98.2% | The first 4000 words covered 98.8% | The first 4000 words covered 98.2% | The first 3000 words covered 98.9% | The first 4000 words covered 98.7% |
| Difficulty | Lexical | 6.66 | 4.93 | 4.99 | 5.35 | 4.20 | 5.02 |
| | Syntactical | 3.87 | 3.74 | 3.63 | 4.15 | 3.74 | 4.43 |
| | Textual | 6.17 | 4.95 | 4.94 | 5.08 | 4.28 | 4.98 |
| Conversational turns | | / | 5.5 | / | / | 5.5 | / |
| Speech rate (words per minute) | | 129 | 161 | 119 | 135 | 154 | 132 |
| Assessment skills | | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) |
| Accent | | American (female) | British (male)/American (female) | American (female) | American (female) | British (male)/American (female) | British (male) |

© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Guo, Junyan. 2026. "Exploring GenAI-Powered Listening Test Development." Languages 11, no. 1: 17. https://doi.org/10.3390/languages11010017

