Article

Exploring GenAI-Powered Listening Test Development

Department of General Education, Wuxi University, Wuxi 214105, China
Languages 2026, 11(1), 17; https://doi.org/10.3390/languages11010017
Submission received: 31 August 2025 / Revised: 10 January 2026 / Accepted: 12 January 2026 / Published: 20 January 2026

Abstract

The advent of Generative Artificial Intelligence (GenAI) has ushered in a transformative wave in language education. To date, however, GenAI applications have centered on language teaching and learning, with assessment receiving far less attention. Drawing on task characteristics identified from a corpus of authentic prior tests, this study investigated the capacity of GenAI tools to develop a short College English Test-Band 4 (CET-4) listening test and examined the degree to which its content, concurrent, and face validity corresponded to those of an authentic, human-generated counterpart. The findings indicated that the GenAI-created test aligned well with the task characteristics of the target test domain, supporting its content validity, whereas robust evidence to substantiate its concurrent and face validity was limited. Overall, GenAI has demonstrated potential in developing listening tests; however, further optimization is needed to enhance their validity. Implications for language teaching, learning, and assessment are discussed.

1. Introduction

Listening lies at “the heart of language learning” (Vandergrift, 2007, p. 191). As a multi-layered construct, listening encompasses affective, behavioral and cognitive processes (L. He & Jiang, 2020). Listening comprehension is a dynamic and complex process during which listeners constantly integrate oral information from multiple sources, break the message into manageable units, and interpret them through existing knowledge, context, and expectations to verify, adjust, or deepen their understanding (Dominguez Lucio & Aryadoust, 2023; Ockey, 2024). Since much of this process is internal, transient, and not directly observable, it poses a significant challenge for listening test developers to create tasks whose responses allow test designers to infer a test-taker’s listening competence. As a result, assessing listening in a second language is viewed as more challenging than assessing other skills (Buck, 2001) and plays a peripheral role in research (Buck, 2018), leading some to call listening the forgotten or Cinderella skill (Aryadoust & Jia, 2025; Wagner, 2022). Significant debates and unresolved issues persist regarding how best to evaluate L2 listening (Wagner, 2022), including the authenticity of task materials (e.g., the use of real-world spoken language, non-verbal components, different types of speech varieties and accents), task types and response formats (e.g., interactive speaking and listening), and test washback and impact (Aryadoust & Jia, 2025; Ockey & Wagner, 2018; Wagner, 2013). Despite these challenges, listening tests remain a crucial tool for language learning and assessment, as evidenced by their significant role in various internationally recognized proficiency tests (e.g., the International English Language Testing System, IELTS; the Test of English as a Foreign Language, TOEFL; the Pearson Test of English, PTE).
Technological innovations have brought new hope for listening assessment. On the one hand, behavioral and neurophysiological technologies enable researchers to capture listeners’ multimodal responses, such as mouse clicking, gaze behavior, and brain activity, to elucidate the listening process (Dominguez Lucio & Aryadoust, 2023; Qiu & Aryadoust, 2024; Schmidt & Holzknecht, 2024). On the other hand, and more notably, with the advent and development of Artificial Intelligence (AI), the outstanding performance of Large Language Models (LLMs) in the landscape of Natural Language Processing (NLP) has attracted widespread attention and research. Featuring a vast parameter scale and excellent language understanding and generation capabilities, these models have demonstrated tremendous potential and application value in the field of education, particularly in language assessment (Gardner et al., 2021; Hao et al., 2024). As a subfield of AI, Generative AI (or GenAI) constitutes a group of AI algorithms and models that can create original content, such as texts, images, videos and problem-solving strategies, with creativity and adaptability comparable to those of humans (R. He et al., 2025). Its characteristics include (multi-)modality, interaction, flexibility, and productivity (Ronge et al., 2025). The affordances and conveniences brought by GenAI are evident. In addition to alleviating the burden of the time- and labor-intensive test development process, GenAI tools (such as ChatGPT) show potential for accommodating individuals’ learning preferences and diverse needs (Chuang & Yan, 2025). In particular, GenAI tools hold promise for developing target-specific language assessments. These tools can be trained not only to generate test content and develop item banks for large-scale standardized tests and classroom-based assessments, but also to align assessment materials with proficiency standards and curriculum frameworks (Yan & Huang, 2025).
GenAI has been increasingly explored in the assessment of reading, speaking and writing. For example, studies have been conducted to investigate the capabilities of GenAI to generate reading texts and assessment items (e.g., Lin & Chen, 2024; Zhang et al., 2025), score and rate essays (e.g., Mizumoto et al., 2024; Saricaoglu & Bilki, 2025), and mediate (Karatay & Xu, 2025) and assess speaking (e.g., Li et al., 2025; X. Liu et al., 2025). However, its application to listening remains limited and at a preliminary, exploratory stage (Goh & Aryadoust, 2025). Initial research results have indicated GenAI’s capability to adapt reading materials, generate smooth and natural texts, design exercises of various types, and convert text to speech (Xu et al., 2024). All of these promising qualities provide strong support for the development of AI-powered listening tests. In light of these circumstances, this study aimed to utilize GenAI tools to develop a College English Test—Band 4 (CET-4) listening test. The National College English Test is a large-scale standardized exam administered twice a year by the Ministry of Education in China. Specifically, this study investigated the capability of GenAI tools to create a short CET-4 listening test grounded in task features summarized from a corpus of previous authentic tests, and examined the extent to which it aligned with an authentic human-generated CET-4 listening test with respect to content, concurrent and face validity. The implications may offer new perspectives on the digital intelligent development of second language listening assessment.

2. Literature Review

2.1. CET

CET has a history spanning nearly forty years, since 1987. As a large-scale, nation-level, standardized test, CET is organized and administered biannually (normally in June and December, with exceptions during the COVID-19 pandemic) by the National College English Testing Committee on behalf of the Higher Education Department, Ministry of Education of the People’s Republic of China (Zheng & Cheng, 2008). Under the Ministry of Education, the teaching and testing policies of College English are developed through parallel administrative and expert bodies. The two tracks (the College English teaching curricular requirements and the College English Test Syllabus) inform each other, resulting in a reciprocal relationship between teaching/learning and testing/assessment (Y. Jin, 2022). Therefore, the aim of CET is to assess whether tertiary-level non-English-major students have met the learning objectives specified in the National College English Teaching Syllabus (hereafter referred to as the National Syllabus), thereby promoting the implementation of the National Syllabus and enhancing the quality of college English teaching and learning.
Since CET is a large-scale, nation-level test, it has a clear construct definition and operational framework, systematic test specifications and item templates, and rigorous administration and ongoing validation (Y. Jin, 2019; Y. Jin & Cheng, 2013; Y. Jin et al., 2022; Y. Jin & Wu, 2017). The Target Language Use (TLU) domain of the CET encompasses general academic and real-life English suitable for non-English-major university students. CET consists of a written test (two levels: Band 4 [lower] and Band 6 [higher]) and a spoken test (College English Test-Spoken English Test, CET-SET; also two levels: Band 4 and 6). CET-4 corresponds approximately to B1 on the Common European Framework of Reference (CEFR) and CET-6 to B2. The CEFR is a standard and comprehensive framework developed by the Council of Europe to assess language proficiency. It comprises six proficiency levels: A1 (beginner), A2 (elementary), B1 (intermediate), B2 (upper intermediate), C1 (advanced), and C2 (proficient). The CET written test format has undergone several rounds of reform. The current version (since 2016) is composed of four parts: writing (15%), listening comprehension (35%), reading comprehension (35%), and translation (15%). Over the past four decades, the number of CET test-takers has increased from the initial 100,000 to around 20 million (Fan et al., 2022).

2.2. Approaches in Listening Assessment

The purpose of a listening assessment is to measure test-takers’ ability to process and understand aural language characteristic of the TLU domain (Aryadoust et al., 2024; Wagner, 2022). To unpack the multidimensional nature of listening and accurately assess test-takers’ listening abilities, Buck and Tatsuoka (1998) explored the attribute-based approach to identify cognitive and linguistic attributes underlying listening test performance. More recently, Aryadoust and Luo (2023) systematically reviewed and summarized three approaches to L2 listening assessment: the process-based approach, the subskill-based approach and the attribute-based approach. The process-based approach focuses on the cognitive functions involved in listening comprehension, which include bottom-up processing, top-down processing, memory, and the use of cognitive and metacognitive strategies, as well as other dimensions such as self-concept, attention, and concentration. The subskill-based approach emphasizes the specific skills or abilities required in listening, including knowledge of the sound system, understanding local linguistic meanings (vocabulary and syntax), understanding global meanings or inferred meanings, communicative listening ability, and integrated listening skills. The attribute-based approach encompasses factors that influence the interpretation of test results, such as task- or test-related attributes and listener-related attributes. Existing listening tests primarily target discrete subskills associated with listening comprehension (Aryadoust & Jia, 2025). According to this classification, listening assessment skills, as specified in the CET Test Syllabus (2016 Revised Edition, hereafter referred to as the Test Syllabus), comprise three subskill-based dimensions (understanding explicit information, understanding implicit meaning, and using linguistic features to comprehend listening materials) and one process-based dimension (employing listening strategies).

2.3. Validity in Listening Assessment

Once a test has been developed, it must undergo a systematic validation process. The validity of a test is “an overall evaluative judgment of the degree to which evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores” (Messick, 1989, p. 13). A listening test is valid if it accurately measures the listening ability it is intended to assess. This general, overarching notion of validity is referred to as construct validity (Hughes & Hughes, 2020). Language testers have developed frameworks to theoretically conceptualize and systematically structure the multifaceted aspects of the validation process, such as Weir’s (2005) socio-cognitive approach, Kane’s (2013) argument-based approach, and Bachman and Palmer’s (2010) Assessment Use Argument (AUA). Among the repertoire of terms, concepts and frameworks, three subordinate dimensions of validity, namely content validity, criterion-related validity and face validity, are examined in this study.
Content validity concerns the alignment between the test content and the construct it intends to measure. In other words, the test includes a representative sample of items that reflect the language skills being measured. Evidence of content validity in a listening test can be observed in various aspects, including the characteristics of task input, such as texts (in terms of type, form, lexis, and length) and recordings (in terms of speech rate, dialect, accent, and length), and the expected response, such as operational assessment subskills (both global and detailed) (Bachman & Palmer, 1996; Field, 2013; Hughes & Hughes, 2020). Previous research has examined some of these individual components. For example, subskills (e.g., identifying explicit information, deriving implicit meaning) were explored in Bourdeaud’Hui et al.’s (2021) study of the validity of a comprehensive listening test for primary school students. In another study, Nishizawa (2023) examined the validity of using a range of Englishes in the input of a post-entry English language placement test at a public university in the United States. While Aryadoust (2024) analyzed the topics and accents of the IELTS listening test to assess whether they reflected the various and diverse real-life contexts that test-takers may encounter, Nguyen (2022) investigated the frequency, structural, and functional patterns of the lexical bundles in the same test. Zhao and Aryadoust (2025) examined the semantic features of simulated mini-lectures in the listening sections of the IELTS and TOEFL. They found that content validity was relatively well supported; however, the representativeness of the commercialized listening test materials needs to be enhanced.
Criterion-related validity examines the score value, comparing it with different forms of the same test, different tests and with external standards and frameworks (Taylor, 2013). It mainly encompasses concurrent validity (when the test and the criterion are administered at about the same time) and predictive validity (the degree to which a test can predict test-takers’ future performance). For instance, Riazi (2013) examined the concurrent and predictive validity of a newly developed Pearson Test of English Academic (PTE Academic) administered to 60 international university students whose first language is not English. The concurrent validity was measured against the test-takers’ IELTS Academic scores, and the predictive validity was assessed by their academic performance, as indicated by their grade point average (GPA). Similarly, Isaacs et al. (2023) investigated the concurrent validity of the Duolingo English Test (DET) in relation to IELTS and TOEFL, as well as its predictive value for students’ academic attainment at a UK university.
Face validity refers to “the clarity, relevance, difficulty, and sensitivity of a test to its intended audience” (Allen et al., 2023, p. 154). Few empirical studies have investigated face validity, and the predominant focus is on test-takers’ views. These studies have mainly assessed whether test-takers’ perceptions of the final English curriculum assessment (Zuhairoh et al., 2024), tertiary-level English as a Foreign Language (EFL) program tests (Cinkara & Özen Tosun, 2017) and high-stakes tests (Sato & Ikeda, 2015) aligned with the test designers’ intentions. The results of these studies underscore the importance of understanding test-takers’ perceptions and incorporating face validity in test development, highlighting its positive effect on promoting students’ learning (Wang et al., 2024).

2.4. Corpus-Based Approach in Listening Assessment

A corpus is a systematically compiled electronic database of spoken and/or written language data (Park, 2014). In language assessment, the corpus-based approach involves constructing a corpus comprising authentic test materials collected over multiple years and analyzing patterns of language use to inform language assessment. Tao and Aryadoust (2024) analyzed the linguistic features and corresponding function dimensions of transcripts from the listening section of China’s national college entrance exam (Gaokao), covering 31 provinces from 2000 to 2022. In a separate study, Aryadoust (2024) examined and identified the topics and accents of the 256 listening sections of the IELTS from 1996 to 2021. Another line of research involves test developers employing representative corpora (e.g., the British National Corpus, BNC; the Corpus of Contemporary American English, COCA), learner corpora (e.g., the International Corpus of Learner English), and specialized corpora (e.g., the Michigan Corpus of Academic Spoken English, MICASE) as reference sources. In particular, specialized corpora have been used to design, develop and validate tests relevant to the target discourse. A representative example is MICASE, which served as a reference corpus in Nguyen’s (2022) study for comparing lexical bundles in a corpus of Sections 3 and 4 of the IELTS listening tests, and in Zhao and Aryadoust’s (2025) study for examining the semantic features of a mini-lecture corpus from the listening sections of IELTS (series 1–17) and TOEFL Practice Online (TPO, 1–74).

2.5. GenAI and Listening Test

As mentioned in the Introduction section, although GenAI is gaining attention and exhibiting a gradual upward trend in the foreign language testing arena, its application in listening testing is still in its nascent development stage. Taking ChatGPT as an example, Xu et al. (2024) explored the diverse applications of GenAI tools in the development of listening tests. These researchers suggested that GenAI tools can be used to adapt reading materials into listening texts, create listening audio files, produce listening vocabulary lists, and design various listening question types (e.g., single- and multiple-choice, judgment, and ordering) targeting different skills (e.g., comprehending information, understanding main ideas, and inferencing). Based on IELTS Listening Section 4, Aryadoust et al. (2024) explored the capabilities of ChatGPT 4 in developing listening test scripts and test items across a range of proficiency levels (academic, low, intermediate, and advanced). These researchers examined the lexicogrammatical features of GenAI-produced texts, as well as topic variation and the degree of overlap in the test items. Their findings indicated that ChatGPT 4 was capable of producing linguistically diverse texts for different proficiency levels and varied topics in test items. However, the generated test items showed no variation in difficulty level and a frequent overlap of information in the options. Going a step further, Runge et al. (2024) employed ChatGPT 3 to generate content for an interactive listening task for the Duolingo English Test (DET). This scenario-based conversation task involved different interlocutors (student-student, student-professor). In each round of the conversation, test-takers needed to select, from a number of options, the one that best continued the conversation. At the end of the task, test-takers produced a summary of the conversation. The large-scale pilot, which involved 713 ChatGPT-generated, human-reviewed conversations with 464 sessions per conversation, provided empirical evidence for the potential of GenAI and the necessity of human oversight in automatic assessment creation. While these studies have conducted an initial exploration of GenAI-assisted listening test development, certain aspects merit further investigation. For example, Xu et al.’s (2024) and Aryadoust et al.’s (2024) studies remained at the design stage without involving test-takers. Runge et al.’s (2024) study, albeit involving a substantial number of participants, did not collect any participant feedback. The validity of GenAI-developed tests, as well as test-takers’ attitudes toward and performance on them, requires further investigation.
The literature review provides the rationale for the present study by foregrounding a corpus-based approach to CET-4 listening test design that draws on the characteristics of authentic test tasks, while also highlighting the emerging potential of using GenAI to support test development. Specifically, this study utilized ChatGPT and MurfAI to develop a CET-4 listening test based on task characteristics derived from a corpus of authentic tests, and then assessed its alignment with a human-generated authentic CET-4 test in terms of content, concurrent, and face validity. It is guided by the following research question: to what extent is the listening test developed by GenAI similar to those created by humans?

3. Materials and Methods

3.1. Corpus Construction and Analysis

According to the Test Syllabus, the CET-4 listening test consists of three sections with 25 Multiple-Choice Questions (MCQs). Section One comprises three news reports, each containing two to three questions, totaling seven questions, and spanning a total length of 450 to 500 words. Section Two focuses on conversations, containing two dialogues of 240 to 280 words each, accompanied by four questions each, summing to eight questions in total. Section Three presents three passages, each approximately 220–240 words in length with three to four questions, totaling ten questions. The recordings are in standard British or American English, delivered at a speech rate of 120–140 words per minute (wpm). All audio recordings are played once, with 15 s allotted for answering after each question. The listening test accounts for 35% of the CET-4 total score, with 7% allocated to news, 8% to conversations, and 20% to passages. The duration is 25 min.
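For illustration, this blueprint can be encoded as a simple configuration object for downstream checks (e.g., validating generated scripts against the Syllabus). The sketch below is a minimal Python rendering; the field names are our own, and only the figures come from the Test Syllabus.

```python
# Illustrative encoding of the CET-4 listening blueprint described above.
# Field names are our own; the figures come from the Test Syllabus.
CET4_LISTENING_BLUEPRINT = {
    "news":         {"texts": 3, "questions": 7,  "words": (450, 500),  # total across the 3 reports
                     "weight": 0.07},
    "conversation": {"texts": 2, "questions": 8,  "words": (240, 280),  # per dialogue
                     "weight": 0.08},
    "passage":      {"texts": 3, "questions": 10, "words": (220, 240),  # per passage
                     "weight": 0.20},
}
SPEECH_RATE_WPM = (120, 140)   # standard British or American English
ANSWER_TIME_SEC = 15           # per question; each recording is played once
TOTAL_WEIGHT = 0.35            # listening share of the CET-4 total score
DURATION_MIN = 25
```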
As mentioned in Section 2.4, this study employed a corpus-based approach to inform the development of the CET-4 listening test. Therefore, it compiled a reference corpus of authentic tests from prior years, which served as an evidence-based foundation for modeling the task characteristics of authentic tests. The twice-yearly CET exam provided two parallel versions of the listening section each time, with some exceptions during the pandemic. Altogether thirty-five official CET-4 listening tests from June 2016 to June 2024, including listening scripts, test items and audios, were collected to construct an authentic CET-4 listening test corpus (hereafter referred to as the Corpus, see Table 1).
The task characteristics of the three sections in the corpus were analyzed and coded based on Bachman and Palmer’s (1996) and Gu and Li’s (2012) frameworks, as well as the Test Syllabus (Table 2). In the ChatGPT–human collaboration coding, the initial 30% was performed by ChatGPT and reviewed by the author and another independent expert researcher, with discrepancies resolved; the remaining 70% was coded by ChatGPT and reviewed by the author. In the human-only coding, the initial 30% was performed by the author and another independent expert researcher, with discrepancies resolved, and the remaining 70% was coded and reviewed by the author.

3.2. Test Development

Before using GenAI tools to develop the listening test, the author first selected materials from the Corpus of authentic tests as a reference and comparison. A news item, a conversation, and a passage, along with their corresponding questions, were randomly selected from the Corpus described in Section 3.1 to constitute a short authentic CET listening test (hereafter referred to as the Authentic test). This test comprised 10 questions in total: 2 from the news, 4 from the conversation, and 4 from the passage. In line with the sectional weightings of the CET-4 listening test, each news and conversation question carried 1 point, while each passage question was worth 2 points.
Based on the task characteristics of the authentic tests in the Corpus in Table 2, the author employed ChatGPT 4o to create a test. The ChatGPT-produced test (hereafter referred to as the GenAI test) also contained one news item, one conversation, and one passage, along with 10 questions. This process adopted Progressive Hint Prompting (PHP), during which the author interacted with ChatGPT, utilizing previous replies as hints and progressively refining subsequent prompts to achieve the desired output (Aryadoust et al., 2024). Taking the creation of the news item as an example, the topic of the news item from the Authentic test was fire. The initial prompt for ChatGPT 4o to generate a listening script was “Please write one piece of news on the topic of disaster, about 150 to 165 words, suitable for B1 EFL learners.” Occasionally, ChatGPT strayed from the guidelines, for example by delivering scripts outside the word range. When this happened, prompts like “Please regenerate the text to stay within the word limit, since the current one exceeds the limit” would be delivered. For each text, ChatGPT 4o was prompted to produce multiple-choice questions. The prompts specified that the questions should measure the assessment skills listed in the CET Test Syllabus (summarized in Table 2). For example, in the Authentic test, two questions assessed explicit information (recognizing important or specific details). The two questions of the GenAI test also focused on this category (e.g., “Please write two questions focusing on two details in this news item”). The vocabulary of the scripts was checked against the core vocabulary list for CET-4 in the Test Syllabus using Eng-Editor. Low-frequency words exceeding the target level were replaced with higher-frequency alternatives. After the script generation was completed, MurfAI (an AI-powered text-to-speech platform) was used to generate recordings. Speech rate was controlled, and accents were selected according to the results of the Corpus (Table 2). The Authentic test and the GenAI test are presented in the Supplementary Materials.
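For illustration, the Progressive Hint Prompting cycle described above can be rendered as a short generate-check-refine loop. The study worked with ChatGPT 4o interactively; the API model name (“gpt-4o”), prompt wording, and retry logic below are assumptions made for the sketch, not a record of the actual sessions.

```python
# A minimal sketch of the Progressive Hint Prompting (PHP) loop, assuming
# the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_script(topic: str, lo: int = 150, hi: int = 165,
                    max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content":
                 f"Please write one piece of news on the topic of {topic}, "
                 f"about {lo} to {hi} words, suitable for B1 EFL learners."}]
    reply = ""
    for _ in range(max_rounds):
        reply = client.chat.completions.create(
            model="gpt-4o", messages=messages,
        ).choices[0].message.content
        if lo <= len(reply.split()) <= hi:   # within the Syllabus word range
            return reply
        # PHP: feed the previous reply back as a hint and refine the prompt.
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content":
             "Please regenerate the text to stay within the word limit, "
             "since the current one falls outside it."},
        ]
    return reply  # fall back to the last attempt for manual revision

news_script = generate_script("disaster")
```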

3.3. Test Validity Measurement

3.3.1. Content Validity

The content validity of the GenAI test was measured by comparing its task characteristics with those of the Authentic test and the authentic test Corpus using the framework in Table 2.

3.3.2. Concurrent Validity

The two tests were administered to a sample of 295 first-year first-language (L1) Chinese university students (male: 193, female: 102) in China. Both tests were administered during normal College English course class hours with a one-week gap. The majors of these student participants included computer science and technology, optical information science and engineering, marketing, information and computational science, mechatronic engineering, software engineering, and information security. Their proficiency levels ranged from A2 to B1, as indicated by their first-term mid-term examination scores. The sample had a mean age of 18.49 years. None of them had taken CET-4 before, because first-year students are only allowed to take the CET-4 exam held in June at the end of their first year. The correlation and differences between the two sets of scores were computed to evaluate concurrent validity.
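The analyses were run in SPSS 27 (see Section 3.3.3); an equivalent sketch in Python is given below for readers who wish to replicate the procedure. The score file names are hypothetical, and scipy’s Kolmogorov–Smirnov test with estimated parameters only approximates SPSS’s Lilliefors-corrected version.

```python
# Sketch of the concurrent-validity analysis: normality checks, Spearman
# correlation, and Wilcoxon signed-rank test on the 295 paired scores.
import numpy as np
from scipy import stats

authentic = np.loadtxt("authentic_scores.txt")  # hypothetical score files
genai = np.loadtxt("genai_scores.txt")

for name, s in [("Authentic", authentic), ("GenAI", genai)]:
    ks = stats.kstest(s, "norm", args=(s.mean(), s.std(ddof=1)))
    sw = stats.shapiro(s)
    print(f"{name}: KS D = {ks.statistic:.3f} (p = {ks.pvalue:.4f}), "
          f"SW W = {sw.statistic:.3f} (p = {sw.pvalue:.4f})")

rho, p_rho = stats.spearmanr(authentic, genai)  # non-parametric correlation
wsr = stats.wilcoxon(authentic, genai)          # paired difference
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f}); "
      f"Wilcoxon statistic = {wsr.statistic:.1f} (p = {wsr.pvalue:.4f})")
```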

3.3.3. Face Validity

Face validity was assessed based on participants’ perceptions of both tests as measured by a questionnaire (see the Supplementary Materials). Students were not informed which test was generated by GenAI and which was manually created. They were asked to complete the questionnaire immediately after each test, resulting in two administrations of the same instrument. Because the tests were taken one week apart, this scheduling aimed to mitigate memory effects from repeated exposure to the questionnaire. The questionnaire consisted of two parts with 11 items in total. Part One collected background information, including gender, age, major, and self-rated listening proficiency. Part Two asked about students’ perceptions of the clarity, intonation, and speed of the recording, as well as the content, topic, item difficulty, and skills assessed in the listening test. SPSS 27 and Excel were employed to analyze students’ performance on both tests and the results of the questionnaire.
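As an illustration of how the perception items can be tabulated (the study used SPSS 27 and Excel), the snippet below assumes a hypothetical CSV export with one row per student per administration; the file and column names are our own.

```python
# Percentage distribution of each Likert-type perception item by test version.
import pandas as pd

df = pd.read_csv("questionnaire_responses.csv")  # hypothetical export

items = ["clarity", "naturalness", "speed", "content",
         "topic", "difficulty", "effectiveness"]
for item in items:
    pct = (df.groupby("test_version")[item]
             .value_counts(normalize=True)   # proportions within each version
             .mul(100).round(2))             # as percentages
    print(pct.to_string(), "\n")
```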

4. Results

4.1. Task Characteristics of CET-4 Listening Test Corpus

4.1.1. The Input

Table 3 summarizes the task characteristics of authentic CET-4 listening tests from past years in the Corpus. In terms of genre, narration occupied over half of the news reports (52.4%), followed by exposition (45.7%) and argumentation (1.9%). In the listening passages, the highest proportion went to exposition (66.7%), followed by argumentation (18.1%) and narration (15.2%). The three major topics across the listening texts were life and emotions (26.5%), society and current affairs (18.9%), and education and work (13.6%). With respect to specific sections, the most prevalent theme in the News was society and current affairs (33.3%), which was expected due to the time-sensitive nature of news reports. The dominant topic in the Conversation and Passage sections was life and emotions, accounting for 40% and 25.7%, respectively. At the vocabulary level, the average token and type counts for the three sections were 173 and 107 for News, 274 and 146 for Conversation, and 241 and 136 for Passage. The results of Vocabprofilers indicated that the most frequent 4000 words covered 98.5% of the News, 99.6% of the Conversation and 98.6% of the Passage. In addition, the vocabulary list in the Test Syllabus covered 99% of the listening test. As Nation (2006) suggested, 98% text coverage is adequate for comprehension.
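A minimal sketch of the token, type, and coverage computations reported above follows. The study used Vocabprofilers for the frequency bands and the Test Syllabus vocabulary list for coverage; the simple tokenizer and the wordlist and transcript files here are assumptions for illustration.

```python
# Token/type counts and word-list coverage for a listening transcript.
import re

def coverage(text: str, wordlist: set[str]) -> tuple[int, int, float]:
    tokens = re.findall(r"[a-z']+", text.lower())
    covered = sum(1 for t in tokens if t in wordlist)
    return len(tokens), len(set(tokens)), 100 * covered / len(tokens)

with open("cet4_wordlist.txt") as f:        # hypothetical Syllabus word list
    syllabus = set(f.read().lower().split())
with open("news_transcript.txt") as f:      # one transcript from the Corpus
    n_tokens, n_types, pct = coverage(f.read(), syllabus)

print(f"tokens = {n_tokens}, types = {n_types}, coverage = {pct:.1f}%")
```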
The difficulty level of the listening texts was measured with Eng-Editor. It was found that the average lexical (5.15) and textual difficulty (5.13) of the news reports reached the difficulty level of CET-4 reading, and the syntactic level (4.02) was equal to that of the National College Entrance Examination (Gaokao). As for the conversations, the average lexical (4.15) and textual (4.11) difficulty matched that of Gaokao, while the syntactic difficulty (3.59) corresponded to the High School Entrance Exam (Zhongkao). The average difficulty level of the passages was at the Gaokao level. The News section was the most challenging, probably because news reports often use topic-specific vocabulary and complex sentence structures and are designed to convey current events accurately and efficiently.
On average, the conversations had 6 turns, with a minimum of 2 and a maximum of 10. Twenty percent of the conversations had fewer than 5 turns, and 8.6% had over 8 turns. The average speech rate was 140 words per minute (wpm), a normal but slightly fast pace according to the Test Syllabus (120–140 wpm). The average speech rate for each section was 135 wpm for the News section, 155 wpm for the Conversation section and 134 wpm for the Passage section. The relatively fast pace of the conversations showed that speakers talked continuously and at a natural tempo.
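The turn counts and speech rates can be derived from the transcripts and audio durations with straightforward arithmetic, as in the helpers below; the speaker-label convention they assume (e.g., “M:” or “Woman:” opening each turn) is hypothetical.

```python
# Helpers for speech rate (wpm) and dialogue turn counts.
import re

def words_per_minute(transcript: str, duration_seconds: float) -> float:
    # wpm = number of words divided by duration in minutes
    return len(transcript.split()) / (duration_seconds / 60)

def count_turns(transcript: str) -> int:
    # Count lines that open with a speaker label such as "M:" or "Woman:".
    return len(re.findall(r"^\s*[A-Za-z]+:", transcript, flags=re.M))
```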

4.1.2. The Expected Response

The skills assessed primarily focused on understanding explicit information rather than inferencing from implicit information. Specifically, for each section, 98.8%, 95.7%, and 98.8% of questions were relevant to understanding explicit information, while 1.2%, 4.3%, and 1.2% were pertinent to understanding implicit information. According to the Test Syllabus, understanding explicit information involves comprehending the main idea, recognizing important or specific details, and identifying the speaker’s explicitly expressed viewpoints and attitudes. In the category of understanding explicit information, the focus was primarily on recognizing and comprehending important or specific details (93.9%, 93.9% and 97.7% for each section). For instance, Question 22 in the second version of June 2017 asked about the characteristics of mules (“What does the speaker say about mules?” The key is “They have strong muscles.”). Understanding the main idea and identifying the speaker’s explicit viewpoints and attitudes were less frequently asked. In terms of understanding implicit information, test-takers needed to infer from implied meanings, identify the communicative function of utterances and infer the speaker’s attitudes and viewpoints. For instance, in Question 22 of the second version from December 2021 (“What do we learn about the speaker?”), listeners needed to deduce the speaker’s occupation from the passage.

4.2. Comparison of Task Characteristics of the Authentic Test and the GenAI Test

Table 4 shows that the GenAI test was consistent with the Corpus tests in terms of the task characteristics summarized in Table 3. Meanwhile, it also aligned with the task characteristics of the selected Authentic test. The genres of the News and Passage sections were narration and exposition, the most representative genres in these two sections. With parallel topics across the three sections, the same number of conversation turns, the same assessed skills, and similar accents, the GenAI-developed test closely mirrored the Authentic test.

4.3. Students’ Performance on the Two Tests

The statistical analysis of students’ scores on the Authentic test and the GenAI test revealed that the average score on the GenAI test (M = 7.73, SD = 0.174) was higher than that on the Authentic test (M = 4.58, SD = 0.159). Kolmogorov–Smirnov and Shapiro–Wilk tests were used to check the normality of the data distribution. The results indicated that the data were not normally distributed for either the Authentic test, D (295) = 0.118, p < 0.001, W = 0.968, p < 0.001, or the GenAI test, D (295) = 0.092, p < 0.001, W = 0.976, p < 0.001. Because the assumption of normality was violated, Spearman’s rho was employed to examine the correlation. The results revealed a significant yet weak correlation between the two sets of scores (ρ = 0.208, p < 0.001). However, the Wilcoxon signed-rank test indicated a significant difference between the Authentic test and the GenAI-developed test (Z = −11.581, p < 0.001, N = 295).

4.4. Results of the Questionnaire

Students generally perceived listening as a difficult skill. The majority (87.88%) believed they could follow common expressions, short sentences, and daily conversations, but found it hard to understand long, fast, and complex texts in professional contexts, for which they reported needing repeated listening.
Three questions in the questionnaire concerned the listening recording, namely clarity, naturalness, and speech rate. Regarding clarity, the percentages of respondents who believed the recordings were “very clear” were very similar (Authentic, 10.18%; GenAI, 9.43%). The proportion considering the recording “clear” was higher for the Authentic test (54.04%) than for the GenAI test (35.35%), whereas the proportions considering the recording “neutral” or “unclear” were lower for the Authentic test than for the GenAI test (neutral: 28.77% vs. 41.41%; unclear: 7.02% vs. 13.8%). Similarly, the proportions of students who thought the recordings were “very natural” were nearly equal (Authentic, 17.89%; GenAI, 18.18%). More respondents perceived the Authentic test as “natural” compared to the GenAI test (54.04% vs. 47.81%), whereas fewer considered it “moderate” or “unnatural” (24.56% vs. 26.94% and 3.51% vs. 7.07%, respectively). As far as speech rate was concerned, few respondents considered the recordings slow (Authentic, 1.4%; GenAI, 0.67%). The percentages of students choosing “moderate”, “relatively fast”, and “fast” for the Authentic test were 51.58%, 37.89%, and 9.12%, respectively; for the GenAI test, they were 27.61%, 59.93%, and 11.78%. Since the GenAI test was designed according to the statistical results of the Corpus (Table 3), the speech rates of its News and Passage sections were faster than those in the Authentic test (Table 4). The students’ feedback was consistent with the statistical results presented in Table 4. Based on students’ responses, it is evident that the recording of the Authentic test was superior to that of the GenAI test in both clarity and naturalness. This finding might be explained by the fact that the recording of the Authentic test was produced by human speakers.
The questionnaire also encompassed four aspects of the listening text. In terms of content, the proportions of students who thought the listening materials were highly relevant to real-life or learning contexts were roughly similar (Authentic, 6.32%; GenAI, 5.72%). The proportion of respondents who considered the materials “relevant” was higher for the Authentic test than for the GenAI test (57.89% vs. 49.16%), whereas the proportions of those who considered the materials “moderately relevant” or “not relevant” were higher for the GenAI test than for the Authentic test (36.03% vs. 32.63% and 9.09% vs. 3.16%, respectively). Concerning the topic, participants generally thought that the GenAI test was rich in topics and themes, with the percentages of “very rich” and “rich” higher than those of the Authentic test (13.47% vs. 12.63%, 64.98% vs. 62.11%).
It is worth mentioning that, despite scoring higher on the GenAI test than on the Authentic test, students on the whole were inclined to think that the GenAI test was more difficult. This was reflected in the percentages of students rating the GenAI test as “too difficult” and “difficult”, which exceeded those for the Authentic test (22.29% vs. 19.3% and 58.92% vs. 55.79%, respectively). One possible reason for this result is that the fast speech rate of the GenAI test likely contributed to its perceived difficulty, making the listeners feel they had little time to grasp the detailed information. According to the questionnaire, the majority of students thought that both tests could effectively assess their listening ability, with the proportions of “very effective” and “effective” reaching 11.58% and 61.4% for the Authentic test and 11.45% and 63.3% for the GenAI test. In general, students agreed that, to some extent, both tests could reflect their actual listening ability.

5. Discussion

The findings of this study revealed that GenAI tools demonstrated potential across several aspects of listening test development; nevertheless, cautious interpretation and continued robust empirical investigation are needed to validate their effectiveness. Firstly, in terms of content validity, GenAI showed promise in enhancing the efficiency of listening test development, particularly in its capacity to generate listening materials and test items under corpus-informed task characteristics, thereby reducing the time and resources required in conventional test construction. The GenAI-developed listening texts aligned well with authentic test scripts in terms of task input (e.g., genre, topic, and vocabulary) and expected response (e.g., assessment skills). Additionally, the audio recordings produced by MurfAI maintained consistency with authentic test audio in terms of speech rate and accents. This consistency could enhance the authenticity of the test.
Secondly, regarding concurrent validity, while the scores of student participants on the Authentic test and the GenAI-produced test exhibited a significant yet weak correlation, their scores on the GenAI test were higher than on the Authentic test, and the Wilcoxon signed-rank test revealed a statistically significant difference between the two sets of scores. This finding suggested that although the GenAI-generated test was designed to align with authentic task characteristics, the two tests did not elicit fully comparable performance. The use of shortened tests, single-form benchmarking, and differences in item options and students’ proficiency levels may have contributed to the observed discrepancy. The use of shortened tests may have amplified performance differences, as Kruyen et al. (2013) noted that a shorter test version may have a substantial impact on the reliability and validity of the test scores. A second possible reason for the observed score differences lies in single-form concurrent validation. While the GenAI-developed test was constructed based on corpus evidence derived from multiple authentic tests, its performance was compared against only a single, randomly selected authentic test form. The score discrepancy may therefore reflect form-specific variability. A third element that may have played a role is the answer options. Aryadoust et al. (2024) argued that the semantic overlap in GenAI-generated options could indicate self-bias, raising the issue of predictability in answering test items. In this regard, learners could answer items correctly without fully understanding the question, which may affect test validity. While GenAI can emulate authentic tests in content and language features, it is subject to a certain degree of randomness or bias in item design, distractor construction, and difficulty-level stratification. The authentic test items, however, were meticulously and rigorously crafted by a professional test development team and therefore conformed better to the testing standards and requirements. An additional reason is the similar language backgrounds and homogeneous proficiency levels of the test-takers in this study (pre-intermediate to intermediate), which may limit the discriminative power of the test data for learners at other proficiency levels. However, this score difference does not necessarily indicate a lack of validity in the GenAI-developed test; rather, it underscores the need for more empirical studies that include multiple authentic test forms, quality checks of items and options, and participants of varied proficiencies, enabling more robust calibration and validation of full-length GenAI-produced listening tests.
Thirdly, concerning face validity, participants generally held the view that both the Authentic test and the GenAI-created test were capable of effectively evaluating their listening competence. With regard to specific aspects, they reported that the recordings of the Authentic test were clearer and more natural, and its materials closer to daily life. In contrast, the GenAI test demonstrated advantages in terms of topic variety. A possible reason is that the audio recordings of the authentic tests were made by native speakers with professional recording teams. With clear articulation and stable sound quality, such recordings better simulate authentic language use and real-world settings. Although GenAI-generated recordings can approach natural human speech in terms of speech rate and accents, discrepancies may still exist in clarity and naturalness, which could leave students with an acoustically unnatural impression.
Taken together, this study has highlighted the strength of integrating GenAI into listening assessment in improving the efficiency of test development and producing varied and flexible exercises. However, its limitations constrain the generalizability of the findings, and further empirical investigations are required. First, only one GenAI-generated test and one authentic test were compared and evaluated. Further empirical comparisons and investigations are needed to determine the extent to which GenAI-created tests align with established measures and benchmarks. These include using full-length tests, inviting more diverse participant populations, and applying more robust statistical methods, particularly item-level analyses, which would enable finer-grained examination of item difficulty and discrimination. Second, the active participation of human experts, also known as “human-in-the-loop” or human-AI collaboration, remains indispensable for enhancing test validity and effectiveness (Runge et al., 2024; Zhang et al., 2025). The incorporation of reviewers helps ensure high-quality standards for the test (e.g., review of item quality, fairness, bias and audio quality). For example, by modifying the stylistic features of listening materials, optimizing the design of distractors, and adjusting the distribution of item difficulty and discrimination, GenAI-assisted test developers can enhance the degree of alignment between GenAI-developed tests and authentic tests. Third, the exclusive use of a questionnaire to elicit participants’ feedback may offer limited insights into their test-taking processes. Future research could benefit from incorporating complementary qualitative methods. Overall, more diverse groups of participants (e.g., students, teachers, testing experts), as well as quantitative (e.g., psychometric modeling) and qualitative (e.g., interviews, stimulated recalls) methods, are needed to offer a nuanced perspective on how GenAI-generated listening tests resemble or differ from human-crafted ones. More importantly, besides validity, ethical considerations such as data privacy, bias, hallucination and output copyright need to be carefully handled and addressed in future research.

6. Conclusions

The present study examined the ability of GenAI tools to construct a short CET-4 listening test grounded in task characteristics drawn from a corpus of earlier authentic tests, and evaluated how closely its validity indicators (content, concurrent, and face) matched those of an authentic, human-made test. The results suggested that the GenAI-created test demonstrated strong content validity in terms of its close correspondence with the task characteristics of the target test domain, but there was insufficient robust evidence to support its concurrent and face validity. While a correlation was found between students’ scores on the two tests, the difference between the scores was statistically significant. The post-test questionnaire further indicated that although GenAI tools exhibited considerable potential in listening test development, they remained somewhat behind expert performance in designing natural tasks and delivering clear recordings.
Despite these weaknesses, the application of GenAI in listening test development still carries important implications for language teaching, learning and low-stakes classroom evaluation. First, GenAI has the potential to be trained to support language teachers in creating source materials that are closely aligned with curriculum goals and proficiency standards (Yan & Huang, 2025). Given that low-stakes assessments tend to reduce learner anxiety, leveraging GenAI in such a context can help foster a supportive environment in which assessments are perceived as instruments for learning rather than judgments. In addition, as listening instruction has received limited attention in curricula and practice (Y. Liu, 2020), listening tasks with adjustable topics, speeds and accents can support and enhance adaptive and differentiated listening training. However, teachers need to critically evaluate the GenAI-generated content before carefully and ethically integrating it into classroom practice.
As this preliminary, exploratory step has shown, GenAI presents promising possibilities for developing listening tests. With the continuous advancement of AI technology, it is expected to improve efficiency and flexibility in writing listening scripts, constructing test items, and creating audio that closely approximates genuine communicative contexts. Nevertheless, as researchers continue to explore its frontiers in language testing and enjoy the convenience and ease brought by AI in developing, administering, scoring and analyzing tests (Xi, 2023), they should remain aware of its weaknesses and threats, as indicated in the Strengths, Weaknesses, Opportunities, and Threats (SWOT) analysis framework (Farrokhnia et al., 2024; Zhang et al., 2025). Research and refinement should continue, with the ethical and responsible use of AI always remaining in human hands.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/languages11010017/s1, Text S1: The Authentic Test; Text S2: The GenAI Test; Text S3: Questionnaire.

Funding

This research was funded by the 12th China Foreign Language Education Foundation, grant number ZGWYJYJJ12A114.

Institutional Review Board Statement

There is no institutional review board at the author’s university. The study adhered to ethical procedures by inviting two external professors to review and approve the data processing prior to implementation.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available upon request from the author.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GenAI	Generative Artificial Intelligence
CET	College English Test
CET-4	College English Test-Band 4
IELTS	International English Language Testing System
TOEFL	Test of English as a Foreign Language
PTE	Pearson Test of English
AI	Artificial Intelligence
EFL	English as a Foreign Language
LLMs	Large Language Models
NLP	Natural Language Processing
TLU	Target Language Use
CET-SET	College English Test-Spoken English Test
CEFR	Common European Framework of Reference
AUA	Assessment Use Argument
GPA	Grade Point Average
DET	Duolingo English Test
BNC	British National Corpus
COCA	Corpus of Contemporary American English
MICASE	Michigan Corpus of Academic Spoken English
TPO	TOEFL Practice Online
MCQs	Multiple-Choice Questions
wpm	words per minute
PHP	Progressive Hint Prompting
L1	First Language
CSE	China’s Standards of English Language Ability
SWOT	Strengths, Weaknesses, Opportunities, and Threats

References

1. Allen, M. S., Robson, D. A., & Iliescu, D. (2023). Face validity: A critical but ignored component of scale construction in psychological assessment (Vol. 39). Hogrefe Publishing.
2. Aryadoust, V. (2024). Topic and accent coverage in a commercialized L2 listening test: Implications for test-takers’ identity. Applied Linguistics, 45(5), 765–785.
3. Aryadoust, V., & Jia, Y. (2025). Assessing listening skills in SLA. In Reference module in social sciences. Elsevier.
4. Aryadoust, V., & Luo, L. (2023). The typology of second language listening constructs: A systematic review. Language Testing, 40(2), 375–409.
5. Aryadoust, V., Zakaria, A., & Jia, Y. (2024). Investigating the affordances of OpenAI’s large language model in developing listening assessments. Computers and Education: Artificial Intelligence, 6, 100204.
6. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
7. Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford University Press.
8. Bourdeaud’Hui, H., Aesaert, K., & van Braak, J. (2021). Exploring the validity of a comprehensive listening test to identify differences in primary school students’ listening skills. Language Assessment Quarterly, 18(3), 228–252.
9. Buck, G. (2001). Assessing listening. Cambridge University Press.
10. Buck, G. (2018). Preface. In G. J. Ockey, & E. Wagner (Eds.), Assessing L2 listening: Moving towards authenticity (pp. xi–xvi). John Benjamins.
11. Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157.
12. Chuang, P.-L., & Yan, X. (2025). Language assessment in the era of generative artificial intelligence: Opportunities, challenges, and future directions. System, 134, 103846.
13. Cinkara, E., & Özen Tosun, Ö. (2017). Face validity study of a small-scale test in a tertiary-level intensive EFL program. Bartın University Journal of Faculty of Education, 6(2), 395–410.
14. Dominguez Lucio, E., & Aryadoust, V. (2023). Neurocognitive evidence for test equity in an academic listening assessment. Behaviormetrika, 50(1), 155–175.
15. Fan, J., Frost, K., & Jin, Y. (2022). Local English testing in China’s tertiary education: Contexts, policies, and practices. Language Testing, 39(3), 453–473.
16. Farrokhnia, M., Banihashem, S. K., Noroozi, O., & Wals, A. (2024). A SWOT analysis of ChatGPT: Implications for educational practice and research. Innovations in Education and Teaching International, 61(3), 460–474.
17. Field, J. (2013). Cognitive validity. In A. Geranpayeh, & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (Vol. 35, pp. 77–151). Cambridge University Press.
18. Gardner, J., O’Leary, M., & Yuan, L. (2021). Artificial intelligence in educational assessment: ‘Breakthrough? Or buncombe and ballyhoo?’. Journal of Computer Assisted Learning, 37(5), 1207–1216.
19. Goh, C. C. M., & Aryadoust, V. (2025). Developing and assessing second language listening and speaking: Does AI make it better? Annual Review of Applied Linguistics, 45, 179–199.
20. Gu, X., & Li, Y. (2012). Longitudinal analysis of the task characteristics of the input and the expected response of the CET listening test. Foreign Language Testing and Teaching, (3), 17–26.
21. Hao, J., von Davier, A. A., Yaneva, V., Lottridge, S., von Davier, M., & Harris, D. J. (2024). Transforming assessment: The impacts and implications of large language models and generative AI. Educational Measurement: Issues and Practice, 43(2), 16–29.
22. He, L., & Jiang, Z. (2020). Assessing second language listening over the past twenty years: A review within the socio-cognitive framework. Frontiers in Psychology, 11, 2123.
23. He, R., Cao, J., & Tan, T. (2025). Generative artificial intelligence: A historical perspective. National Science Review, 12(5), nwaf050.
24. Hughes, A., & Hughes, J. (2020). Testing for language teachers (3rd ed.). Cambridge University Press.
25. Isaacs, T., Hu, R., Trenkic, D., & Varga, J. (2023). Examining the predictive validity of the Duolingo English test: Evidence from a major UK university. Language Testing, 40(3), 748–770.
26. Jin, T., & Lu, X. (2023). Eng-Editor: An online Chinese text evaluation and adaptation system. Available online: https://www.languagedata.net/tester/ (accessed on 25 June 2025).
27. Jin, Y. (2019). Testing tertiary-level English language learners: The College English Test in China. In L. I.-W. Su, C. J. Weir, & J. R. W. Wu (Eds.), English language proficiency testing in Asia: A new paradigm bridging global and local contexts (pp. 101–130). Routledge.
28. Jin, Y. (2022). Consequential research of accountability testing: The case of the CET. Language Testing in Asia, 12(1), 15.
29. Jin, Y., & Cheng, L. (2013). The effects of psychological factors on the validity of high-stakes tests. Modern Foreign Languages, 36(1), 62–69.
30. Jin, Y., Jie, W., & Wang, W. (2022). Exploring the alignment between the College English Test and language standards. Foreign Language World, 209(2), 18–26.
31. Jin, Y., & Wu, E. (2017). An argument-based approach to test fairness: The case of multiple-form equating in the College English Test. International Journal of Computer-Assisted Language Learning and Teaching, 7(3), 58–72.
32. Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
33. Karatay, Y., & Xu, J. (2025). Exploring the potential of conversational AI for assessing second language oral proficiency. TESOL Quarterly, 59(S1), 220–250.
34. Kruyen, P. M., Emons, W. H. M., & Sijtsma, K. (2013). On the shortcomings of shortened tests: A literature review. International Journal of Testing, 13(3), 223–248.
35. Li, J., Huang, J., & Sheeran, T. (2025). ChatGPT4o as an AI peer assessor in EFL speaking classrooms: Examining scoring reliability and feedback effectiveness. SAGE Open, 15(3), 21582440251369938.
36. Lin, Z., & Chen, H. (2024). Investigating the capability of ChatGPT for generating multiple-choice reading comprehension items. System, 123, 103344.
37. Liu, X., Huang, J., Deng, Y., & Spiridakis, J. (2025). AI versus human assessment in EFL speaking classrooms: A comparative study in China. Computer Assisted Language Learning, 1–29.
38. Liu, Y. (2020). Effects of metacognitive strategy training on Chinese listening comprehension. Languages, 5(2), 21.
39. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). Macmillan Publishing.
40. Ministry of Education of the People’s Republic of China, State Language Commission. (2024). China’s standard of English language ability. Shanghai Foreign Language Education Press.
41. Mizumoto, A., Shintani, N., Sasaki, M., & Teng, M. F. (2024). Testing the viability of ChatGPT as a companion in L2 writing accuracy assessment. Research Methods in Applied Linguistics, 3(2), 100116.
42. Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59–82.
43. National College English Testing Committee. (2016). The national College English Test syllabus (2016 revised ed.). Available online: https://cet.neea.edu.cn/html1/folder/16113/1588-1.htm (accessed on 18 August 2025).
44. Nguyen, P. H. T. (2022). Investigating the content validity of the IELTS listening test through the use of lexical bundles [Master’s dissertation, Nottingham Trent University].
45. Nishizawa, H. (2023). Construct validity and fairness of an operational listening test with world Englishes. Language Testing, 40(3), 493–520.
46. Ockey, G. J. (2024). Assessing listening. In E. Wagner, A. O. Batty, & E. Galaczi (Eds.), The Routledge handbook of second language acquisition and listening (pp. 230–240). Routledge.
47. Ockey, G. J., & Wagner, E. (2018). Assessing L2 listening: Moving towards authenticity. John Benjamins Publishing Company.
48. Park, K. (2014). Corpora and language assessment: The state of the art. Language Assessment Quarterly, 11(1), 27–44.
49. Qiu, Y., & Aryadoust, V. (2024). The predictive value of gaze behavior and mouse-clicking in testing listening proficiency: A sensor technology study. System, 126, 103440.
50. Riazi, M. (2013). Concurrent and predictive validity of Pearson Test of English Academic (PTE Academic). Papers in Language Testing and Assessment, 2(2), 1–27.
51. Ronge, R., Maier, M., & Rathgeber, B. (2025). Towards a definition of generative artificial intelligence. Philosophy & Technology, 38(1), 31.
  52. Runge, A., Attali, Y., LaFlair, G. T., Park, Y., & Church, J. (2024). A generative AI-driven interactive listening assessment task. Frontiers in Artificial Intelligence, 7, 1474019. [Google Scholar] [CrossRef]
  53. Saricaoglu, A., & Bilki, Z. (2025). The capacity of ChatGPT-4 for L2 writing assessment: A closer look at accuracy, specificity, and relevance. Annual Review of Applied Linguistics, 45, 253–273. [Google Scholar] [CrossRef]
  54. Sato, T., & Ikeda, N. (2015). Test-taker perception of what test items measure: A potential impact of face validity on student learning. Language Testing in Asia, 5(1), 10. [Google Scholar] [CrossRef]
  55. Schmidt, E., & Holzknecht, F. (2024). Investigating listening through technology. In E. Wagner, A. O. Batty, & E. Galaczi (Eds.), The Routledge handbook of second language acquisition and listening (pp. 357–367). Routledge. [Google Scholar] [CrossRef]
  56. Tao, X., & Aryadoust, V. (2024). A multidimensional analysis of a high-stakes English listening test: A corpus-based approach. Education Sciences, 14, 2. [Google Scholar] [CrossRef]
  57. Taylor, L. (2013). Introduction. In A. Geranpayeh, & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (Vol. 35, pp. 1–35). Cambridge University Press. [Google Scholar]
  58. Vandergrift, L. (2007). Recent developments in second and foreign language listening comprehension research. Language Teaching, 40(3), 191–210. [Google Scholar] [CrossRef]
  59. Wagner, E. (2013). Assessing listening. In A. J. Kunnan (Ed.), The companion to language assessment (pp. 47–63). [Google Scholar] [CrossRef]
  60. Wagner, E. (2022). Assessing listening. In G. Fulcher, & L. Harding (Eds.), The Routledge handbook of language testing (2nd ed., pp. 223–235). Routledge. [Google Scholar]
  61. Wang, J., Zheng, Y., & Zou, Y. (2024). Face validity and washback effects of the shortened PTE Academic: Insights from teachers in Mainland China. Language Testing in Asia, 14(1), 32. [Google Scholar] [CrossRef]
  62. Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan. [Google Scholar] [CrossRef]
  63. Xi, X. (2023). Advancing language assessment with AI and ML–leaning into AI is inevitable, but can theory keep up? Language Assessment Quarterly, 20(4–5), 357–376. [Google Scholar] [CrossRef]
  64. Xu, J., Zhao, C., & Sun, M. (2024). Applications of large language models in foreign language teaching and research. Foreign Language Teaching and Research Press. [Google Scholar]
  65. Yan, X., & Huang, B. H. (2025). Generative AI for the teaching, learning, and assessment of productive skills: An evidence-based approach to understanding its real impact. TESOL Quarterly, 59(S1), 5–18. [Google Scholar] [CrossRef]
  66. Zhang, T., Erlam, R., & de Magalhães, M. B. (2025). Exploring the dual impact of AI in post-entry language assessment: Potentials and pitfalls. Annual Review of Applied Linguistics, 45, 274–293. [Google Scholar] [CrossRef]
  67. Zhao, Y., & Aryadoust, V. (2025). An automatized semantic analysis of two large-scale listening tests: A corpus-based study. Language Testing, 42(3), 312–343. [Google Scholar] [CrossRef]
  68. Zheng, Y., & Cheng, L. (2008). Test review: College English Test (CET) in China. Language Testing, 25(3), 408–417. [Google Scholar] [CrossRef]
  69. Zuhairoh, Z., Syafa’ah, N., & Kurniati, D. (2024). Content and face validity analysis on 9th grade final test items for secondary school level. Prominent Journal of English Studies, 7(1), 21–28. [Google Scholar] [CrossRef]
Table 1. CET-4 listening test corpus.

| Test  | Number of Tests | Section         | Texts per Test | Questions per Test | Duration per Test |
|-------|-----------------|-----------------|----------------|--------------------|-------------------|
| CET-4 | 35              | 1. News         | 3              | 7                  | 25 min            |
|       |                 | 2. Conversation | 2              | 8                  |                   |
|       |                 | 3. Passage      | 3              | 10                 |                   |
| Total |                 |                 | 280            | 875                | 875 min           |
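The Total row aggregates over all 35 tests. As a sanity check on the reconstructed figures, a few lines of hypothetical Python (not part of the study's instruments) make the arithmetic explicit:

```python
# Sanity check of the Table 1 totals (hypothetical script, not part of
# the study's tooling): 35 tests, each with 3 news items, 2 conversations,
# 3 passages, 25 questions, and 25 minutes of audio.
TESTS = 35
TEXTS_PER_TEST = 3 + 2 + 3        # news + conversation + passage texts
QUESTIONS_PER_TEST = 7 + 8 + 10   # questions across the three sections
MINUTES_PER_TEST = 25

assert TESTS * TEXTS_PER_TEST == 280      # total texts in the corpus
assert TESTS * QUESTIONS_PER_TEST == 875  # total questions
assert TESTS * MINUTES_PER_TEST == 875    # total minutes of listening
print("Table 1 totals verified")
```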
Table 2. Coding and analysis framework of the CET-4 listening test corpus.

| Aspect | Dimension | Category | Section | Coding |
|---|---|---|---|---|
| Input | Genre | narration, exposition, argumentation, description, and practical writing | 1, 3 | ChatGPT 4o, human |
| Input | Topic | life and emotions; society and current affairs; education and work; health and medicine; science and innovation; environment and ecology; economy and business; history and geography; culture and arts | 1, 2, 3 | ChatGPT 4o, human |
| Input | Vocabulary | token (the number of total words in a text); type (the number of unique words in a text); frequency (how often a specific word or lexical item appears in a text) | 1, 2, 3 | VocabProfilers 1 |
| Input | Difficulty | lexical, syntactic, and textual | 1, 2, 3 | Eng-Editor 2 |
| Input | Conversational turns | number of turns in the dialogue | 2 | human |
| Input | Speech rate | words per minute | 1, 2, 3 | human |
| Expected response | Assessment skills | understanding explicit information (comprehending the main idea, recognizing important or specific details, and identifying the speaker’s explicitly expressed viewpoints, attitudes, etc.); understanding implicit meaning (inferring implied meanings, identifying the communicative function of utterances, and inferring the speaker’s attitudes, viewpoints, etc.); using linguistic features to comprehend listening materials (distinguishing phonological features and understanding inter-sentential relationships); employing listening strategies 3 | 1, 2, 3 | ChatGPT 4o, human |

1 https://www.lextutor.ca/vp/comp/ (accessed on 23 March 2025); 2 languagedata.net/tester (T. Jin & Lu, 2023); 3 Test Syllabus (National College English Testing Committee, 2016).
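For readers unfamiliar with vocabulary profiling, the sketch below illustrates how the token, type, and frequency-band coverage measures in Table 2 can be derived from a transcript. It is a minimal illustration only, assuming a plain-text input and a hypothetical `freq_rank` dictionary that maps each word to its frequency rank (e.g., from a BNC/COCA wordlist); it is not the VocabProfilers or Eng-Editor implementation used in the study.

```python
import re
from collections import Counter

def vocab_profile(text: str, freq_rank: dict[str, int], band: int = 4000):
    """Return token count, type count, and the proportion of running words
    covered by the `band` most frequent words in `freq_rank`."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer
    tokens, types = len(words), len(set(words))
    counts = Counter(words)
    # Words absent from the list are treated as off-list (beyond the band).
    in_band = sum(n for w, n in counts.items()
                  if freq_rank.get(w, band + 1) <= band)
    coverage = in_band / tokens if tokens else 0.0
    return tokens, types, coverage

# Toy usage with a hypothetical four-word frequency list:
rank = {"the": 1, "storm": 950, "hit": 620, "coast": 2100}
print(vocab_profile("The storm hit the coast.", rank))  # (5, 4, 1.0)
```

A coverage figure such as "the first 4000 words covered 98.5%" in Table 3 is the `coverage` value computed over the 4000-word band, expressed as a percentage.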
Table 3. Task characteristics of the CET-4 listening corpus.

| Section | News | Conversation | Passage |
|---|---|---|---|
| Genre | Narration (52.4%); Exposition (45.7%); Argumentation (1.9%) | / | Exposition (66.7%); Argumentation (18.1%); Narration (15.2%) |
| Topic | Society and current affairs (33.3%); Life and emotions (18%) | Life and emotions (40%); Education and work (25.7%) | Life and emotions (25.7%); Education and work (15.2%) |
| Vocabulary: token | 173 | 274 | 241 |
| Vocabulary: type | 107 | 146 | 136 |
| Vocabulary: frequency | The first 4000 words covered 98.5% | The first 3000 words covered 99.2% | The first 4000 words covered 98.6% |
| Difficulty 1: lexical | 5.15 | 4.15 | 4.90 |
| Difficulty 1: syntactic | 4.02 | 3.59 | 4.14 |
| Difficulty 1: textual | 5.13 | 4.11 | 4.88 |
| Conversational turns | / | 6 | / |
| Speech rate (words per minute) | 135 | 155 | 134 |
| Assessment skills | Understanding explicit information (98.8%); understanding implicit meaning (1.2%) | Understanding explicit information (95.7%); understanding implicit meaning (4.3%) | Understanding explicit information (98.8%); understanding implicit meaning (1.2%) |

1 The difficulty levels are aligned with Levels 3 to 7 of China’s Standards of English Language Ability, or CSE (Ministry of Education of the People’s Republic of China, State Language Commission, 2024). The CSE (Levels 1–9) is the first full-range English proficiency scale in China, designed to provide a sound basis for exam reform, test development, and quality control. Level 3 (High School Entrance Examination, namely Zhongkao): 3.00–3.99; Level 4 (National College Entrance Examination, namely Gaokao): 4.00–4.99; Level 5 (CET-4): 5.00–5.99; Level 6 (CET-6): 6.00–6.99; Level 7 (National Postgraduate Entrance Examination): 7.00–7.99.
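The banding in the footnote amounts to taking the integer part of an Eng-Editor difficulty score. A minimal sketch (hypothetical code, assuming scores fall within the 3.00–7.99 range the footnote describes):

```python
import math

def cse_level(score: float) -> int:
    """Map an Eng-Editor difficulty score onto the CSE band described in
    the footnote (3.00-3.99 -> Level 3, ..., 7.00-7.99 -> Level 7)."""
    if not 3.0 <= score < 8.0:
        raise ValueError("score outside the banded range (3.00-7.99)")
    return math.floor(score)

# e.g., the corpus-wide lexical difficulty of the News section (5.15):
print(cse_level(5.15))  # -> 5, i.e., CSE Level 5, the CET-4 band
```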
Table 4. Task characteristics of the Authentic test and the GenAI test.

| Section | Authentic: News | Authentic: Conversation | Authentic: Passage | GenAI: News | GenAI: Conversation | GenAI: Passage |
|---|---|---|---|---|---|---|
| Genre | Narration | / | Exposition | Narration | / | Exposition |
| Topic | Fire | Teaching children how to save and spend their money | Personal space | Storm | Teaching children about healthy eating habits | Body language |
| Vocabulary: token | 156 | 283 | 240 | 171 | 278 | 235 |
| Vocabulary: type | 92 | 165 | 134 | 130 | 158 | 147 |
| Vocabulary: frequency | The first 5000 words covered 98.1% | The first 3000 words covered 98.2% | The first 4000 words covered 98.8% | The first 4000 words covered 98.2% | The first 3000 words covered 98.9% | The first 4000 words covered 98.7% |
| Difficulty: lexical | 6.66 | 4.93 | 4.99 | 5.35 | 4.20 | 5.02 |
| Difficulty: syntactic | 3.87 | 3.74 | 3.63 | 4.15 | 3.74 | 4.43 |
| Difficulty: textual | 6.17 | 4.95 | 4.94 | 5.08 | 4.28 | 4.98 |
| Conversational turns | / | 5.5 | / | / | 5.5 | / |
| Speech rate (words per minute) | 129 | 161 | 119 | 135 | 154 | 132 |
| Assessment skills | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) | Understanding explicit information (important or specific details) |
| Accent | American (female) | British (male)/American (female) | American (female) | American (female) | British (male)/American (female) | British (male) |