Article

An Intelligent English-Speaking Training System Using Generative AI and Speech Recognition

1 Department of Communications Engineering, Feng Chia University, Taichung City 407102, Taiwan
2 Department of Computer Science, National Chengchi University, Taipei City 116011, Taiwan
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 189; https://doi.org/10.3390/app16010189
Submission received: 12 November 2025 / Revised: 20 December 2025 / Accepted: 21 December 2025 / Published: 24 December 2025
(This article belongs to the Section Applied Neuroscience and Neural Engineering)

Featured Application

This system can be applied to AI-assisted English-speaking training for non-native speakers, enabling users to improve oral proficiency through interactive virtual agents, real-time speech recognition, automated scoring, and personalized feedback in a low-anxiety learning environment.

Abstract

English is the first foreign language most Taiwanese encounter, yet few achieve proficient speaking skills. This paper presents a generative AI-based English-speaking training system designed to enhance oral proficiency through interactive AI agents. The system employs ChatGPT version 5.2 to generate diverse and tailored conversational scenarios, enabling learners to practice in contextually relevant situations. Spoken responses are captured via speech recognition and analyzed by a large language model, which provides intelligent scoring and personalized feedback to guide improvement. Learners can automatically generate scenario-based scripts according to their learning needs, and the D-ID AI system then produces a virtual character of the AI agent whose lip movements are synchronized with the conversation, creating realistic video interactions. Because learners practice with an AI agent, the system maintains controlled emotional expression, reduces communication anxiety, and helps learners adapt to non-native interaction, fostering more natural and confident speech production. Accordingly, the proposed system supports compelling, immersive, and personalized language learning. Experimental results indicate that repeated practice with the proposed system substantially improves English-speaking proficiency.

1. Introduction

In Taiwan, English is the primary foreign language to which most people are exposed. Globally, it plays an important role in international communication. Although English is part of the compulsory education program in Taiwan, many people still feel uncomfortable using it in real-life situations. The main reason is that English education tends to focus on exams, so learners have few opportunities for conversation practice. As a result, most learners develop a fear of making mistakes, which leads to a lack of confidence in their speaking ability. Many people refrain from speaking English because they do not want others to know that they cannot speak it well. Therefore, in this study, English is used as the primary training language, and it is expected that AI technology and human–machine collaboration [1,2,3,4,5] can help native Chinese speakers practice speaking English without the fear of making mistakes, which is very helpful for improving their English-speaking ability.
The oral training system developed in this research uses GPT to generate the English content of the course dialogs. It then uses D-ID [6] to create a virtual character of the AI agent whose lip movements match the spoken English content, thereby providing a realistic simulation experience. The virtual character maintains a neutral emotional expression to preserve realism while avoiding excessive emotional display. These pre-recorded dialog videos featuring lifelike talkers help users alleviate the tension of communicating in a foreign language. The training environment is built with AI technology, and the user and the AI agent engage in conversations on specific topics to achieve the training's purpose. When the user answers, speech recognition converts the spoken response into text and feeds it into the system, which then uses AI technology to conduct intelligent scoring and generate personalized improvement suggestions. This feedback provides the user with clear guidance for improvement, allowing them to gradually become accustomed to communicating with others in English and ultimately promoting more natural speech production.
When a user responds to a lesson item in each scenario, GPT instantly scores the response and gives suggestions. Scoring considers two aspects: semantic correctness and keyword correctness. After the response is compared with the reference answer, the higher of the two scores is taken as the final score. Suggestions based on the answers help users quickly grasp the direction of improvement and make progress more efficiently [7]. When a user completes a conversation, all the answers, scores, and suggestions are compiled into a learning record, and an Excel file is generated. This file can be viewed by clicking the Learning Record button on the interface and used as a reference for improvement in the next exercise.
The primary target audience of the system implemented in this study is high school students and above. It is expected that features such as a simple interface, quick start-up, and electronic teaching materials will enhance the convenience of English oral practice and increase the willingness to use the system [8]. Experimental results show that English-speaking ability significantly improves after repeated practice with the system.
The novelty of this study lies in its integrated design, which combines user-defined learning topics and content with the automatic generation of learning dialogs and dialog-based practice videos. The system further incorporates speech recognition to transcribe learners’ spoken responses and leverages generative AI in conjunction with an instructor-defined rubric table to provide automated analytical feedback and quantitative scoring. This end-to-end framework enables personalized, adaptive, and scalable English speaking practice with immediate, structured assessment.
The rest of this paper is organized as follows. Section 2 introduces the related works of this study. Section 3 introduces the proposed generative AI English-speaking training system. Section 4 shows the experimental results, and a conclusion is drawn in Section 5.

2. Related Works

To understand how artificial intelligence (AI) is reshaping language education, this review synthesizes empirical studies, conceptual analyses, and theoretical frameworks published in the past decade. Studies were grouped into five major thematic clusters: (1) AI frameworks and applications in education, (2) empirical evidence on AI’s impact on learning and motivation, (3) conceptual analyses of ChatGPT (OpenAI, San Francisco, CA, USA) and large language models (LLMs), (4) AI applications in computer-assisted language learning (CALL), and (5) perspectives on Generative AI (GenAI) in higher education. Across themes, this review highlights converging findings, methodological limitations, and critical research gaps requiring future investigation.

2.1. AI Frameworks and Applications in Education

A substantial body of research has focused on integrating AI technologies, including natural language processing (NLP), speech recognition, and intelligent tutoring systems, into pedagogical frameworks [5,9,10,11,12]. Studies such as those by Jawaid et al. [9], which build on the Jawaid TESOL Benchmarking Model [10], demonstrate how AI can support outcome-based, personalized English language instruction, particularly in English for Specific Purposes (ESP). Likewise, Rohmiyati [11] emphasized the transformative potential of chatbots, adaptive applications, and NLP-driven platforms for facilitating flexible practice in speaking, writing, and listening. These studies collectively underscore a shift toward structured AI-integrated instructional design, emphasizing personalized learning paths, real-time feedback, and data-driven decision-making.
The reviewed frameworks clarify how AI can be aligned with curriculum goals, supporting scalable and inclusive language learning. Concerns remain regarding data privacy, technological inequity, and teacher readiness for AI-enhanced pedagogy.
Research Gap: Despite advancements in conceptual models, the field lacks empirical validation of AI-integrated instructional frameworks, particularly comparative studies examining how different AI technologies perform across diverse learner groups and educational contexts.

2.2. Empirical Evidence on Learning Outcomes and Motivation

Empirical studies offer insight into how AI tools influence language achievement, motivation, and autonomy [13,14,15,16]. Wei [13] employed pre- and post-testing, as well as interviews, to demonstrate that AI-mediated platforms enhance English achievement, L2 motivation, and self-regulated learning. Ali et al. [14] surveyed learners and instructors and found that ChatGPT improved reading and writing motivation, though its influence on listening and speaking remained ambiguous. Evidence suggests that AI systems can foster personalized learning experiences and enhance key motivational factors.
Mixed-methods designs and statistical comparisons provide credible insights into learner engagement and achievement. Small sample sizes, single-site studies, convenience sampling, and heavy reliance on self-reported perceptions limit generalizability.
Research Gap: Existing evidence is limited to short-term outcomes. There is a lack of longitudinal, large-scale experimental studies examining how sustained AI use affects language development and motivational trajectories over time.

2.3. Conceptual Analyses of ChatGPT and Large Language Models

With the rapid advancement of ChatGPT and other LLMs, numerous conceptual studies have examined their pedagogical potential [17,18,19]. Koraishi [17] described how ChatGPT can assist with material development, personalized feedback, and automated assessment. These studies emphasize the enhanced natural language generation and human-like interactivity delivered by newer LLM architectures. Conceptual analyses highlight the potential instructional roles of LLMs but lack empirical evaluation.
They clearly articulate emerging use cases and map the capabilities of the newest LLMs to pedagogical tasks. Most works remain theoretical, raising concerns about accuracy, hallucination, bias, and alignment with curricular objectives.
Research Gap: There is a critical need for empirical studies evaluating how LLM-generated content or feedback affects actual learning performance, teacher workload, and classroom interaction dynamics.

2.4. AI in CALL: Technological Trends and Pedagogical Implications

Reviews by Son et al. [20] and Qiao & Zhao [21] synthesize research on AI applications within computer-assisted language learning (CALL), covering NLP-based tools, automated writing evaluation, computerized dynamic assessment, intelligent tutoring systems, ASR-based feedback, and chatbots. The literature highlights a diversified ecosystem of AI tools that support autonomy, adaptive learning, and data-driven instruction.
Their broad coverage provides a comprehensive overview of emerging AI technologies in language learning. Because most evidence is derived from secondary data, the long-term efficacy of these systems remains uncertain.
Research Gap: Despite extensive reviews, few studies offer quantitative meta-analytic evidence on AI’s impact across skill domains, nor do they examine how AI integrates with teacher expertise and classroom ecology.

2.5. Generative AI Frameworks and Ethical Challenges

Recent studies examine the educational implications of Generative AI [22,23,24,25,26,27]. Ghafar et al. [22] focused on the personalization and accessibility enabled by GenAI platforms. At the same time, Tapalova & Zhiyenbayeva [23] proposed a multi-component AIEd framework incorporating chatbots, expert systems, and virtual environments. Alier et al. [24] conceptualized GenAI’s roles in content creation and automated assessment. Across studies, GenAI is viewed as a catalyst for adaptive learning and multimodal instructional design.
These works provide foundational frameworks for integrating GenAI into educational practice. Most studies are theoretical, with little empirical evidence and insufficient discussion of long-term ethical, privacy, and dependency concerns.
Research Gap: The field lacks empirical investigations into how GenAI alters cognitive load, academic integrity, and teacher–student dynamics, as well as practical strategies for balancing automation with human-centered pedagogy.

2.6. Higher Education Perspectives on ChatGPT and GenAI

A growing line of research explores perceptions and adoption patterns of GenAI in higher education [28,29,30,31,32,33,34]. Baidoo-anu and Ansah [28] and Ali and Wardat [29] highlighted opportunities for personalized instruction and the ethical adoption of AI. Nikolopoulou [30] explored ChatGPT as a research assistant, while Alkolaly et al. [31] surveyed student and teacher attitudes toward GenAI. Conceptual works by Creely [32] and Kasimova [33] emphasized the need to balance benefits with risks such as bias, inaccuracy, and ethical concerns. Studies converge on the view that GenAI offers pedagogical value but raises unresolved questions about accuracy and ethics.
A timely exploration of GenAI adoption provides foundational knowledge for informed policy and practice. Most studies are perception-based, conceptual, or exploratory, lacking real-world classroom experiments.
Research Gap: There is an urgent need for experimental studies assessing how GenAI affects academic performance, cognitive effort, integrity, and student trust, as well as cross-cultural comparisons of adoption patterns.
While studies such as Jawaid et al. [9] and Wei [13] focus on structured frameworks, pre-/post-test designs, or the motivational effects of AI tools, and others like Koraishi [17] or Kasimova [33] examine conceptual or ChatGPT-based interventions, none combine dynamic scenario generation, real-time speech analysis, personalized feedback, and synchronized video-based virtual agents into a single, interactive system. This integration addresses the cognitive and affective dimensions of language learning, reducing anxiety through realistic yet controlled interlocutor simulations while providing contextually relevant practice and individualized guidance. Conventional CALL systems [21] typically rely on fixed, predetermined curricular materials, which restrict learners to a limited set of scenarios that may not align with their immediate communicative needs. To address this limitation, the proposed system integrates generative AI to enable a novel and highly flexible learning paradigm. Learners can input keywords, phrases, or contextual requirements based on their real-world communication needs, along with the desired number of dialog turns. The system then automatically generates a corresponding dialog script, which users may further revise or fine-tune. Once the learner confirms the script, the system automatically produces a video-based dialog lesson featuring virtual characters, enabling immersive and context-specific oral practice. This generative training mechanism enables learners to design personalized learning content, adapt to various conversational scenarios, and practice English in a highly relevant, diverse, and practical manner.
The major contributions of this study are as follows:
  • Scenario-driven conversational practice with ChatGPT: The system generates diverse and contextually tailored conversational scenarios, allowing learners to practice English in realistic and relevant situations, enhancing engagement and targeted skill development.
  • Intelligent assessment and personalized feedback: Spoken learner responses are processed through speech recognition and analyzed by a large language model to provide automated scoring and individualized guidance, supporting continuous improvement in oral proficiency.
  • Immersive AI-agent interaction via D-ID virtual characters: The system employs D-ID AI to create realistic virtual characters with synchronized lip movements, simulating a non-live interlocutor that reduces anxiety, maintains controlled emotional expression, and promotes natural and confident speech production.

3. Proposed Generative AI English-Speaking Training System

The proposed generative AI English-speaking training system creates a stress-free environment for English-speaking practice, encouraging users to speak naturally and improve their overall English proficiency. The system aims to create an oral learning environment using novel AI technology, enabling users to speak English confidently without worrying about making mistakes. The system uses generative AI to evaluate whether a response is appropriate for the given situation, to generate video of the AI agent avatar and the conversation, and to offer suggestions for English responses; it also provides flexible scoring of the conversation, automatic archiving of learning records, graphical progress curves, and a simple, user-friendly interface.
The system combines various technologies and provides a GUI for straightforward use. D-ID generates the dialog video of the AI agent, and the virtual human figure created through its web service serves as the core component of the oral training system, enhancing the realism of the dialog and making the interactive process more memorable for the user. After the user answers a question, the response is recognized and entered into the system's answer area, and ChatGPT performs intelligent scoring and suggestion generation.
Figure 1 shows the architecture of the proposed system. After the user starts the system, the training begins when the user enters their name and selects a conversation situation. Upon pressing the play button, the video image of the AI agent begins to play according to the selected situation. When the user responds to the AI agent’s question, the system uploads the received audio content to the cloud for speech recognition and processing. The system will pass the answer to the generative AI for intelligent scoring, providing the scoring basis and suggestions for improvement. Upon completion of the whole conversation, the system automatically generates an Excel file to store the responses and scoring data for the learning process. Users can press the Exercise Results button in the interface to view the responses, scores, and personalized suggestions for each training session, which can be used as a reference for improvement in the next exercise. Through repeated training, users can improve their English-speaking skills.
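The overall flow described above can be summarized in a short sketch. The following Python pseudocode is a minimal illustration, not the authors' implementation; the helper functions are placeholder stubs standing in for the D-ID playback, cloud speech recognition, and GPT scoring components described in this section.

```python
# High-level sketch of one training session; the stubbed helpers are placeholders
# for the D-ID playback, cloud speech recognition, and GPT scoring components.
from dataclasses import dataclass

@dataclass
class DialogTurn:
    index: int
    video_file: str
    reference_answer: str

def play_agent_video(path: str) -> None: ...                 # placeholder: play D-ID video
def record_audio() -> bytes: return b""                      # placeholder: microphone capture
def recognize_speech(audio: bytes) -> str: return ""         # placeholder: cloud speech recognition
def score_response(text: str, ref: str) -> tuple[int, str]:  # placeholder: GPT scoring
    return 0, ""

def run_training_session(user: str, turns: list[DialogTurn]) -> list[tuple]:
    records = []
    for turn in turns:
        play_agent_video(turn.video_file)
        for attempt in range(3):                              # up to three attempts per turn
            text = recognize_speech(record_audio())
            score, comment = score_response(text, turn.reference_answer)
            records.append((turn.index, attempt + 1, text, score, comment))
            if score >= 60:                                   # passing threshold
                break
    return records                                            # later written to an Excel file
```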

3.1. System Features

The proposed generative AI English-speaking training system integrates multiple functions to create an immersive and practical language learning experience. First, it provides realistic conversational videos generated with D-ID, simulating everyday English-speaking scenarios to enhance user engagement. Second, speech recognition is implemented by capturing audio through a microphone and processing it with cloud-based recognition services, which enables the accurate transcription of user responses. Third, based on the recognized text, ChatGPT offers flexible scoring and constructive feedback to guide learners toward improvement. Fourth, the system records the entire learning process and statistical data in Excel files for easy review and analysis. Fifth, after multiple practice sessions, it generates personalized learning curves based on the user’s average performance, which can be viewed through the learning history function. Finally, the system can generate customized training content by creating dialog scripts according to user-defined topics and the number of exchanges, storing them as training files.
Figure 2 shows the user interface. Area A is the image area of the dialog video; Area B allows the user to enter their name and click to start the lesson; Area C provides menus for selecting the topic and difficulty level of the dialog; Area D provides the user with relevant prompts; Area E displays the text recognized after the user replies through the microphone; Area F shows the GPT score for each answer; Area G displays the relevant comments generated by the AI; Area H contains the start and pause controls for the lesson, which is started by typing in the user's name and can be paused and resumed at any time; and Area I provides a view of the scoring results and an end button.
Figure 3 shows the flowchart of the proposed system. First, the user logs in with their name and selects the topic of conversation. If the user chooses to customize the topic, they must enter the topic and the number of sentences; the system then generates a conversation script using GPT, and after the script is modified, video generation can begin. If the user does not customize, a built-in topic can be selected. After the user's name is entered and the conversation topic is selected, the “Start lesson” button is activated, and the training begins after clicking the button. The AI agent then starts to play and interact with the user, capturing the user's voice through the microphone. The received speech is uploaded to the cloud for speech recognition, and a Large Language Model (LLM) such as ChatGPT then provides flexible scores and suggestions. Finally, the system compiles all the attempted phrases, scores, and suggestions into an Excel file, which can be viewed by the user through the Learning Portfolio button and used as a reference for evaluating their learning outcomes.
The system presents a simulated conversation through D-ID-generated video: the AI agent speaks with lip movements that match the dialog content, simulating everyday spoken English situations so that the user becomes more immersed in the training process and retains a deeper impression. Subtitles for the current dialog are provided below the video, allowing users with limited English listening ability to practice speaking English.
After the microphone captures the user's voice, the recording is uploaded to the cloud for speech recognition and entered into the response area. Finally, the client receives the recognition result and displays it in Area E of Figure 2, allowing the user to view the speech recognition result.
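As a concrete illustration, the capture-and-recognize step could be implemented as in the sketch below. The paper does not name the specific cloud recognition service, so the open-source SpeechRecognition package with Google's free web recognizer is assumed here purely for demonstration.

```python
# Minimal sketch of microphone capture plus cloud speech recognition.
# The choice of Google's free web recognizer is an assumption; the paper
# only states that recognition is performed in the cloud.
import speech_recognition as sr

def capture_and_recognize() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # reduce background noise
        audio = recognizer.listen(source)             # record the learner's reply
    try:
        return recognizer.recognize_google(audio, language="en-US")
    except sr.UnknownValueError:
        return ""                                      # nothing intelligible was heard
```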
After the speech recognition result is entered into the system, ChatGPT gives a flexible score; if the score is 60 or above, the conversation can continue. Otherwise, the attempt is a failure, and the AI agent asks the user to try again; the user has three chances to produce a correct answer. In addition, ChatGPT generates answer suggestions alongside the score. When the user's answer meets the scoring criteria, the GPT-generated suggestions are displayed in the comment area for the user's reference, allowing them to understand the direction of their improvement.
The scoring is conducted in two stages. The GPT first compares the user’s answer with the correct answer and calculates the percentage of correctness. In the second stage, GPT evaluates whether the user’s answers are grammatically and semantically correct, assigning a flexible score. After comparing the scores of the two parts, the system displays the higher value in the scoring area and then determines whether the user has passed and can continue training.
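A simplified sketch of this two-stage rule is shown below. The keyword-overlap metric, prompt wording, and model name are illustrative assumptions rather than the system's actual prompts.

```python
# Illustrative two-stage scoring: stage 1 measures keyword overlap with the
# reference answer; stage 2 asks an LLM for a grammar/semantics score; the
# higher of the two becomes the final score. Prompt text and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def keyword_score(response: str, reference: str) -> float:
    ref_words = set(reference.lower().split())
    hits = sum(1 for w in response.lower().split() if w in ref_words)
    return 100.0 * hits / max(len(ref_words), 1)

def semantic_score(response: str, reference: str) -> float:
    prompt = (f"Reference answer: {reference}\nLearner answer: {response}\n"
              "Rate the learner answer from 0 to 100 for grammatical and semantic "
              "correctness. Reply with the number only.")
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}])
    return float(reply.choices[0].message.content.strip())

def final_score(response: str, reference: str) -> float:
    return max(keyword_score(response, reference), semantic_score(response, reference))
```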
Upon completion of the training, the system automatically creates an Excel file containing all the answers given during the learning process, the scores and comments provided by GPT, and the average scores for statistical purposes, making it easy for the user to review in the future and use as a reference for improvement. In addition, through the learning history function of the interface, users can view a line graph of score progress drawn from the Excel files of multiple training sessions. Seeing continuous progress from repeated practice enhances users' sense of achievement and increases their motivation to keep practicing with the system.
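The record-keeping and progress-curve steps can be sketched as follows; the file naming scheme and column names are assumptions made for illustration.

```python
# Sketch of the record-keeping step: per-turn answers, scores, and comments are
# written to an Excel file with pandas, and per-session average scores are
# plotted as a learning curve with matplotlib. Names and paths are assumptions.
import glob
import pandas as pd
import matplotlib.pyplot as plt

def save_session(user: str, session_id: int, rows: list[dict]) -> None:
    # rows: [{"turn": 1, "answer": "...", "score": 80, "comment": "..."}, ...]
    pd.DataFrame(rows).to_excel(f"{user}_session_{session_id}.xlsx", index=False)

def plot_progress(user: str) -> None:
    averages = [pd.read_excel(path)["score"].mean()
                for path in sorted(glob.glob(f"{user}_session_*.xlsx"))]
    plt.plot(range(1, len(averages) + 1), averages, marker="o")
    plt.xlabel("Training session")
    plt.ylabel("Average score")
    plt.title(f"Learning curve for {user}")
    plt.show()
```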

3.2. Class Video Generation

The generative AI English-speaking training system enables users to enhance their English-speaking skills through interactive learning and engagement with AI agents. In addition to the existing training topics in the system, users can also create their own training content in the generation interface, tailored to their specific needs. After the user enters the topic and the number of sentences and presses the script-generation button, the generated script is displayed. The user can read, study, and modify the details, or click the button again to regenerate. After the script is confirmed, the user presses the “Generate Video” button, and the system generates the files required for training and saves them automatically.
Learning content is generated through an API connection. Users input information based on their needs and request GPT to generate text content that meets the requirements. Users can make detailed adjustments after generation. Once the content meets their expectations, they can generate the videos and other files required for training. The system automatically creates a folder and saves the files to it.
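A hypothetical version of this script-generation request is sketched below; the prompt wording and model name are assumptions, not the authors' exact prompt.

```python
# Hypothetical sketch of the script-generation request: the learner's topic and
# desired number of dialog turns are inserted into a prompt, and the LLM returns
# a two-speaker script. Prompt wording and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_script(topic: str, num_turns: int) -> str:
    prompt = (f"Write an English conversation about '{topic}' between an AI agent "
              f"and a learner, with {num_turns} exchanges. Label each line "
              "'Agent:' or 'Learner:' and keep the sentences suitable for "
              "intermediate speakers.")
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}])
    return reply.choices[0].message.content
```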
Figure 4 shows the GUI of script generation. Area A allows users to enter the topic and the number of sentences required to generate the script. Area B contains the “Generate Script and Video” button. The user first clicks “Generate Script” and can make detailed modifications. Then, they click the “Generate Video” button to generate the teaching video required for training. Area C displays the script generated by GPT for users to view and modify.
Figure 5a shows the generation interface. D-ID generates the class video. Area A is the status bar for video generation, indicating that videos are being generated. Once the generation is complete, the status bar for Area B will be displayed as shown in Figure 5b. Then, the user can access the training interface to begin the lesson. When user-generated videos are used in training lessons, they will be displayed as shown in Figure 5c.
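For illustration, a single agent line could be turned into a lip-synced clip with a request such as the one below. The endpoint and payload follow D-ID's publicly documented Talks API, but the field names, authentication details, and placeholder values shown here are assumptions and should be checked against the current documentation.

```python
# Hedged sketch of a D-ID "talks" request that turns one agent line into a
# lip-synced video. Field names and auth details are assumptions; consult the
# current D-ID API documentation before use.
import requests

D_ID_API_KEY = "YOUR_D_ID_API_KEY"              # placeholder credential
AVATAR_URL = "https://example.com/avatar.png"   # placeholder portrait image

def create_talk_video(agent_line: str) -> str:
    response = requests.post(
        "https://api.d-id.com/talks",
        headers={"Authorization": f"Basic {D_ID_API_KEY}",
                 "Content-Type": "application/json"},
        json={"source_url": AVATAR_URL,
              "script": {"type": "text", "input": agent_line}},
    )
    response.raise_for_status()
    return response.json()["id"]   # the talk id is later polled for the finished video URL
```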

3.3. Mobile Applications

The system also provides a mobile application to enhance user convenience, enabling ubiquitous accessibility. Figure 6 shows the interface of the proposed generative AI English-speaking training system. Figure 6a shows the training interface in the mobile version. Area A is the conversation video area; Area B allows the user to enter their name before starting; Area C provides topic and difficulty selection; Area D contains the start and pause controls; Area E gives user-friendly tips; Area F displays recognized speech text; and Area G shows AI-generated scores and feedback. When users want to generate videos themselves, they can switch to the generation interface shown in Figure 6b, in which Area A is for entering the training topic and number of dialog sentences; Area B provides buttons to generate the script and video, allowing users to create, edit, or regenerate scripts and then generate a video through D-ID; Area C displays the generated script for review; and Area D shows the video generation status. After a lesson is practiced, Area A displays the complete training history in a table format, including user answers, scores, and suggestions, and Area B shows the progress chart based on historical results, as shown in Figure 6c. Finally, Figure 6d illustrates the side panel interface, which enables easy switching between functions. Overall, the system forms a complete learning cycle of training, dialog generation, and performance review, making English learning structured and efficient. The slogan “keep on going, never give up” encourages persistence and improvement.

3.4. Learning Outcome

This study provides a rubric table to ChatGPT, which assigns scores based on users’ speech recognition results. Instead of relying on rigid standard answers, which may not fully capture the diversity of natural language expressions, the system uses ChatGPT to analyze the grammatical correctness and overall structure of the user’s responses. This system allows for assessing a wide range of valid answers, even if they differ in wording from expected responses. The system ensures that users receive the most appropriate and equitable scores by focusing on grammar, sentence fluency, and coherence. This method addresses the limitations of traditional answer-matching systems, provides more accurate feedback, and enables learners to identify areas for improvement without penalizing creative or alternative phrasing.
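One way to pass such a rubric to ChatGPT is sketched below; the rubric text mirrors the four dimensions described in Section 4.3, while the prompt wording and model name are assumptions for illustration only.

```python
# Illustrative prompt construction for rubric-based scoring. The four dimensions
# mirror the rubric described in Section 4.3 (Table 1); the exact prompt text
# and model name are assumptions rather than the authors' original prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the learner's reply on four dimensions, 0-5 points each:\n"
    "1. Sentence naturalness\n"
    "2. Semantic clarity\n"
    "3. Grammatical completeness\n"
    "4. Overall comprehensibility\n"
    "Multiply the total by 5 to obtain a 0-100 score, then give one sentence of feedback."
)

def rubric_feedback(reference: str, learner_reply: str) -> str:
    messages = [
        {"role": "system", "content": RUBRIC},
        {"role": "user",
         "content": f"Reference answer: {reference}\nLearner reply: {learner_reply}"},
    ]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content
```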
As shown in Figure 7a, the comment indicates that the user’s reply is missing the verb “book,” making the sentence grammatically incorrect and semantically incomplete. Nevertheless, the system recognizes partial correctness and the right intention, so it does not receive a failing score; instead, a score of 60 fairly reflects that the response demonstrates some understanding and structural accuracy but is weakened by missing words and lack of completeness. Similarly, in Figure 7b, the user’s reply also receives a score of 60 because, as noted in the comment, the sentence is incomplete and lacks the necessary context to be grammatically correct.
While the beginning demonstrates an understanding of basic sentence structure and intent, omitting the object after “have” renders the statement flawed and unclear, leading GPT to award only partial credit. In contrast, Figure 7c shows an example where GPT assigns a score of 75, as the response contains a less severe language issue. The incorrect use of the preposition “on” for months and the missing phrasing “for one week” reduce grammatical accuracy and fluency, though the overall meaning remains understandable. These errors justify a score of 75 rather than a perfect evaluation.

4. Experimental Results

The system provides three levels of conversation difficulty. In Level 1, the simplest level, the dialog video displays the AI agent's speech content and provides the user with the sentences that need to be answered, making it the best choice for beginners who want to improve their English-speaking skills. Level 2 is the moderately complex level: the dialog video no longer provides the complete sentences, but the system shows the sentence frame in the prompting area, and the user only needs to supply the missing verb or unit of measure in the blanks and recite the complete sentence to finish the speaking exercise. The most challenging level is Level 3, where the system provides only a few keywords, such as specific names or places, and the user must respond to the AI agent according to the given conditions.
The dialog image takes Level 1 as an example, as shown in Figure 8a. Since Level 1 is the simplest training, the system provides complete answer sentences in the video, and the subtitles indicate the AI agent's speech and the sentences the user needs to answer. Users can follow the subtitles to complete the dialog training and familiarize themselves with basic responses through repeated practice. Figure 8b illustrates a Level 2 course, where the user's correct response receives a score of 100. ChatGPT comments that it is grammatically and semantically correct, and provides full hints. In Figure 7a, the user's response is flawed but still passes with a score of 60. ChatGPT suggests corrections and attaches the correct answer for reference. At medium difficulty, the prompt field provides most of the sentence structure, and the user only needs to supply elements such as verb changes to ensure semantic integrity. If the response is far from the answer, a red cross appears in the scoring area to indicate it is unacceptable, as shown in Figure 8c. The comment area then shows the number of attempts left, while ChatGPT's suggestions and answers are hidden to prevent users from relying on them and to encourage independent recall. Finally, Figure 8d shows a case where the user answers incorrectly three times; at this point, ChatGPT provides suggestions and the correct answers in the comment area, and the conversation proceeds to the next exercise.
After training, Figure 9a shows the results of a single training session. All data is sorted by video number, and the sentences, ratings, comments, and average scores for each attempt are saved in an Excel file. The first row shows the video number, which is saved based on the number of attempts. The second row records the response text corresponding to the number; the third and fourth rows contain the ratings and comments generated by ChatGPT. Users can view these messages by clicking the “Exercise Results” button to identify areas for improvement and address any shortcomings. After completing multiple training sessions, users can click the “Learning Portfolio” button in Figure 2 and select a name to view their learning history across multiple practice sessions. The average conversation accuracy is plotted as a line chart, as shown in Figure 9b, to facilitate users’ review of past performance. The chart shows that the user’s accuracy rate improves after repeated practice, significantly enhancing their English-speaking ability.
The hyperlink of the demo video for the proposed system is given as https://youtu.be/OHk0kxVMQco?si=LPmteWxWPk1b3ksa (accessed on 20 December 2025).

4.1. Experimental Procedures

The participants in this study were undergraduate students enrolled in English courses at a university in Taiwan. To ensure a comparable baseline of language proficiency, five inclusion criteria were applied: (1) participants were third- or fourth-year undergraduates who had completed at least two semesters of compulsory university English courses; (2) they were classified as intermediate level according to the university’s English placement test; (3) their English proficiency was equivalent to IELTS scores of 6.0–6.5; (4) based on self-reported questionnaires, they had an average of approximately 15 years of English learning experience; and (5) they had never resided long-term in an English-speaking country.
In addition, three exclusion criteria were adopted: (1) the presence of hearing impairments or speech articulation disorders that could affect speech recognition performance; (2) current participation in intensive English tutoring programs or exchange programs; and (3) failure to complete the pre-test. Based on these criteria, a total of 100 students were ultimately recruited to participate in this study.
The participant screening procedure was conducted in four stages, as illustrated in Figure 10. First, volunteer students were recruited from intermediate-level English courses at the university to form the initial pool of potential participants. In the second stage, a self-reported background questionnaire was administered to screen students who met the inclusion criteria and to exclude those who did not satisfy the requirements. In the third stage, the researchers provided eligible students with a detailed explanation of the study objectives, procedures, and participant rights, and obtained written or online informed consent. Finally, the 100 qualified participants were randomly assigned to either the experimental group or the control group, with 50 participants in each group, while ensuring that the two groups were relatively balanced in terms of background characteristics and English proficiency levels.
To ensure that participants in the experimental group could effectively use the English-speaking training system developed in this study, we conducted a standardized system usage and operational training session prior to the formal experiment. First, the researcher demonstrated the system login procedure, task presentation interface, voice recording process, and the feedback mechanism provided after each practice session. Next, each participant in the experimental group was required to complete 3–5 trial questions to confirm that they could correctly activate the audio recording, complete spoken responses, and review the system feedback. During the trial phase, the researcher provided individual assistance to resolve technical or operational issues, such as microphone configuration problems, insufficient input volume, or incorrect procedures. Additionally, a concise user manual was provided, outlining the system’s workflow and standard troubleshooting guidelines, to ensure that all participants were familiar with the system’s operation before the formal training sessions began.
Figure 11 illustrates the experimental workflow of the system. This study adopted a pre-test–training–post-test experimental design. First, all participants completed a pre-test using the proposed system prior to the experiment to obtain baseline scores and average response times. The pre-test items and testing procedures were identical for both groups.
The experimental group then underwent a four-week English-speaking training period using the proposed system, during which they were required to complete ten training sessions. In contrast, the control group followed traditional English instruction, which included classroom reading, note-taking, oral discussions, and after-class review, without utilizing the proposed system. The system automatically recorded the experimental group’s responses and average response times for each training session for subsequent analysis.
After the experimental group completed the ten training sessions, all participants were administered a post-test within one week of the final session. The post-test content was identical to that of the pre-test. By comparing changes in test scores and average response times between the pre-test and post-test, the effectiveness of the proposed system in improving English speaking ability and language response fluency was evaluated.

4.2. Analysis of Learning Results

To evaluate the learning effectiveness of the proposed system, this study adopted a randomized controlled experimental design. Participants were randomly assigned to either an experimental group or a control group to compare learning outcomes under different instructional methods. The participants consisted of 100 junior and senior students from Feng Chia University, Taiwan. Their English proficiency levels were approximately equivalent to an IELTS score of 6.0–6.5, and they were placed in intermediate-level English courses based on the university’s English placement test. All participants had previously completed at least two semesters of required university English courses. According to self-reported language learning histories, the participants had studied English for an average of 14.2 years, and none had lived in an English-speaking country. This participant composition ensured a relatively consistent learning baseline, thereby minimizing the impact of individual differences on the experimental results.
Participants were randomly assigned to two groups: 50 in the experimental group and 50 in the control group. The experimental group used the English-speaking training system developed in this study and completed multiple practice sessions during the learning period. In contrast, the control group did not use the system; instead, they engaged in traditional English-learning activities, such as reading course materials, taking written notes, participating in classroom discussions, and conducting self-review.
To ensure experimental consistency, both groups were given the same learning duration, task content, and testing procedures. The only independent variable was whether the participants used the proposed system. Before formal learning began, all participants completed a pre-test using the system to obtain baseline performance scores and average response time, which served as the pre-test data. During the learning phase, the experimental group completed ten practice sessions, with the system automatically recording their performance scores and response times after each session. Learning effectiveness was evaluated using two key indicators:
(1) Whether test scores exhibited an upward trend as the number of practice sessions increased.
(2) Whether the average response time per item decreased progressively.
These two measures enabled us to evaluate the system’s impact on enhancing learners’ language proficiency and spoken fluency. After the learning phase, both groups completed the identical post-test, during which the system recorded their performance scores and average response time. By comparing the pre-test and post-test results, the effectiveness of the system in improving operational proficiency and language ability could be evaluated. The experimental results showed that participants in the experimental group exhibited significant improvements in test scores after repeated use of the system, and their average response time decreased substantially, indicating enhanced fluency in answering and task familiarity. These findings confirm the effectiveness of the proposed system in facilitating language learning and support our hypothesis that repeated interactive practice strengthens memory retention and improves operational efficiency, thereby demonstrating meaningful educational value.
Figure 12a illustrates the distribution of average response times for the 50 participants in the control group during the pre-test and post-test. Although the control group showed a slight reduction in response time in the post-test, the degree of improvement was minimal, and substantial heterogeneity was observed across individuals. Some participants responded faster in the post-test than in the pre-test, suggesting that natural familiarity with the test content or general self-directed learning can still enhance oral English fluency to some extent. However, several participants exhibited longer response times in the post-test, indicating that without systematic training, learning outcomes are inconsistent. This pattern shows that traditional learning approaches (e.g., in-class practice, reading, note-taking, or self-practice) may offer limited improvement in language output fluency but cannot ensure uniform progress among learners within a short period. Moreover, performance is highly influenced by individual differences and variable learning strategies. In other words, the control group demonstrated only limited and unstable improvement in language response speed.
Figure 12b presents the changes in average response time for the 50 participants in the experimental group between the pre-test and post-test. All participants in the experimental group demonstrated reduced response times in the post-test, indicating that practice with the proposed oral training system led to substantial improvements in spoken fluency and reaction speed. The overall trend line shows a clear downward trajectory, with every participant responding more rapidly in the post-test. Furthermore, the variability in response times decreased compared with the pre-test, suggesting that the system produced consistent benefits across learners with different English proficiency levels. These results indicate that systematic, multi-round oral practice—incorporating speech-recognition feedback, real-time question responses, and speed-focused speaking exercises—effectively enhances the automaticity of language production, enabling participants to exhibit more fluent and faster spoken output in the post-test.
To estimate the accuracy distributions of the experimental and control groups in both the pre-test and post-test, this study employed Kernel Density Estimation (KDE). This non-parametric density estimation method does not require assumptions regarding the underlying probability distribution of the data. The kernel density estimator provides an estimated probability density function (PDF) for a random variable. For any real value x, the KDE is defined as
\hat{f}_h(x) = \frac{1}{NB} \sum_{i=1}^{N} K\left( \frac{x - x_i}{B} \right)    (1)
where N denotes the sample size, x_i represents the i-th sample point, and B is a smoothing parameter.
In this study, the Gaussian kernel was adopted, and its kernel function K(x) can be computed by
K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}    (2)
By substituting Equation (2) into Equation (1), the estimated probability density function (PDF) of the KDE can be expressed as
\hat{f}_h(x) = \frac{1}{NB} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi}} e^{-\frac{(x - x_i)^2}{2B^2}}    (3)
where x denotes the point at which the density is to be estimated.
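For reference, Equation (3) can be implemented directly in a few lines of NumPy; the sample scores and bandwidth below are placeholders, not the study's data.

```python
# Direct NumPy implementation of the Gaussian KDE in Equations (1)-(3),
# applied to a set of accuracy scores; the scores and bandwidth are placeholders.
import numpy as np

def gaussian_kde(samples: np.ndarray, x: np.ndarray, bandwidth: float) -> np.ndarray:
    """Estimate the PDF at the points x from the given samples (Equation (3))."""
    n = len(samples)
    diffs = (x[:, None] - samples[None, :]) / bandwidth            # (x - x_i) / B
    kernel_vals = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel K
    return kernel_vals.sum(axis=1) / (n * bandwidth)               # average over samples

# Example: density of hypothetical post-test accuracy scores (percent)
scores = np.array([88, 90, 92, 91, 95, 89, 93, 94, 90, 92], dtype=float)
grid = np.linspace(50, 100, 200)
density = gaussian_kde(scores, grid, bandwidth=3.0)
```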
Using the KDE method allows the probability density function to be estimated from a finite set of samples and visualized in the form of kernel density plots. This method facilitates the comparison of distributional differences between pre-test and post-test scores as well as between groups. Figure 13 illustrates the accuracy distributions of both the experimental and control groups across the pre-test and post-test, estimated using Kernel Density Estimation (KDE) to visualize the shape and central tendency of the score distributions at each assessment stage.
The figure contains four density curves representing the experimental group (pre-test) and the experimental group (post-test), as well as the control group (pre-test) and the control group (post-test). At the pre-test stage, the two groups exhibit highly similar distributions, with density peaks concentrated around the 55–65% range, indicating comparable initial language proficiency prior to the intervention. In contrast, the post-test results reveal a pronounced divergence. The experimental group’s post-test distribution (dark red) shifts substantially to the right, with a sharp density peak near the 90% accuracy level. This result demonstrates not only a marked improvement in overall correctness but also a more concentrated distribution of scores, suggesting that learners consistently benefited from repeated practice with the proposed system. The sharper peak further indicates strengthened learning stability and homogeneity among participants. By comparison, although the control group’s post-test distribution (dark blue) also shifts to the right, its peak remains within the 70–85% range, reflecting a noticeably smaller improvement. Moreover, the control group’s distribution remains more dispersed, with the lower-accuracy tail extending further than that of the experimental group. This result suggests that some learners did not achieve meaningful progress through traditional or self-directed learning, resulting in greater variability in learning outcomes. Based on the KDE distributions shown in Figure 13, several conclusions can be drawn:
  • The experimental group exhibits a substantial rightward shift and increased concentration in the post-test distribution, indicating that the proposed training system effectively enhances response accuracy.
  • The control group shows only moderate improvement, with considerable individual variability.
  • The clear divergence between the two groups in the post-test phase demonstrates the significant learning benefits provided by the system developed in this study.
Figure 14a illustrates the distribution of accuracy improvement (post-test minus pre-test) for both the experimental and control groups. A notable disparity is apparent between the two groups. The median improvement for the experimental group is approximately 28%, substantially higher than the control group’s median improvement of about 18%, indicating that learners who used the proposed training system achieved more pronounced gains in spoken English accuracy. In addition, the interquartile range (IQR) of the experimental group is more compact and positioned noticeably higher than that of the control group, suggesting that the performance enhancement among experimental participants is more consistent and stable. Conversely, the control group exhibits a broader distribution, reflecting greater variability in learning outcomes; some participants even showed stagnation or decline in accuracy without systematic practice. Overall, the results presented in Figure 14a demonstrate that the proposed English-speaking training system effectively enhances learners’ response accuracy, yielding not only larger improvement magnitudes but also reduced variability, thereby outperforming traditional learning approaches.
Figure 14b compares the average response time on the post-test between the experimental and control groups to evaluate differences in spoken English reaction speed and answering fluency. The results show that the median response time of the experimental group is approximately 5.8 s, which is significantly faster than the 7.6 s median observed in the control group. This difference indicates that learners who engaged with the proposed training system demonstrated a clear advantage in spoken English speed. Furthermore, the boxplot distribution reveals that the experimental group exhibits a narrower interquartile range, with most data concentrated between 5.5 and 6.2 s, suggesting stable and consistent answering speed. In contrast, the control group shows a much wider distribution range (approximately 6 to 8.8 s), indicating greater variability in reaction speed and less improvement in fluency when relying on traditional learning methods. The results presented in Figure 14b support the hypothesis of this study: repeated practice using the proposed system can effectively enhance learners’ spoken English proficiency, leading to more fluent and rapid responses.
In this study, notched boxplots were employed to illustrate the differences in median performance between groups. The notches in the boxplot represent the 95% confidence interval of the median, estimated using the formula proposed by McGill et al. [35]. Therefore, non-overlapping notches indicate a statistically significant difference at the α = 0.05 level. Figure 15a presents the distribution of post-test accuracy scores for the experimental and control groups, using notched boxplots to facilitate group comparison. The plot shows that the median accuracy of the experimental group is approximately 93%, whereas the control group exhibits a significantly lower median of around 83%. The notches between the two groups do not overlap, indicating a statistically significant difference in median accuracy at the 95% confidence level.
Additionally, the experimental group demonstrates a more concentrated interquartile range and highly consistent performance, with maximum scores approaching 100%, suggesting that most learners achieved high accuracy after training with the proposed system. In contrast, the control group exhibits a broader distribution, with minimum scores falling below 60%, indicating substantial individual variability among learners using traditional study methods and suggesting that some participants did not achieve adequate proficiency.
Figure 15b compares the average post-test response times of the experimental and control groups using notched boxplots to highlight differences in median performance. The results show that the median response time of the experimental group is approximately 5.8 s, which is substantially faster than that of the control group, which is about 7.6 s. The non-overlapping notches between the two groups indicate a statistically significant difference in median response time at the 95% confidence level. Moreover, the experimental group exhibits a narrow distribution range, suggesting that most learners demonstrated consistent and improved oral response speed after using the proposed system. In contrast, the control group displays greater variability, with some participants requiring more than 8.5 s to respond even in the post-test, indicating that traditional learning strategies such as reading or independent practice are less effective in rapidly enhancing spoken fluency. Figure 15b demonstrates that learners who trained with the proposed system achieved significantly faster and more fluent English oral responses, confirming the system’s effectiveness in improving spoken language fluency.
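For illustration, such notched boxplots can be produced with matplotlib, whose notch option uses the McGill interval (median ± 1.57 × IQR / √n) for the 95% confidence interval of the median; the group data below are randomly generated placeholders, not the study's measurements.

```python
# Sketch of the notched-boxplot comparison; matplotlib's notch=True option draws
# the McGill 95% CI of the median, so non-overlapping notches indicate a
# significant median difference. The accuracy values are placeholders.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
experimental = rng.normal(93, 3, 50)   # placeholder post-test accuracies (%)
control = rng.normal(83, 6, 50)        # placeholder post-test accuracies (%)

plt.boxplot([experimental, control], notch=True)
plt.xticks([1, 2], ["Experimental", "Control"])
plt.ylabel("Post-test accuracy (%)")
plt.title("Notched boxplots of post-test accuracy")
plt.show()
```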
Figure 16a presents the results of a one-way ANOVA conducted on the post-test accuracy scores of the experimental and control groups. The between-group sum of squares (SS) was 2614.8, with a corresponding mean square (MS) of 2614.77. The within-group error sum of squares was 8047.5, yielding a mean square of 82.12. The resulting F-value was 31.84, with a p-value of 1.6244 × 10−7, which is highly significant at the α = 0.001 level. These results indicate a statistically significant difference in post-test accuracy between the two groups, demonstrating that the proposed training system produced a clear and measurable improvement in learners’ English-speaking accuracy. This finding is consistent with the distribution patterns observed in the boxplot in Figure 15a, and the inferential statistical test further confirms that the observed group difference is unlikely to be attributed to random variation.
Figure 16b presents the ANOVA results for the post-test response times of the experimental and control groups. The between-group sum of squares was 79.975, with a corresponding mean square of 79.9748; the within-group error sum of squares was 40.235, yielding a mean square of 0.4106. The resulting F-value was 194.79, with an associated p-value of 5.01317 × 10−25, far below the 0.001 threshold. These statistical results indicate that the difference in response times between the two groups is not due to random variation but is substantially influenced by the use of the proposed system. In other words, the system significantly reduced learners’ English-speaking response times. This finding is entirely consistent with the trend observed in Figure 15b, further confirming that the proposed system effectively improves English-speaking fluency.
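A one-way ANOVA of this kind can be reproduced with SciPy as sketched below; the arrays are randomly generated placeholders rather than the study's actual scores.

```python
# Minimal sketch of the one-way ANOVA reported above using scipy.stats.f_oneway;
# the two arrays are placeholder data, not the study's measurements.
import numpy as np
from scipy import stats

experimental_acc = np.random.default_rng(1).normal(93, 3, 50)   # placeholder values
control_acc = np.random.default_rng(2).normal(83, 6, 50)        # placeholder values

f_value, p_value = stats.f_oneway(experimental_acc, control_acc)
print(f"F = {f_value:.2f}, p = {p_value:.3g}")   # p < 0.001 indicates a significant group difference
```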
The experimental results show that participants’ test scores increased significantly after multiple sessions of practice using the proposed system, while their average response time per item decreased markedly. These improvements indicate enhanced answer fluency and operational proficiency. The findings verify the effectiveness of the system in promoting English-speaking ability and support our hypothesis that repeated interactive practice using the proposed system can indeed facilitate substantial gains in spoken English performance.

4.3. Discussions

This study enlisted a human rater (an experienced English instructor) to evaluate the responses based on a rubric table. Table 1 presents the rubric used in this study to assess users' English responses. The evaluation consists of four dimensions: sentence naturalness, semantic clarity, conformity of grammatical structure to native-speaker usage, and overall comprehensibility. The same rubric table was also provided to ChatGPT to ensure consistency in the scoring criteria.
For example, the reference answer provided by the AI agent was “I went to school yesterday.” When the participant responded with “I go to school yesterday,” the response was evaluated by a human rater using the rubric shown in Table 1.
As summarized in Table 2, the rubric score was 17 points (corresponding to an overall score of 85). The human rater’s qualitative feedback indicated that the response was generally understandable; however, it contained a grammatical error related to verb tense. Although the user’s intended message was conveyed successfully, semantic precision was compromised due to incorrect tense usage.
Table 3 presents the scoring results obtained using ChatGPT. The detailed explanations for each scoring dimension are as follows.
(1) Sentence naturalness
The sentence follows a commonly used conversational structure. The only issue lies in the incorrect verb tense, which does not match the temporal adverb yesterday. While a native speaker would recognize the response as produced by a non-native speaker, the sentence does not sound unnatural or awkward in context. Therefore, ChatGPT assigns a rubric score of 4 points (equivalent to 20 points).
(2)
Semantic clarity
The meaning of the response is entirely clear, and the intended message—“going to school yesterday”—can be understood without any need for inference or guesswork. Accordingly, this dimension received the highest rubric score of 5 points (equivalent to 25 points).
(3)
Grammatical completeness
The sentence exhibits a complete grammatical structure, consisting of a subject, a verb, an object, and a temporal adverb. However, the verb tense is incorrect (go instead of went), which constitutes a grammatical error. As a result, this dimension was assigned 4 points (equivalent to 20 points).
(4)
Overall Comprehensibility
The response is entirely understandable and provides an appropriate answer to the question. Although it contains a grammatical tense error, the error does not hinder comprehension. The response can therefore be classified as “communicatively successful with minor grammatical errors.” This dimension received 5 points (equivalent to 25 points).
Although the sentence contains a verb tense error, its structure is complete, its meaning is clear, and it successfully addresses the prompt. The grammatical error does not impair overall understanding. Consequently, the response received the highest scores in semantic clarity and overall comprehensibility and is categorized as a comprehensible response with minor grammatical errors.
A comparison between the human rater’s evaluation and the ChatGPT assessment reveals both notable similarities and systematic differences. The similarities in scoring outcomes are as follows:
(1)
Consistent judgment of communicative success
Both ChatGPT and the human rater agreed that the response successfully addressed the question. The core meaning—“going to school yesterday”—was conveyed clearly and unambiguously. Consequently, both evaluators classified the response as “communicatively successful despite grammatical errors” and assigned the highest scores for semantic clarity and overall comprehensibility. This consistency suggests that ChatGPT’s judgment, in terms of global understanding and communicative effectiveness, closely aligns with that of an experienced human rater.
(2)
Shared interpretation of the nature of the grammatical error
The human rater noted that although the response contained an apparent grammatical defect, the error was isolated and did not affect the core meaning. Similarly, ChatGPT identified the verb tense error as a single grammatical mistake that did not interfere with the comprehension of the main message. Both evaluators therefore treated the error as a localized, non-fatal grammatical issue, rather than as evidence of a broader language ability deficiency.
The differences in scoring outcomes are as follows:
(1)
Differences in penalty strategy and grading strictness
The primary divergence emerged in the overall comprehensibility dimension: ChatGPT assigned a full score (5), whereas the human rater assigned a score of 4. This difference reflects the human rater’s greater emphasis on semantic precision, particularly with respect to temporal accuracy. From a pedagogical perspective, the tense error was considered to slightly compromise the precision of meaning, resulting in an additional penalty to the overall comprehension level.
(2)
Divergent scoring orientations
ChatGPT employs a communication-oriented scoring logic: as long as the intended meaning can be inferred clearly, high scores are awarded, and errors are penalized only within the specific dimension in which they occur (e.g., grammar). In contrast, the human rater follows a more pedagogically oriented approach, whereby a single grammatical error may influence multiple evaluation dimensions, reflecting its potential instructional significance. This difference explains why the human rater’s total score (85) is slightly lower than ChatGPT’s score (90).
The above comparison demonstrates a high degree of agreement between ChatGPT and the human rater in terms of scoring trends and error characterization. Both evaluators judged the response to be communicatively successful and identified the grammatical issue as a single, non-critical error. The observed score discrepancy arises primarily from differing levels of strictness regarding the impact of grammatical errors on semantic precision, rather than from any fundamental disagreement about the learner’s language competence.
If no rubric table is provided and ChatGPT is asked to assign scores directly, the resulting evaluations may exhibit considerable variance. In contrast, when an explicit rubric table and scoring examples are provided, ChatGPT-4o demonstrates consistent and fair grading [36]. The current ChatGPT-5.2 version exhibits even higher scoring stability; when used in conjunction with the rubric table, the assigned scores are both reasonable and stable.
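The sketch below illustrates one possible way of supplying the rubric to the model so that every response is scored against the same fixed criteria; the prompt wording, the model identifier, and the function name are assumptions for illustration and do not reproduce the exact prompt or configuration of the deployed system.

```python
# Illustrative rubric-constrained scoring call (prompt text and model name are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC_PROMPT = """You are an English-speaking rater. Score the learner's response
on four dimensions, each from 1 (worst) to 5 (best):
1. Sentence naturalness
2. Semantic clarity
3. Grammatical completeness
4. Overall comprehensibility
Return the four integer scores and one sentence of feedback as JSON."""

def score_response(question: str, reference: str, learner_answer: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model id; the study used a newer ChatGPT version
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Learner response: {learner_answer}"
            )},
        ],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return completion.choices[0].message.content

# Example from the paper: a tense error that remains fully comprehensible.
print(score_response("What did you do yesterday?",
                     "I went to school yesterday.",
                     "I go to school yesterday."))
```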
To illustrate the application of the scoring rubric, we provide an example of a severely incorrect response. The question posed by the AI agent was: “What did you do over the weekend?” The user’s response was: “I weekend tomorrow because friend happy no money yesterday.” Table 4 shows the scoring results.
The user’s final score was 20 points. The detailed evaluation by ChatGPT version 5.2 is as follows:
(1)
Sentence naturalness
The word order does not conform to standard English usage. The verb usage is incorrect (“I weekend” is not a valid verb phrase), and the tense usage is inconsistent, with “tomorrow” and “yesterday” appearing simultaneously. Native speakers would rarely use such a sentence structure. Score: 1 point.
(2)
Semantic clarity
It is impossible to determine with certainty what the user did over the weekend. The causal relationship expressed in “because friend happy no money” is semantically unclear, and the reader must rely heavily on guesswork to infer any meaning. Score: 1 point.
(3)
Grammatical completeness
The response lacks a complete sentence structure, with no correct subject–verb construction. Verb tense usage is incorrect, and conjunctions and relational words are misused. Score: 1 point.
(4)
Overall Comprehensibility
Even when considering the surrounding context, the response remains difficult to understand and fails to effectively answer the question (i.e., it does not state what was done over the weekend). This response falls into the category of communication failure. Score: 1 point.
Because the participant’s response exhibits severe deficiencies in grammatical structure, semantic expression, and sentence naturalness, and fails to address the question meaningfully, the overall comprehensibility is exceptionally low. Consequently, the response received the lowest rating (1 point) across all four evaluation dimensions, resulting in a rubric score of 4 points and a final scaled score of 20 points, representing a case of severe communication failure.
Although the primary scoring mechanism in this study is conducted automatically by the proposed ChatGPT-based evaluation system, human evaluation was included as a supplementary validation reference. It should be acknowledged that the human assessment relied on a single experienced rater, which may limit inter-rater reliability and the generalizability of the validation results. However, this design choice was intentional, as the role of human scoring in this study was not to serve as the primary evaluation standard, but rather to verify the consistency and plausibility of the AI-generated scores. The human rater followed a predefined and consistent rubric to minimize subjective bias. Future work will incorporate multiple human raters and inter-rater agreement analysis to further strengthen the robustness of the validation process.
Concerning system latency, the response delay, measured from the moment the user finishes speaking to the moment the system displays the GPT-generated reply, ranges from approximately 2 s in the best case to an average of about 10 s, which is acceptable for interactive language-learning applications.
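As a simple illustration of how this latency can be measured, the sketch below times a reply-generation call end to end; the `generate_reply` argument and the commented usage are hypothetical placeholders rather than the system’s actual pipeline functions.

```python
# Simple end-to-end latency measurement around the reply-generation step (illustrative).
import time

def timed_reply(generate_reply, transcript: str) -> tuple[str, float]:
    """Call a reply-generation function and return (reply, latency in seconds).

    `generate_reply` stands in for whatever function sends the recognized
    transcript to the language model and returns the reply to be displayed.
    """
    start = time.perf_counter()
    reply = generate_reply(transcript)
    return reply, time.perf_counter() - start

# Usage (hypothetical function name):
# reply, latency = timed_reply(score_and_respond, "I go to school yesterday")
# print(f"reply latency: {latency:.1f} s")
```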
The cost of using GPT as the primary response-generation engine is relatively low. Under a credit-based usage model, each query consumes approximately 210 tokens, resulting in a monthly cost of less than USD 2.5. In contrast, video-generation services such as D-ID incur higher expenses, averaging USD 7.5 per month. The combined operational cost for running both services is approximately USD 10 per month, making the system cost-efficient and sustainable.
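The reported monthly figures can be sanity-checked with a short back-of-the-envelope calculation; the per-token price, queries per session, and sessions per month below are assumed values chosen only to illustrate the arithmetic and will differ under actual usage and pricing.

```python
# Back-of-the-envelope monthly cost estimate (all parameters are assumptions).
TOKENS_PER_QUERY = 210          # reported average tokens consumed per query
PRICE_PER_1K_TOKENS_USD = 0.01  # assumed blended price per 1,000 tokens
QUERIES_PER_SESSION = 20        # assumed conversational turns per practice session
SESSIONS_PER_MONTH = 50         # assumed practice sessions per month

gpt_cost = (TOKENS_PER_QUERY / 1000) * PRICE_PER_1K_TOKENS_USD \
           * QUERIES_PER_SESSION * SESSIONS_PER_MONTH
did_cost = 7.5                  # reported D-ID video-generation cost per month (USD)

print(f"GPT cost/month  : USD {gpt_cost:.2f}")   # ~USD 2.1 under these assumptions
print(f"D-ID cost/month : USD {did_cost:.2f}")
print(f"Total/month     : USD {gpt_cost + did_cost:.2f}")
```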
In terms of scalability, the system architecture allows for future expansion and enhancement. Planned improvements include integrating AI-driven pronunciation analysis and computer vision modules, expanding platform compatibility, leveraging cloud-based deployment, and supporting additional languages to meet broader educational and commercial needs.
Concerning speech recognition reliability, the recognition module employed in this system achieves an accuracy of approximately 97%, providing a reliable foundation for downstream tasks such as scoring, feedback generation, and learning analytics.
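Recognition accuracy of this kind is commonly reported as one minus the word error rate (WER) between a reference transcript and the recognizer output; the minimal sketch below shows a standard way to compute it, with hypothetical transcripts rather than data from this study.

```python
# Minimal word-error-rate (WER) computation for checking recognition accuracy.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical utterance: one substituted word out of seven -> WER ~0.14, accuracy ~86%.
ref_text = "i went to school by bus yesterday"
hyp_text = "i went to school by bus yesterdays"
accuracy = 1.0 - wer(ref_text, hyp_text)
print(f"word accuracy: {accuracy:.1%}")
```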
The reliability of the proposed system’s scoring remains limited by the accuracy of speech recognition and occasional inconsistencies in GPT-based evaluation. Future work will expand the scoring dimensions, improve the system’s robustness to diverse accents, and validate it with broader learner populations.
During the experimental phase, all participants read and signed an informed consent form prior to using the system. The consent form explained that their speech data would be collected for research purposes, clarified that the audio recordings would be transmitted to and processed by a third-party cloud service (ChatGPT), and stated explicitly that participants retained the right to withdraw from the study at any time.
This study falls under the category of exempt research, as it involves a small-scale, voluntary participation test during the proof-of-concept stage and does not include any sensitive personal information or interventional procedures. Accordingly, a formal Institutional Review Board (IRB) review was not required. Nevertheless, we adhered strictly to all applicable ethical principles throughout the study.
This study incorporates a comprehensive ethics statement and data protection protocol to safeguard the rights and privacy of participants. All participants provided written informed consent prior to participation. The ethical declaration clearly specifies that all collected speech data were used exclusively for research purposes and were de-identified (anonymized) before analysis. In accordance with the data protection policy, all research data will be permanently destroyed upon completion of the study.
Although the proposed system employs cloud-based speech recognition services, no personally identifiable information is collected, stored, or processed at any stage. The uploaded voice data are used solely for transient speech-to-text conversion. They are not linked to user names, identifiers, biometric information, or any metadata that could enable individual identification. The system does not perform speaker identification or voiceprint analysis, and audio inputs are not retained after recognition is completed. The recognized speech content is used only for linguistic evaluation and feedback generation. Consequently, the processed data cannot be traced back to specific individuals, and the risk of privacy infringement is minimal. The study, therefore, adheres to the relevant ethical guidelines for research involving non-identifiable data.

5. Conclusions

The proposed system is developed using generative AI: GPT generates the English conversational content, D-ID generates the AI agent, and GPT intelligently scores each conversation by analyzing the speech-recognition output and its grammatical accuracy, providing real-time scoring and feedback so that users can make timely corrections and improvements in their subsequent responses. In addition, the generative AI dynamically adjusts the scoring according to the errors in the user’s spoken responses, making the scoring results more flexible and accurate, comparable to the judgment of a real English teacher. Experimental results show that the AI scoring mechanism is reasonable and helpful to users, and that users’ English-speaking ability improves significantly after repeated use.

Author Contributions

Conceptualization, C.-T.L.; methodology, C.-T.L., Y.-J.C., T.-Y.W., and Y.-Y.L.; software, Y.-J.C. and T.-Y.W.; validation, C.-T.L., Y.-J.C., T.-Y.W., and Y.-Y.L.; formal analysis, C.-T.L., Y.-J.C., T.-Y.W., and Y.-Y.L.; investigation, C.-T.L., Y.-J.C., T.-Y.W., and Y.-Y.L.; resources, C.-T.L.; data curation, Y.-J.C. and T.-Y.W.; writing—original draft preparation, C.-T.L., Y.-J.C., T.-Y.W., and Y.-Y.L.; writing—review and editing, C.-T.L., Y.-J.C., T.-Y.W. and Y.-Y.L.; visualization, C.-T.L.; supervision, C.-T.L.; project administration, C.-T.L.; funding acquisition, C.-T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan, grant number NSTC 111-2410-H-035-059-MY3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all participants involved in the study.

Data Availability Statement

No additional data are available for this study.

Acknowledgments

We thank the reviewers for their valuable comments, which have significantly improved the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations were used in this paper:
AI	Artificial Intelligence
GPT	Generative Pre-trained Transformer
GUI	Graphical User Interface
LLM	Large Language Model
TESOL	Teaching English to Speakers of Other Languages
NLP	Natural Language Processing
OBE	Outcome-Based Education
ESP	English for Specific Purposes
EFL	English as a Foreign Language
CALL	Computer-Assisted Language Learning
DDL	Data-Driven Learning
AWE	Automated Writing Evaluation
CDA	Computerized Dynamic Assessment
ITSs	Intelligent Tutoring Systems
ASR	Automatic Speech Recognition
ELT	English Language Teaching
AIEd	Artificial Intelligence in Education
GenAI	Generative Artificial Intelligence
UTAUT	Unified Theory of Acceptance and Use of Technology

References

1. Garzón, J.; Lampropoulos, G.; Burgos, D. Effects of mobile learning in English language learning: A meta-analysis and research synthesis. Electronics 2023, 12, 1595.
2. Semana, I.L.; Darong, H.C.; Menggo, S. Self-regulated learning method through smartphone assistance in promoting speaking ability. J. Lang. Teach. Res. 2022, 13, 772–780.
3. Meniado, J.C. Human-machine collaboration in language education in the age of artificial intelligence. RELC J. 2024, 55, 291–295.
4. Jantanukul, W. Empowering communities through lifelong learning: A case study of university initiatives for social engagement and personal development. J. Educ. Learn. Res. 2024, 1, 45–58.
5. Lu, C.-T.; Lu, Y.-Y.; Lu, Y.-R.; Pan, Y.-C.; Liu, Y.-C. Implementation of an AI English-speaking interactive training system using multi-model neural networks. IEEE Access 2025, 13, 132052–132066.
6. D-ID. D-ID Official Website. Available online: https://www.d-id.com/ (accessed on 20 December 2025).
7. Ericsson, E.; Johansson, S. English speaking practice with conversational AI: Lower secondary students’ educational experiences over time. Comput. Educ. Artif. Intell. 2023, 5, 100164.
8. Kobashikawa, S.; Odakura, A.; Nakamura, T.; Mori, T.; Endo, K.; Moriya, T.; Masumura, R.; Aono, Y.; Minematsu, N. Does speaking training application with speech recognition motivate junior high school students in actual classroom?—A case study. In Proceedings of the 8th ISCA Workshop on Speech and Language Technology in Education (SLaTE), Graz, Austria, 20–21 September 2019; pp. 119–123. Available online: https://www.isca-archive.org/slate_2019/kobashikawa19_slate.html (accessed on 20 December 2025).
9. Jawaid, A.; Batool, M.; Arshad, W.; ul Haq, M.I.; Kaur, P.; Sanaullah, S. AI and English language learning outcomes. Contemp. J. Soc. Sci. Rev. 2025, 3, 927–935.
10. Jawaid, A. Benchmarking in TESOL: A study of the Malaysia education blueprint 2013. Engl. Lang. Teach. 2014, 7, 23–38.
11. Rohmiyati, Y. Enhancing English language learning through artificial intelligence: Opportunities, challenges and the future. DIAJAR J. Pendidik. Pembelajaran 2025, 4, 8–16.
12. Fitria, T.N. The use of technology based on artificial intelligence in English teaching and learning. ELT Echo 2021, 6, 213–223.
13. Wei, L. Artificial intelligence in language instruction: Impact on English learning achievement, L2 motivation, and self-regulated learning. Front. Psychol. 2023, 14, 1261955.
14. Ali, J.K.M.; Shamsan, M.A.A.; Hezam, T.A.; Mohammed, A.A.Q. Impact of ChatGPT on learning motivation: Teachers and students’ voices. J. Engl. Stud. Arab. Felix 2023, 2, 41–49.
15. Rusmiyanto, R.; Huriati, N.; Fitriani, N.; Tyas, N.K.; Rofi’i, A.; Sari, M.N. The role of artificial intelligence in developing English language learners’ communication skills. J. Educ. Online (JoE) 2023, 6, 750–757.
16. Gultom, S.; Oktaviani, L. The correlation between students’ self-esteem and their English proficiency test result. J. Engl. Lang. Teach. Learn. 2022, 3, 52–57.
17. Koraishi, O. Teaching English in the age of AI: Embracing ChatGPT to optimize EFL materials and assessment. Lang. Educ. Technol. 2023, 3, 55–72. Available online: https://langedutech.com/letjournal/index.php/let/article/view/48 (accessed on 20 December 2025).
18. Baskara, R.; Mukarto. Exploring the implications of ChatGPT for language learning in higher education. Indones. J. Engl. Lang. Teach. Appl. Linguist. 2023, 7, 343–358.
19. Liu, M. Exploring the application of artificial intelligence in foreign language teaching: Challenges and future development. SHS Web Conf. 2023, 168, 03025.
20. Son, J.-B.; Ružić, N.K.; Philpott, A. Artificial intelligence technologies and applications for language learning and teaching. J. China Comput. Assist. Lang. Learn. 2025, 5, 94–112.
21. Qiao, H.; Zhao, A. Artificial intelligence-based language learning: Illuminating the impact on speaking skills and self-regulation in Chinese EFL context. Front. Psychol. 2023, 14, 1255594.
22. Ghafar, Z.N.; Salh, H.F.; Abdulrahim, M.A.; Farxha, S.S.; Arf, S.F.; Rahim, R.I. The role of artificial intelligence technology on English language learning: A literature review. Can. J. Lang. Lit. Stud. 2023, 3, 17–31.
23. Tapalova, O.; Zhiyenbayeva, N. Artificial intelligence in education: AIEd for personalised learning pathways. Electron. J. e-Learn. 2022, 20, 639–653.
24. Alier, M.; Peñalvo, F.J.G.; Camba, J.D. Generative artificial intelligence in education: From deceptive to disruptive. Int. J. Interact. Multimed. Artif. Intell. 2024, 8, 5–14.
25. Young, J.C.; Shishido, M. Investigating OpenAI’s ChatGPT potentials in generating chatbot’s dialogue for English as a foreign language learning. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 65–72.
26. Nguyen, A.; Ngo, H.N.; Hong, Y.; Dang, B.; Nguyen, B.-P.T. Ethical principles for artificial intelligence in education. Educ. Inf. Technol. 2023, 28, 4221–4241.
27. Ouyang, F.; Jiao, P. Artificial intelligence in education: The three paradigms. Comput. Educ. Artif. Intell. 2021, 2, 100020.
28. Baidoo-Anu, D.; Ansah, L.O. Education in the era of generative artificial intelligence: Understanding the potential benefits of ChatGPT in promoting teaching and learning. J. AI 2023, 7, 52–62.
29. AlAli, R.; Wardat, Y. Opportunities and challenges of integrating generative artificial intelligence in education. Int. J. Relig. 2024, 5, 784–793.
30. Nikolopoulou, K. Generative artificial intelligence in higher education: Exploring ways of harnessing pedagogical practices with the assistance of ChatGPT. Int. J. Changes Educ. 2024, 1, 103–111.
31. Alkolaly, M.; Zeid, F.; Al-Shamali, N.; Khasawneh, M.; Tashtoush, M. Comparing lecturers’ and students’ attitude towards the role of generative artificial intelligence systems in foreign language teaching and learning. Qubahan Acad. J. 2025, 5, 1–15.
32. Creely, E. Exploring the role of generative AI in enhancing language learning: Opportunities and challenges. Int. J. Changes Educ. 2024, 1, 158–167.
33. Kasimova, M. The implementation of artificial intelligence in teaching foreign languages. Ment. Enlight. Sci. Methodol. J. 2024, 5, 71–79.
34. Hu, Y. Application and potential problems of generative artificial intelligence in foreign language teaching. Int. J. New Dev. Educ. 2024, 6, 60–65.
35. McGill, R.; Tukey, J.W.; Larsen, W.A. Variations of box plots. Am. Stat. 1978, 32, 12–16.
36. García-Varela, F.; Nussbaum, M.; Mendoza, M.; Martínez-Troncoso, C.; Bekerman, Z. ChatGPT as a stable and fair tool for automated essay scoring. Educ. Sci. 2025, 15, 946.
Figure 1. System architecture of the proposed system.
Figure 2. Graphic user interface of the proposed system: (a) GUI arrangement; (b) snapshot of the proposed system.
Figure 3. Flowchart of the proposed system.
Figure 4. Graphical user interface of script generation.
Figure 5. Graphic user interface of video generation: (a) video generation interface; (b) finished video generation; and (c) generated videos.
Figure 6. Graphical mobile user interface: (a) training interface; (b) generation interface; (c) result interface; (d) sidebar.
Figure 7. Graphical mobile user interface of different scores: (a) mistake of missing words; (b) error in missing the rest of the sentence; (c) incorrect use of the preposition.
Figure 8. Snapshot of the conversation result: (a) example of level 1; (b) example of a fully correct user response; (c) example of user response error; (d) example of a user responding to an error and trying to answer it.
Figure 9. English speaking training results report: (a) learning record report; (b) learning trajectory chart.
Figure 10. The procedure for screening participants.
Figure 11. System experimental flowchart.
Figure 12. Average response time in the pre-test and post-test for all participants: (a) control group; (b) experimental group.
Figure 13. Distribution of response accuracy for the experimental and control groups in the pre-test and post-test.
Figure 14. Boxplot comparisons between pre-test and post-test for the experimental and control groups: (a) accuracy improvement; (b) response times (+ denotes an outlier).
Figure 15. Notched boxplot comparisons: (a) accuracy; (b) response time (+ denotes an outlier).
Figure 16. ANOVA analysis: (a) accuracy; (b) response time.
Table 1. Rubric table for evaluating users’ English responses. Columns indicate the scale score (actual score in parentheses).
Item | 5 (25) | 4 (20) | 3 (15) | 2 (10) | 1 (5)
Sentence naturalness | Natural and appropriate | Mostly natural | Slightly awkward | Awkward | Completely unnatural
Semantic clarity | Clear and precise | Mostly clear | Somewhat unclear | Very unclear | Unintelligible/irrelevant
Grammatical completeness | Grammatically correct | Minor errors | Noticeable errors | Major errors | Incomprehensible
Overall comprehensibility | Fully understandable | Easy to understand | Understandable with effort | Hard to understand | Not understandable
Table 2. Scoring results evaluated by a human rater (O denotes the selection of the human rater). Columns indicate the scale score (actual score in parentheses).
Item | 5 (25) | 4 (20) | 3 (15) | 2 (10) | 1 (5)
Sentence naturalness | | O | | |
Semantic clarity | O | | | |
Grammatical completeness | | O | | |
Overall comprehensibility | | O | | |
Table 3. Scoring results evaluated by ChatGPT (O denotes the selection of ChatGPT). Columns indicate the scale score (actual score in parentheses).
Item | 5 (25) | 4 (20) | 3 (15) | 2 (10) | 1 (5)
Sentence naturalness | | O | | |
Semantic clarity | O | | | |
Grammatical completeness | | O | | |
Overall comprehensibility | O | | | |
Table 4. An example of scoring results for a severely incorrect response (O denotes the selection of ChatGPT). Columns indicate the scale score (actual score in parentheses).
Item | 5 (25) | 4 (20) | 3 (15) | 2 (10) | 1 (5)
Sentence naturalness | | | | | O
Semantic clarity | | | | | O
Grammatical completeness | | | | | O
Overall comprehensibility | | | | | O