1. Introduction
Speaking proficiency is widely recognised as one of the most demanding skills for learners of English as a Foreign Language (EFL). Unlike receptive skills such as reading and listening, speaking requires learners to process linguistic input in real time while simultaneously managing accuracy, fluency, pronunciation, and affective factors such as confidence and anxiety [1,2]. In many EFL contexts, particularly those characterised by limited exposure to sustained or authentic interaction, learners continue to struggle to develop spoken competence despite years of formal instruction [3]. These persistent challenges have prompted renewed interest in instructional approaches that foreground learners’ access to meaningful language exposure as a foundation for oral language development.
Input-oriented approaches have long emphasised the role of comprehensible input in facilitating language acquisition, proposing that learners benefit when exposure to language precedes pressured or premature output [4,5]. From this perspective, sustained listening, reading, and interactional exposure enable learners to internalise linguistic patterns and establish form–meaning connections that support spoken production. Empirical research suggests that such approaches can contribute to gains in both fluency and accuracy in speaking [6]. However, in many instructional settings—particularly large, examination-driven classrooms—providing sufficiently rich, individualised, and frequent input remains a longstanding pedagogical constraint.
Recent advances in artificial intelligence (AI) have expanded the ways in which speaking opportunities and language exposure can be provided in EFL classrooms. Technologies such as conversational chatbots, automatic speech recognition (ASR) systems, and large language models allow learners to engage in responsive, repeatable, and low-anxiety interaction beyond the temporal and spatial limits of classroom instruction [7,8]. A growing body of empirical research has examined the use of AI-mediated tools for speaking practice, pronunciation training, and automated feedback, often reporting improvements in oral fluency, accuracy, and learner confidence [9,10,11]. Despite this expanding evidence base, existing studies vary considerably in how AI is pedagogically implemented and theoretically interpreted.
As a result, the pedagogical role of AI in EFL speaking development remains insufficiently synthesised. In particular, there is limited clarity regarding how different forms of AI-mediated speaking support relate to established constructs in second language acquisition, including learners’ engagement with language input, interaction, and affective conditions for learning. Prior studies are frequently fragmented across technologies, learner populations, and outcome measures, with many emphasising short-term performance gains or learner perceptions rather than offering integrated, theory-informed interpretations [12]. This fragmentation has constrained the field’s ability to draw coherent conclusions about how AI-mediated speaking instruction functions across instructional contexts.
This systematic review addresses this gap by synthesising empirical studies on AI-mediated instruction for EFL speaking development. The review identifies recurring instructional functions, pedagogical approaches, and learning outcomes associated with AI-supported speaking activities. Drawing on input-oriented and task-based perspectives as interpretive lenses, the review further examines how AI-mediated practices may support learners’ engagement with spoken language and oral development under different instructional conditions. Rather than advancing prescriptive claims about AI’s instructional role, the review provides an evidence-informed synthesis intended to support theoretically grounded research and principled pedagogical decision-making.
This review is informed by three complementary theoretical perspectives from second language acquisition research: input-based theory, interactionist perspectives on language learning, and sociocultural approaches to mediated learning. Input-oriented frameworks emphasise the importance of comprehensible and meaningful exposure to language as a foundation for acquisition, while interactionist accounts highlight how participation in dialogue and feedback processes supports linguistic development. Sociocultural perspectives further stress the role of mediation, scaffolding, and instructional design in shaping learning outcomes. Together, these perspectives provide a coherent interpretive framework for analysing how AI-mediated speaking environments influence learner engagement with instructional input, interaction, and affective conditions for language development.
Accordingly, the review addresses the following research questions:
RQ1: What AI technologies and pedagogical approaches have been employed to support EFL/ESL speaking development, and how are these pedagogically positioned (practice, feedback, or interaction)?
RQ2: What linguistic and affective outcomes are associated with AI-supported speaking instruction, and under what instructional conditions are these outcomes sustained?
RQ3: What affordances and limitations of AI-mediated speaking instruction emerge when interpreted through input-oriented and task-based perspectives?
From an educational perspective, understanding how AI-mediated speaking activities are designed and embedded within instructional contexts is essential for translating technological potential into sustainable classroom practice.
2. Literature Review
2.1. Speaking Development and the Input–Output Relationship
Speaking is widely recognized as one of the most demanding skills in second language learning because it requires learners to process linguistic input in real time while coordinating multiple dimensions of performance, including fluency, accuracy, pronunciation, and interactional competence [1,2]. Unlike receptive skills, speaking places immediate cognitive and affective demands on learners, often resulting in heightened anxiety and reduced willingness to communicate, particularly in EFL contexts where opportunities for authentic interaction are limited [3,8]. As a result, many learners struggle to develop spoken competence despite prolonged exposure to formal instruction.
Research in second language acquisition has long debated the relationship between input and output in speaking development. While output-oriented perspectives emphasize the role of pushed production in promoting linguistic accuracy and noticing [13], input-based perspectives argue that oral proficiency is fundamentally grounded in sustained exposure to meaningful and comprehensible input [4]. From this view, speaking emerges as a consequence of internalized linguistic knowledge rather than as its primary driver.
2.2. Input-Based Instruction as a Foundation for Oral Proficiency
Input-based instruction (IBI) builds on the assumption that learners acquire language most effectively when they are first exposed to structured, meaningful input before being required to produce output. Central to this perspective is the concept of comprehensible input, often described as language slightly beyond the learner’s current proficiency level (i + 1), which promotes acquisition through understanding rather than explicit rule learning [4].
Empirical studies have demonstrated that input-oriented approaches can support speaking development by strengthening form–meaning connections and reducing processing load during production. Processing Instruction, for example, has been shown to improve grammatical accuracy in spoken output by guiding learners to interpret linguistic forms more effectively [5]. Similarly, approaches such as Input Flood and Input Enhancement increase exposure to target forms and promote noticing, which can lead to improvements in both fluency and accuracy [6,14].
Recent research further suggests that meaning-focused and lexical input approaches can support oral fluency and spontaneous speech, while more form-focused input enhances accuracy and pronunciation control [6]. Together, these findings suggest that speaking proficiency develops most effectively when learners are given sufficient time and support to process input before engaging in output.
2.3. Technology-Enhanced Input and AI-Mediated Language Learning
Advances in educational technology have expanded the possibilities for delivering rich, repeated, and contextualized language input beyond traditional classroom constraints. Earlier forms of technology-enhanced language learning, including digital storytelling, video-based instruction, and mobile-assisted language learning (MALL), have been shown to increase learner engagement while broadening access to spoken language in meaningful contexts. These tools help address a persistent limitation of classroom-based instruction, namely learners’ restricted exposure to authentic and frequent input.
More recently, artificial intelligence has introduced a further shift in how input is delivered, personalized, and experienced. Conversational chatbots, automatic speech recognition systems, and large language model-based tools enable learners to engage in interactive dialogue, receive adaptive responses, and practice speaking in relatively low-anxiety environments. Existing studies increasingly suggest that AI-mediated speaking activities can support gains in fluency, pronunciation, and learner confidence, particularly by expanding opportunities for repeated practice and immediate feedback [15,16,17].
Recent research published in 2024–2025 provides further evidence that AI-mediated speaking support is most effective when it combines low-stakes interaction, immediate feedback, and pedagogically guided task design. Studies of AI chatbots and mobile conversational agents report improvements in speaking confidence, reduced anxiety, and greater willingness to communicate, especially among learners who may be hesitant to speak in teacher-fronted or peer-fronted settings [18,19,20]. Other recent studies show that AI-supported speech evaluation and feedback tools can improve fluency, pronunciation, speaking performance, and confidence, while also increasing motivation and classroom willingness to communicate [21,22,23,24].
Beyond learner performance outcomes, recent scholarship has also examined broader pedagogical implications of artificial intelligence for language learning and classroom practice. For example, ref. [25] discusses how AI-mediated environments influence language acquisition and linguistic development, highlighting both opportunities and emerging instructional challenges. Similarly, another study analyses EFL teachers’ perceptions of AI in relation to academic integrity and classroom pedagogy, emphasising the need for responsible and pedagogically informed integration of AI technologies. Related research on technology-supported engagement further indicates that digital innovations such as gamification can strengthen teacher–student interaction and enhance learners’ willingness to communicate in language classrooms [26,27].
Despite these promising developments, recent scholarship cautions against assuming that increased interaction time automatically produces fuller communicative development. Critical studies note that AI feedback may be inaccurate, overly generic, or too heavily focused on surface-level form. They also emphasize that pragmatic competence, discourse management, and socially situated communication still depend strongly on teacher mediation, blended pedagogy, and opportunities for transfer to human interaction [28,29,30,31,32].
These findings suggest that AI should not be viewed merely as a tool for feedback, assessment, or isolated speaking practice. Rather, it can be more productively conceptualized as a mediated input-and-interaction resource that provides learners with repeated exposure, adaptive response, and affective support. This perspective offers a stronger theoretical bridge between AI-mediated instruction and established second language acquisition frameworks, while also acknowledging that the educational value of AI depends on how it is pedagogically designed and integrated.
2.4. Conceptualizing AI as a Source of Comprehensible Input
From an input-based perspective, AI-mediated interaction can be viewed as a dynamic form of comprehensible input rather than merely a technological supplement. AI systems are capable of adjusting linguistic complexity, providing repeated exposure to lexical and grammatical patterns, and sustaining interaction over extended periods. These features closely align with key input characteristics identified in second language acquisition (SLA) research, including frequency, salience, comprehensibility, and interactional relevance [6,33].
In addition, AI-mediated environments often reduce affective barriers associated with speaking, such as fear of negative evaluation, which has been shown to inhibit oral production [3]. By lowering anxiety and increasing opportunities for risk-free interaction, AI tools may create conditions that allow learners to process input more deeply before producing speech. This suggests that improvements in speaking performance may result not only from increased practice, but from enhanced quality and accessibility of input.
In this review, AI-mediated input is understood as linguistically meaningful language exposure that is generated or shaped by AI systems and that learners must process in order to understand, interpret, or make use of it. Such input does not occur only in traditionally receptive activities, but often emerges within speaking-oriented tasks. Examples include model responses produced by conversational agents, reformulations or recast-like replies to learner output, repeated exposure to lexical and syntactic patterns across interactions, and increased perceptual salience created through feedback and repetition. Importantly, the framework does not suggest that all AI-mediated speaking activities are input-based. Rather, it proposes that many production-oriented AI tasks incorporate input functions that influence spoken development indirectly by supporting learners’ processing, noticing, and engagement with language under more affectively accessible conditions.
A comparison between earlier studies (2018–2020) and more recent research (2024–2025) reveals several important developments in AI-supported speaking instruction. Earlier studies primarily focused on automatic speech recognition systems and structured chatbot interactions designed to improve pronunciation and fluency through repetitive practice. In contrast, recent research increasingly examines generative AI tools, conversational agents, and adaptive feedback systems that enable more interactive and personalised speaking practice. Moreover, contemporary studies place greater emphasis on affective variables such as speaking anxiety, motivation, confidence, and willingness to communicate. This shift reflects a broader movement toward learner-centred and socially mediated perspectives on technology-enhanced speaking instruction.
2.5. Research Gap and Direction for the Present Review
Although recent studies demonstrate growing interest in AI-mediated speaking instruction, the literature remains conceptually dispersed and lacks a unified theoretical synthesis. Many investigations focus on short-term performance gains without explicitly linking findings to input-based theory or examining how AI-mediated interaction functions as comprehensible input. Consequently, there is limited synthesis explaining which input characteristics consistently contribute to speaking development across contexts.
To address this gap, the present systematic review examines empirical studies on AI-mediated instruction for EFL speaking development through the lens of input-based theory. By synthesizing findings across studies and proposing a conceptual framework that positions AI as a dynamic provider of comprehensible input, the review seeks to clarify the theoretical role of AI in speaking development and to inform future research and pedagogical practice.
The present review therefore addresses the following overarching research question: How does AI-mediated interaction function as a source of comprehensible input for the development of EFL speaking skills?
3. Methodology
3.1. Research Design
This study adopts a systematic review methodology conducted in full accordance with the PRISMA 2020 (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines to ensure transparency, rigour, and replicability [28]. A completed PRISMA checklist is provided as Supplementary Materials, and the study selection process is illustrated in the PRISMA flow diagram (Figure 1). The review protocol was not registered in a public registry prior to data extraction.
3.2. Search Strategy
A structured and systematic search strategy was employed to identify empirical studies examining the role of artificial intelligence (AI) in supporting English as a Foreign Language (EFL) speaking development. The search focused on studies addressing AI-mediated speaking practice, conversational agents, automated feedback systems, and related forms of technology-enhanced oral language learning.
The literature search was conducted in the Scopus database. Scopus was selected because of its broad interdisciplinary coverage, strong representation of applied linguistics, educational technology, and language education research, and its rigorous indexing standards. The use of a single database also supported consistency and transparency in the identification and screening process.
The search was conducted between November 2024 and January 2025. A combination of keywords and Boolean operators was used to retrieve relevant studies. The search terms were designed to capture three core dimensions of the review: artificial intelligence technologies, language learning context, and speaking-related outcomes. A representative search string was as follows:
(“artificial intelligence” OR AI OR chatbot OR “speech recognition” OR “generative AI” OR “large language model”) AND (“EFL” OR “ESL” OR “second language”) AND (“speaking skills” OR “oral proficiency” OR pronunciation OR fluency OR “spoken communication”)
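As a transparency aid, the three-dimension structure of this search string can be expressed programmatically. The sketch below is illustrative only: the helper `or_group` is hypothetical and not part of any database interface, and its quoting of single-word terms may differ slightly from the representative string above.

```python
# Sketch: assembling the review's Boolean search string from its three
# conceptual dimensions (AI technology, learner context, speaking outcomes).
# Helper names are illustrative, not part of the actual Scopus query syntax.

def or_group(terms):
    """Join terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

ai_terms = ["artificial intelligence", "AI", "chatbot",
            "speech recognition", "generative AI", "large language model"]
context_terms = ["EFL", "ESL", "second language"]
outcome_terms = ["speaking skills", "oral proficiency", "pronunciation",
                 "fluency", "spoken communication"]

# Combine the three OR-groups with AND, mirroring the representative string.
query = " AND ".join(or_group(g) for g in (ai_terms, context_terms, outcome_terms))
print(query)
```

Expressing the string this way makes the grouping logic explicit and reduces the risk of unbalanced parentheses when the search is adapted for other databases.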
To improve relevance, the search was limited to English-language publications and focused on peer-reviewed journal articles and conference proceedings. The review considered studies published between 2018 and 2025, a period chosen to capture both earlier work on AI-supported speaking practice and more recent developments related to generative AI and advanced conversational systems.
Following retrieval, titles and abstracts were screened to identify studies directly related to AI-supported speaking instruction or speaking-related learning outcomes in EFL or ESL contexts. Studies that appeared relevant were then examined in full text and assessed according to the inclusion and exclusion criteria described in the following section.
This search strategy was intended to provide a focused and analytically robust body of literature for examining how AI-mediated technologies support speaking development, while maintaining methodological transparency in the review process.
3.3. Study Selection Process
The initial search yielded 119 records. As all records were retrieved from a single database and screening was conducted manually, no duplicate records were identified. Title screening excluded 74 studies that did not focus on AI-mediated instruction or interaction related to EFL speaking, leaving 45 studies for abstract screening.
Abstract screening was conducted against the predefined inclusion criteria. Nine records were excluded due to the absence of speaking-related outcomes, lack of learner-focused AI intervention, assessment-only applications without instructional or feedback components, or non-empirical study designs. A total of 36 empirical studies were retained for inclusion in the qualitative synthesis. Studies by the same authors were retained as separate records where they represented distinct publications. The study selection process is illustrated in the PRISMA flow diagram (Figure 1).
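The record flow reported in this section reduces to simple arithmetic, which makes each screening stage easy to check for internal consistency. The sketch below restates the counts from the text; variable names are illustrative.

```python
# Sketch: the PRISMA-style record flow reported in Section 3.3, expressed as
# arithmetic so each screening stage can be verified for consistency.

records_identified = 119   # initial Scopus search results
excluded_by_title = 74     # not focused on AI-mediated EFL speaking
after_title = records_identified - excluded_by_title

excluded_by_abstract = 9   # no speaking outcomes, no learner-focused AI
                           # intervention, assessment-only, or non-empirical
included = after_title - excluded_by_abstract

print(after_title, included)  # 45 after title screening, 36 included
```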
3.4. Inclusion and Exclusion Criteria
The inclusion and exclusion criteria were designed to ensure alignment with the objectives of the review and to support a transparent, theory-informed synthesis of empirical research on AI-mediated speaking instruction in EFL contexts. Studies were screened based on study design, participant profile, instructional context, AI technology, and outcome focus.
Specifically, the review included empirical studies (quantitative, qualitative, or mixed-methods) investigating the use of AI-mediated instructional or interactional tools to support speaking development among EFL or ESL learners in formal or semi-formal educational settings. Eligible studies reported speaking-related outcomes such as fluency, accuracy, pronunciation, oral performance, or speaking anxiety and were published in English-language, peer-reviewed journals between 2018 and 2025.
Studies were excluded if they were non-empirical, focused exclusively on non-speaking skills, examined AI applications limited to assessment or scoring without instructional or feedback components, did not involve learner participants, or were inaccessible in full-text form.
Table 1 summarises the eligibility criteria applied during the study selection process.
3.5. Data Extraction and Analysis
Data extraction focused on key characteristics of each study, including research context, participant profile, type of AI technology, instructional design, and reported speaking-related outcomes. The extracted information was analysed using thematic synthesis, which enabled patterns to be identified across studies regarding how AI-mediated instruction has been used to support speaking development in EFL and ESL contexts.
The synthesis was guided by four analytical dimensions: (a) AI modality, (b) pedagogical integration, (c) characteristics of instructional input, and (d) speaking-related outcomes. These dimensions were used to organise findings across heterogeneous studies and to support comparison at both descriptive and interpretive levels.
Given the diversity of research designs, participant populations, and outcome measures among the included studies, formal quality appraisal or risk-of-bias scoring was not undertaken. Instead, greater emphasis was placed on patterns that recurred across multiple studies, instructional designs, and learning contexts. As a result, the synthesis advances interpretive and explanatory insights into the pedagogical role of AI-mediated speaking instruction rather than making causal or broadly generalisable claims about effectiveness.
Patterns reported in the results were derived through qualitative thematic synthesis and frequency comparison across the included studies. Recurring outcomes were identified by examining the distribution of reported linguistic and affective effects across the dataset, allowing the analysis to distinguish between consistently reported outcomes and more variable or context-dependent findings.
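A minimal sketch of this frequency-comparison step, assuming each included study is coded as a set of outcome labels. The labels and counts below are hypothetical illustrations, not data extracted from the reviewed studies.

```python
from collections import Counter

# Sketch of the frequency-comparison step: each study is coded for the
# outcomes it reports, and counts across the corpus distinguish recurring
# outcomes from context-dependent ones. Codes are hypothetical examples.
study_outcomes = [
    {"fluency", "anxiety_reduction"},
    {"pronunciation", "confidence"},
    {"fluency", "pronunciation", "anxiety_reduction"},
    {"willingness_to_communicate"},
]

counts = Counter(code for study in study_outcomes for code in study)

# An outcome reported by two or more studies counts as "recurring" here;
# the threshold is an illustrative choice, not one stated in the review.
recurring = {code for code, n in counts.items() if n >= 2}
print(sorted(recurring))
```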
This analytic approach supports a pedagogically oriented synthesis by highlighting how instructional design choices mediate the educational value of AI-supported speaking activities across diverse learning contexts.
Accordingly, the review adopts a design-sensitive and theory-building orientation. Rather than estimating aggregate effect sizes across heterogeneous studies, the synthesis aims to identify recurring pedagogical patterns and to interpret how AI-mediated speaking practices function within instructional systems. This orientation enables the review to connect empirical findings to established constructs in second language acquisition, particularly those related to instructional input, interaction, and pedagogical mediation.
3.6. Methodological Limitations
Despite its strengths, this review is subject to certain limitations. The reliance on a single database (Scopus) may have resulted in the omission of relevant studies indexed exclusively in other databases such as ERIC or Web of Science; the findings should therefore be interpreted as analytically representative of pedagogically oriented AI-mediated speaking research rather than as a comprehensive census of all AI-related speaking studies. Nevertheless, given Scopus’s broad disciplinary coverage and rigorous indexing standards, the included studies provide a robust foundation for synthesis, and future reviews may benefit from multi-database search strategies to further expand coverage.
As summarised in Table 2, the reviewed studies are characterised by a strong concentration in higher education EFL contexts and a predominant focus on affective outcomes such as anxiety reduction and learner engagement. This distribution provides important context for interpreting the speaking outcomes reported across AI modalities.
4. Results
This review synthesised findings from 36 empirical studies investigating the use of artificial intelligence (AI) to support EFL learners’ speaking development [34–67]. The studies were conducted predominantly in higher education and secondary school contexts, with strong representation from Asian EFL settings. Methodologically, the corpus comprised experimental and quasi-experimental designs, mixed-methods investigations, and qualitative case studies, reflecting the heterogeneous nature of AI-mediated language learning research.
Across the dataset, AI technologies were positioned in three principal pedagogical roles: interactional speaking partners (e.g., chatbots and large language models), feedback providers (e.g., automatic speech recognition-based systems), and hybrid instructional tools embedded within structured pedagogical frameworks such as task-based learning or blended instruction [34–67]. While most studies reported positive trends in speaking-related outcomes, the magnitude and durability of these effects varied, suggesting that the evidence should be interpreted as contextually bounded rather than universally generalisable.
Table 3 provides a structured overview of the included studies, summarising AI modality, research design, speaking focus, affective outcomes, instructional context, and publication source.
4.1. AI as Interactional Speaking Partner
A substantial subset of the reviewed studies conceptualised AI as an interactional speaking partner, enabling learners to engage in simulated dialogue through chatbots, conversational agents, or large language models [34,36,37,47]. Across these studies, AI-mediated interaction was consistently associated with increased learner engagement, enhanced speaking confidence, and greater willingness to communicate, particularly among learners who reported anxiety in human-mediated speaking contexts.
In both adolescent and university-level settings, AI-supported dialogue was linked to increased speaking frequency, longer turns at talk, and more voluntary participation [32,49]. Several studies attributed these effects to the perceived psychological safety of AI interlocutors, which appeared to reduce fear of negative evaluation and encourage risk-taking in spoken production [29,34]. These participation-related gains were especially pronounced among lower-proficiency learners.
With respect to linguistic development, improvements were most frequently reported in fluency-related dimensions, including reduced hesitation, smoother delivery, and greater continuity of speech. However, evidence for gains in interactional complexity, discourse management, and pragmatic appropriateness was less consistent. Some studies observed that AI exchanges tended to remain structurally predictable or lexically constrained, thereby limiting opportunities for negotiation of meaning or context-sensitive adaptation [16,24].
On balance, the findings indicate that AI interlocutors are effective in lowering participation barriers and increasing speaking practice. Nevertheless, linguistic development beyond fluency appears more variable and contingent on instructional framing, suggesting that interactional quantity alone does not guarantee deeper communicative competence.
4.2. AI as Feedback Provider
Another prominent strand of research conceptualised AI as a provider of automated speaking feedback, most commonly through ASR-based pronunciation tools, speech evaluation systems, and corrective feedback technologies [45,46,50,56,60].
Across these studies, learners demonstrated measurable improvements in pronunciation accuracy and fluency, particularly in segmental features, stress patterns, and speech rate. Gains in phonological control were among the most consistently reported linguistic outcomes across the dataset.
The immediacy, repeatability, and consistency of AI-generated feedback were frequently identified as key affordances, enabling repeated practice and fostering self-regulated learning [45,50,56]. In several studies, sustained exposure to ASR-mediated feedback was also associated with reductions in speaking anxiety and increased learner confidence [38,45,46].
However, the scope of linguistic development supported by automated systems appeared narrower than that observed in hybrid instructional designs. Feedback systems primarily targeted measurable phonological or fluency features, with limited attention to discourse-level competence, interactional responsiveness, or pragmatic appropriateness [50,56,60].
Taken together, the evidence indicates that AI-based feedback systems are particularly effective in supporting foundational speaking skills, especially pronunciation and fluency, but are less consistently associated with higher-order communicative development.
4.3. Hybrid and Task-Based AI-Integrated Instruction
Studies adopting hybrid instructional designs, in which AI tools were embedded within pedagogically structured tasks, reported comparatively more robust and sustained speaking outcomes than studies relying on AI interaction or feedback alone [40,42,54].
In these contexts, AI functioned as a scaffold rather than a replacement for instruction. Tools were used for preparatory rehearsal, input enhancement, guided practice, or reflective feedback within clearly sequenced learning activities. For example, AI-assisted task-based and production-oriented approaches were associated with improvements in impromptu speaking performance, pragmatic competence, and sustained learner engagement [40,42]. Mobile-assisted and blended designs further extended opportunities for structured out-of-class practice while maintaining alignment with curricular goals [41,44].
Compared to stand-alone chatbot or ASR implementations, hybrid designs more frequently reported gains beyond pronunciation and fluency, including improvements in communicative appropriateness and task performance. However, even within this group, outcome strength varied depending on the clarity of instructional integration. Where AI tools were introduced without explicit task alignment or pedagogical sequencing, gains were uneven and learner uptake inconsistent [39,54].
These findings suggest that structured pedagogical integration is associated with broader and more transferable speaking outcomes.
4.4. Affective Outcomes Associated with AI-Mediated Speaking
Affective outcomes emerged as a consistent cross-cutting theme across the reviewed studies. Reductions in speaking anxiety, increases in willingness to communicate, enhanced learner enjoyment, and improved confidence were reported in studies employing AI interlocutors, feedback systems, and hybrid instructional designs [
34,
38,
45,
50,
51,
52]. These affective gains were observed across both university and secondary contexts and were particularly pronounced among learners who initially reported high levels of speaking apprehension.
Notably, affective improvements were frequently documented even in cases where measurable linguistic gains were modest or limited to specific dimensions such as fluency or pronunciation [
45,
50]. This pattern suggests that emotional and motivational benefits may emerge independently of, or prior to, broader communicative development. Several studies further emphasised that reductions in anxiety and increased willingness to communicate were associated with learners’ perceptions of AI interlocutors as non-judgmental conversation partners and of AI-mediated practice as low-stakes and repeatable [
34,
38,
45].
At the same time, a number of investigations cautioned that affective improvements were context-sensitive and dependent on sustained exposure and instructional support [
38,
39]. Without structured opportunities to transfer AI-mediated confidence and participation to human-mediated speaking tasks, affective gains risk remaining situational rather than enduring.
4.5. Summary of Findings
Overall, the reviewed evidence indicates that AI technologies can play a meaningful supportive role in EFL speaking instruction by expanding practice opportunities, reducing affective barriers, and providing immediate feedback [
34,
36,
38,
45,
50]. However, the effectiveness of AI-mediated speaking support is neither uniform nor automatic. Studies that positioned AI within pedagogically grounded, task-oriented designs reported more consistent and transferable outcomes than those relying on AI interaction or feedback in isolation [
40,
41,
42,
54]. Conversely, research examining stand-alone chatbot or feedback implementations reported more variable gains, particularly in higher-order communicative competence [
34,
48,
52].
Accordingly, the findings support a cautious, context-aware interpretation of AI’s role in speaking development. AI appears most effective when functioning as a mediating instructional resource embedded within structured pedagogical frameworks rather than as a standalone technological solution [
40,
41,
42,
54]. Its pedagogical value therefore remains contingent on thoughtful instructional integration and sustained human guidance.
5. Discussion
5.1. Interpreting the Pedagogical Role of AI in EFL Speaking Development
The findings of this review invite a more precise interpretation of how AI-mediated tools function within EFL speaking instruction. While many reviewed studies report positive speaking-related outcomes, these effects cannot be straightforwardly attributed to AI as an autonomous instructional agent. Rather, the evidence suggests that AI most often functions as a pedagogical mediator, with its impact emerging through interaction with instructional design, learner engagement, and affective conditions. Interpreted in this way, the findings complicate technologically deterministic narratives that portray AI as inherently transformative and align with long-standing arguments in ELT and CALL, which assert that learning outcomes are shaped primarily by pedagogical orchestration rather than technological affordances alone [
68,
69,
70].
Importantly, variation in outcome magnitude and durability across the 36 studies indicates that AI does not exert a uniform or self-sustaining influence on speaking development. Studies reporting stronger and more sustained gains typically embedded AI within structured instructional sequences, whereas stand-alone AI interventions were more often associated with limited, short-term, or context-bound effects [
34,
36,
38,
45]. From a theoretical perspective, this pattern supports the view that technologies become pedagogically meaningful when normalised within instructional systems rather than introduced as external innovations. Accordingly, AI is better conceptualised not as a disruptive replacement for speaking instruction, but as a contingent instructional resource whose pedagogical value depends on alignment with learning objectives, task design, and learner mediation processes, as synthesised in the proposed mediated input framework.
To synthesise these patterns, Figure 2 presents a mediated input framework that organises how AI-supported speaking instruction functions through the interaction of pedagogical design, affective conditions, and instructional input.
5.2. AI-Mediated Interaction and the Nature of Speaking Practice
A substantial proportion of the reviewed literature conceptualises AI as an interactional speaking partner, most commonly through chatbots and large language models [
3,
13,
14,
71]. From an interactionist perspective, reported increases in speaking frequency, turn length, and willingness to communicate provide tentative support for the claim that expanded interactional opportunities may facilitate oral development by encouraging output and engagement [
9,
10]. At a surface level, these findings appear to align with interaction-based explanations of AI-supported speaking gains.
Prevailing interpretations of AI-mediated speaking instruction tend to conceptualise AI primarily as an interactional partner or a feedback mechanism, implicitly assuming that increased output opportunities or corrective feedback are sufficient drivers of speaking development. However, the synthesis presented in this review indicates that such interpretations do not adequately explain three recurring empirical patterns: (a) why affective gains often precede and exceed linguistic gains, (b) why increased interaction frequency does not consistently result in higher interactional complexity or pragmatic development, and (c) why pedagogical sequencing and task integration exert a stronger influence on outcomes than the technological sophistication of AI systems themselves. The mediated input framework addresses these explanatory gaps by repositioning AI-mediated interaction as a source of accessible, repeatable instructional input whose effectiveness is contingent on pedagogical mediation rather than interaction quantity alone.
However, closer examination reveals important theoretical constraints. While AI-mediated interaction reliably increased participation, evidence for sustained development in interactional complexity, pragmatic appropriateness, and discourse management was uneven. Several studies examining earlier AI language learning systems between 2018 and 2020 reported that AI exchanges were often lexically repetitive, structurally predictable, or limited in contingent responsiveness, thereby constraining opportunities for negotiation of meaning and interactionally driven learning [
59,
64,
65]. These earlier systems were typically based on rule-based chatbots or limited conversational architectures, which restricted the depth and variability of learner interaction.
More recent studies conducted between 2024 and 2025, however, indicate that advances in large language models and AI conversational agents have improved the naturalness and responsiveness of interaction, allowing learners to engage in longer and more varied exchanges. Nevertheless, even in these newer systems, researchers note that AI-mediated conversations may still fall short in supporting pragmatic negotiation, discourse management, and socially situated communication. In interactionist terms, therefore, increased output quantity does not necessarily correspond to qualitatively richer interactional work [
11].
This distinction is theoretically consequential. It suggests that AI-mediated dialogue may function primarily as low-stakes or preparatory interaction, supporting fluency, confidence, and willingness to communicate, without fully reproducing the sociocognitive demands of human interaction. Spoken interaction, as described in discourse and sociolinguistic research, involves emergent meaning-making, pragmatic calibration, and sensitivity to social cues that remain difficult for AI systems to simulate consistently [
12,
72]. Without pedagogical mediation, AI interaction therefore risks privileging surface-level engagement over deeper communicative competence, highlighting the limits of interaction alone as an explanatory mechanism for AI-supported speaking development.
5.3. Automated Feedback, Accuracy, and the Limits of Measurement
Another prominent strand of literature positions AI as a provider of automated speaking feedback, particularly through ASR-based pronunciation and fluency tools [
12,
68,
69]. Across studies, improvements in segmental pronunciation accuracy, speech rate, and learner confidence were frequently reported, often alongside reductions in speaking anxiety [
45,
50]. These findings suggest that AI-mediated feedback is well suited to supporting form-level aspects of spoken performance, particularly those amenable to repeated practice and self-regulated learning.
At the same time, the reviewed studies highlight a structural limitation inherent in AI-driven feedback systems: feedback is constrained to features that are computationally detectable. Consequently, discourse-level competence, pragmatic appropriateness, and interactional responsiveness remain underrepresented in both feedback provision and outcome measurement [
22,
68]. This limitation is not merely technical but theoretical, reflecting a broader misalignment between accuracy-oriented metrics and the multidimensional nature of spoken communication [
13,
14].
From an SLA perspective, these patterns echo long-standing concerns regarding the privileging of measurable accuracy gains at the expense of communicative competence [
13,
14]. While AI feedback systems appear effective in supporting foundational phonological and fluency development, they cannot substitute for human-mediated evaluation of meaning-making, pragmatic intent, and interactional appropriateness. Accordingly, AI-mediated feedback is best understood as complementary rather than comprehensive, reinforcing the need for pedagogical frameworks that integrate automated feedback with human judgement and discourse-level instruction.
5.4. Pedagogical Integration as a Key Mechanism of Effectiveness
Across the reviewed corpus, pedagogical integration emerged as a consistently differentiating factor between more effective and more limited forms of AI-mediated speaking instruction. Studies embedding AI within task-based, production-oriented, or blended instructional designs tended to report more stable and transferable speaking outcomes than those employing AI in isolation [
40,
41,
42,
54]. In these contexts, AI served clearly defined pedagogical functions, such as task rehearsal, input enhancement, or reflective feedback.
This pattern aligns closely with sociocultural perspectives that foreground mediation, scaffolding, and goal-directed activity as central to language development [
71]. By contrast, studies lacking clear pedagogical integration frequently reported uneven learner uptake and fragile outcomes [
39], underscoring that technological sophistication alone does not guarantee instructional effectiveness.
Taken together, these findings support a design-sensitive interpretation in which learning outcomes are co-constructed through the interaction of tools, tasks, learners, and instructional intent, rather than being driven by technological affordances in isolation.
5.5. Affective Gains as Enabling Conditions for Speaking Development
Affective outcomes emerged as one of the most consistently reported patterns across the reviewed studies. Reductions in speaking anxiety and increases in willingness to communicate were observed across AI interlocutor, feedback-based, and hybrid instructional designs [
34,
38,
45,
50], with particularly notable effects among lower-proficiency learners and those with prior negative speaking experiences. These findings suggest that AI-mediated environments may exert their most immediate influence at the level of affective accessibility to speaking opportunities rather than directly on higher-order communicative competence.
From an affective-filter perspective, the reviewed evidence indicates that AI-mediated environments can lower psychological barriers to participation, thereby increasing learners’ readiness to engage with spoken input and output [
4]. Across studies, reduced anxiety and enhanced willingness to communicate appeared to function as enabling conditions that facilitated sustained engagement with speaking tasks, particularly in contexts where fear of negative evaluation constrained participation [
3,
8]. In this sense, affect operates less as an outcome and more as a condition that shapes learners’ access to instructional input, interaction, and feedback.
At the same time, affective improvements were not consistently accompanied by proportional gains in higher-order communicative competence. Several studies cautioned that without opportunities to transfer AI-mediated confidence and fluency to human-mediated interaction, affective gains may remain situational and context-bound [
34,
38]. This pattern aligns with interactionist and sociocultural perspectives, which emphasise that speaking development is ultimately shaped through socially situated meaning-making rather than isolated participation [
10,
71]. Reduced anxiety may enable participation, but it does not guarantee the development of pragmatic control, discourse management, or interactional sensitivity.
5.6. Conceptual Framework: Interpreting AI as a Mediated Input Resource
Building on the patterns identified across the reviewed studies, this section presents a conceptual framework that organises how AI-mediated instruction supports EFL speaking development through the interaction of instructional input, affective conditions, and pedagogical design (see Figure 2). Rather than positioning AI as an autonomous instructional agent, the framework characterises AI, consistent with how it has been used in prior research, as a mediated input resource whose pedagogical value depends on alignment with instructional goals and task design.
At the centre of the framework is AI-mediated input characterised by adaptability, repeatability, and accessibility. Across the reviewed studies, AI tools provided learners with sustained exposure to level-appropriate linguistic input through simulated interaction, shadowing, and feedback-driven practice. This input appeared most effective when it was embedded within pedagogically structured activities, including sequenced tasks, goal-oriented speaking activities, and teacher-guided integration.
Surrounding this input component are affective conditions, particularly reduced speaking anxiety and increased willingness to communicate. Within the framework, affective support is interpreted as an enabling condition that facilitates learners’ engagement with input and interaction rather than as an instructional outcome in its own right. Lower affective barriers were associated with increased access to speaking opportunities and greater learner participation.
Together, these elements contribute to speaking development outcomes, including gains in fluency, accuracy, pronunciation, and speaking confidence. The framework also highlights that higher-order communicative competence, such as pragmatic appropriateness and discourse management, was most consistently reported when AI-mediated input was integrated into socially meaningful and pedagogically structured speaking tasks rather than used as a stand-alone practice tool.
Overall, the framework offers an integrative way of organising existing evidence on AI-mediated speaking instruction. It clarifies the pedagogical conditions under which AI-supported practices appear most effective, while also delineating the limits of technology-driven approaches when instructional mediation is weak. The framework is not intended to account for all instructional contexts, and its explanatory power appears strongest in settings where learners possess sufficient receptive proficiency to benefit from AI-mediated input and where pedagogical structures support transfer to human-mediated interaction.
6. Conclusions and Implications
This systematic review synthesised empirical research on AI-mediated instruction for EFL speaking development, with particular attention to how AI-supported practices are pedagogically designed and how they function within instructional contexts. Across the 36 studies reviewed, AI was most commonly employed as an interactional partner, a source of automated feedback, or a hybrid form of instructional support. Rather than functioning as an autonomous instructional agent, AI-mediated tools were most effective when embedded within purposeful pedagogical designs that structured learners’ engagement with spoken language.
One pattern emerging consistently from the review is that the most stable benefits of AI-mediated speaking instruction were observed in affective domains, including reductions in speaking anxiety, increased willingness to communicate, and enhanced learner confidence. While gains in fluency, pronunciation, and accuracy were frequently reported, evidence for sustained development of higher-order interactional competence was more variable. These findings suggest that AI-supported speaking activities may function primarily as preparatory or enabling environments that lower affective barriers to participation, rather than as substitutes for socially situated communicative interaction.
Importantly, the effectiveness of AI-mediated input appeared to depend less on technological features per se than on how AI tools were pedagogically integrated. Studies that embedded AI within task-based, production-oriented, or blended instructional designs tended to report more transferable and durable outcomes than those relying on stand-alone or tool-driven implementations. This design-sensitive pattern highlights the importance of instructional intent, task structure, and teacher mediation in shaping the pedagogical value of AI-supported speaking practice.
From a theoretical perspective, the review offers a clearer account of how AI-mediated speaking activities can be interpreted through input-oriented and task-based perspectives. AI environments may provide adaptive, repeatable, and affectively supportive forms of language exposure that facilitate learners’ engagement with spoken input. However, affective gains should be understood as enabling conditions rather than endpoints of acquisition, and opportunities for goal-oriented, socially meaningful interaction remain essential for the development of communicative competence.
For future research, several directions emerge from the findings of this review. First, more longitudinal investigations are required to examine whether improvements observed in AI-mediated speaking environments lead to sustained development in real communicative settings over time. Many studies included in this review measured short-term improvements in fluency, pronunciation, or learner confidence, but fewer examined whether these gains transfer to authentic human interaction in classroom or professional contexts. Second, future research should explore higher-order communicative competence, including pragmatic appropriateness, discourse management, interactional responsiveness, and negotiation of meaning. While AI tools appear effective in supporting foundational speaking skills, their role in fostering more complex communicative abilities remains less well understood. Third, greater attention should be given to teacher mediation and instructional design in AI-supported speaking environments. Investigating how teachers scaffold AI-supported activities, integrate them with classroom interaction, and guide learners’ reflection on AI-generated feedback may provide deeper insights into effective pedagogical models. Finally, future studies should address the ethical and pedagogical implications of AI-based speaking assessment, including issues related to feedback reliability, algorithmic bias, transparency of evaluation criteria, and the responsible use of automated scoring systems in language learning contexts.
For practice, the review suggests that AI is best adopted as a supportive instructional resource rather than a replacement for communicative pedagogy or teacher expertise. AI tools appear especially well suited for preparatory speaking practice, pronunciation and fluency rehearsal, anxiety-sensitive scaffolding, and extending speaking opportunities beyond classroom time. In addition, institutions and instructors should critically evaluate the ethical implications of AI-supported feedback and assessment systems, ensuring transparency, fairness, and responsible use of automated evaluation tools in language education.
Overall, this review contributes to a more nuanced understanding of the pedagogical role of AI in EFL speaking development by clarifying how instructional design mediates effectiveness and by situating AI-supported speaking practice within established educational principles. Such an approach supports more principled integration of AI into language teaching while avoiding technologically deterministic assumptions.
As AI continues to enter mainstream educational settings, design-sensitive and pedagogy-first syntheses such as this review are essential for ensuring that technological adoption remains aligned with educational rather than purely technical priorities.