Article
Peer-Review Record

Developing an AI-Powered Pronunciation Application to Improve English Pronunciation of Thai ESP Learners

Languages 2025, 10(11), 273; https://doi.org/10.3390/languages10110273
by Jiraporn Lao-un and Dararat Khampusaen *
Reviewer 1: Anonymous
Reviewer 3:
Submission received: 9 June 2025 / Revised: 6 October 2025 / Accepted: 14 October 2025 / Published: 28 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

 

This study reports on the effects of an AI-powered Computer-Assisted Pronunciation Training (CAPT) intervention. The study is situated in contrastive analysis (i.e., sounds that are difficult for Thai learners), Skill Acquisition Theory (SAT), and ESP. The manuscript is well written in certain sections but lacks clarity about what the researchers actually did in others. Furthermore, it has minimal implications for ESP and very few for SAT. The literature review could be enhanced as well. The foundational issues are outlined below, followed by an itemized list of specific comments.

Substantial and important information is missing about the study’s measurement, including details about the test, the number of items per phoneme, the raters, rater training, the rating design, rater consistency, and how the resulting ratings were used in the study. There is a section that discusses sampling five participants for comparison with ASR, but no quantitative results. Furthermore, a more common statistical approach with speakers and raters (especially in longitudinal designs) is linear mixed-effects models. In sum, a major overhaul of the measurement section is needed, with an eye to providing enough detail for the study to be replicated.

Second, the theoretical issue of phonetic perception and production is missing. The intervention developed in this study included both, but no theory is cited nor are the intervention or measures defined clearly enough to understand the potential relationship between the two.

Third, please clarify what is meant by “AI.” While ASR applications of deep learning can most certainly be considered AI, they should be introduced as such, with clarification that ASR is being used. Specifically, if any additional steps beyond the standard Google ASR API are used, those should be highlighted in the introduction and throughout the paper.

Here are a few line-by-line comments:

Line 111 – The way this line reads suggests that Skill Acquisition Theory first came to SLA in 2007. The more common citation would be DeKeyser’s 1997 SSLA paper.

Line 159 – Please cite and use the concepts of DeKeyser as well when discussing the predictions of learning based on SAT.

Line 175 – Dennis (2024) is missing from the references list

Lines 178, 181 – Please check the punctuation

Line 191 – Please elaborate on which comparisons are being made (what type of AI and which traditional methods)

Line 196 – Yang and Chang (2024) is missing from the references list

Line 198 – Kang et al. (2024) is missing from the references list

Line 199 – Lee et al. is missing year from in-text citation and references list entry

Line 212 – The comment about CAPT being ubiquitous is originally attributed to Cucchiarini. Please review foundational work by Neri, Cucchiarini, and Strik on the accessibility of CAPT.

Line 219 – Kim et al. missing year from in-text citation and references list entry

Line 225 – These seem like a rather haphazard collection of study reports on CAPT. Some of these are commercial or public domain products, but others were one-time research interventions. Please clarify the purpose of including each study. It may be more effective to remove the table as it does not show how each study contributes to the present study.

CAPT review section – It seems remiss not to mention Mahdi and Al Khateeb’s (2019) meta-analysis on CAPT.

Line 322 – Please signpost to the description of the pre-test. It is impossible to have any confidence that the statistical test is meaningful without knowing what the test is.

Line 326 – Some key details are missing about the English pronunciation test, such as the format of the items. Was this a read-aloud test? If so, how was it scored? Was it a listening test? How were the items divided among the nine target consonant phonemes? Were they all presented in initial, medial, and final positions?

Line 413 – How was ASR recalibration completed? Google’s API is one of the less commonly used CAPT technologies, but we do have some evidence of its efficacy for L2 pronunciation. See McCrocklin’s work, as well as issues identified by Inceoglu et al. (2022) https://doi.org/10.1017/S0958344022000192

Line 498 – Following APA JARS, please give the exact p value.

Line 519 – Please use the field-specific effect sizes outlined in Plonsky and Oswald (2014).
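For reference, Plonsky and Oswald (2014) propose L2-research benchmarks of roughly d = 0.40/0.70/1.00 (small/medium/large) for between-groups contrasts and 0.60/1.00/1.40 for within-groups (pre/post) contrasts. A minimal sketch of interpreting an observed d against those benchmarks (the function name and example values are illustrative, not from the manuscript):

```python
def interpret_d(d, design="between"):
    """Interpret Cohen's d against Plonsky & Oswald's (2014) L2-research
    benchmarks: between-groups 0.40/0.70/1.00, within-groups (pre/post)
    0.60/1.00/1.40 for small/medium/large."""
    small, medium, large = {"between": (0.40, 0.70, 1.00),
                            "within": (0.60, 1.00, 1.40)}[design]
    d = abs(d)
    if d < small:
        return "negligible"
    if d < medium:
        return "small"
    if d < large:
        return "medium"
    return "large"

# The same observed d reads differently depending on the design
print(interpret_d(0.85))            # between-groups contrast
print(interpret_d(0.85, "within"))  # pre/post contrast
```

Note that these benchmarks are deliberately more conservative than Cohen's generic 0.2/0.5/0.8 thresholds, which is the reviewer's point.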

Line 521 – It looks like template text was left in.

(I am largely skipping the results section as this is not interpretable without additional information of the process completed that took data from audio files to raters to a dataset to statistical results.)

Line 547 – It is quite dangerous to look only at the mean differences after conducting the statistical test. The large SD of the control group means that there is little confidence in the effect size. It would be wise to also compute CIs of Cohen’s d. In sum, the numbers presented here do not convince me that there is a difference between the experimental and control groups.
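Confidence intervals around Cohen's d can be obtained from summary statistics alone. A minimal sketch using the pooled SD and the common normal-approximation standard error (a noncentral-t interval would be more exact; the summary values below are hypothetical, not the study's data):

```python
import math

def cohens_d_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Cohen's d for two independent groups with an approximate 95% CI,
    using the pooled SD and the normal-approximation standard error."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    d = (mean1 - mean2) / sd_pooled
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

# Hypothetical post-test summary stats: experimental vs control
d, lo, hi = cohens_d_ci(78.0, 8.0, 30, 70.0, 14.0, 30)
print(round(d, 2), round(lo, 2), round(hi, 2))
```

With a small sample and a large control-group SD, the interval is wide, which is exactly the concern above: a d near 0.70 whose CI spans from near-negligible to large is weak evidence of a group difference.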

Line 603 – The experimental group’s effect size is not substantially larger.

Line 620 – Without knowing more about how the pronunciation test was administered, there is no support in the results specific to SAT. Did the test attempt to determine if something was procedural/declarative/automatic? If so, how?


If the authors are able to address these issues, I would be happy to consider reviewing this article again. Barring any major threats to validity, I believe it could serve as an important contribution to SLA research.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents an interesting study which explores the effects of integrating computer-assisted pronunciation training with AI in the ESP class. The pre/post and experimental/control design is sound, well conducted in general, and provides interesting results. I have some relevant comments, which I display below.

 

Lines 44-47: Research on pronunciation teaching and learning has concluded that it must be acknowledged as a complex phenomenon (relevance of socio-emotional factors such as identity, for example). I suggest you acknowledge the complexity of this language aspect and/or mention how it was not successfully interpreted/integrated in the communicative framework, rather than just saying that drills have not worked in the past.

Some refs:

Pennington, M. C. (2021). Teaching pronunciation: The state of the art 2021. Relc Journal, 52(1), 3-21.

 

Line 53: eliminate ‘as is generally known’. Your readership in a journal such as Languages will not necessarily know this.

Lines 59-60: ‘To provide an alternative to conventional instruction’ sounds as if AI were ready to displace or substitute pronunciation instruction. Be careful about this. AI still presents many limitations in speech recognition and L2 error feedback.

Pennington, M. C., & Rogerson-Revell, P. (2018). Using technology for pronunciation teaching, learning, and assessment. In English pronunciation teaching and research: Contemporary perspectives (pp. 235-286). London: Palgrave Macmillan UK.

This claim should be softened.

Line 65: Examined → has examined

Lines 64 and 69: extra spaces?

Line 67: I would use a full stop, not a semicolon here.

Line 90: give some detail about the phonetic ‘distinctions between similar sounds’ or eliminate.

Line 93: write ‘in the Thai context’ at the end of the sentence rather than at the beginning

 

Lines 100-105: while the interplay between segmental and suprasegmental features is undeniable, it falls outside the scope of your study. I would eliminate it. It distracts the reader.

Line 119: Lee et al. requires date of publication in text and needs to go to reference list.

Line 178: Accordingly,. → Accordingly,

Lines 181-184: One of the limitations of ASR is actually L2 speech errors and not being able to identify where the pronunciation mistake is or how to fix it. I think you should be a bit more tentative in these lines and acknowledge its current limitations.

Ngo, T. T. N., Chen, H. H. J., & Lai, K. K. W. (2024). The effectiveness of automatic speech recognition in ESL/EFL pronunciation: A meta-analysis. ReCALL, 36(1), 4-21.

Liu, Y., binti Ab Rahman, F., & binti Mohamad Zain, F. (2025). A systematic literature review of research on automatic speech recognition in EFL pronunciation. Cogent Education, 12(1), 2466288.

 

Line 213: eliminate a full stop.

Line 221: Zhang et al. requires date of publication in text and needs to go to reference list.

 

Table 1: I would suggest eliminating Table 1 and asking the authors to review these studies in the text, highlighting those features which they find relevant to their own study. The table distracts the reader. Also, I find the limitations information in this table biased. There is too much information.

 

Line 285: app;ication → application

Line 312: eliminate ‘the’

Line 323: You should say that you checked for normality here as well and decided to use parametric analyses.

 

Line 326: The pronunciation test should be displayed as an appendix.

Lines 342-343: what exactly are you checking here with Cronbach? Consistency among experts?
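If Cronbach's alpha is indeed being used to index consistency among the expert raters, it treats each rater as an "item" and each learner as a "case." A minimal sketch of the computation (the three rater vectors are hypothetical, purely for illustration):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of score columns (one list per rater/item)."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_vars = sum(var(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum_item_vars / var(totals))

# Hypothetical ratings from three experts scoring five learners
r1 = [4, 5, 3, 4, 5]
r2 = [4, 4, 3, 5, 5]
r3 = [5, 5, 3, 4, 4]
alpha = cronbach_alpha([r1, r2, r3])
print(round(alpha, 2))
```

Whatever alpha is measuring here (rater agreement, item homogeneity, or expert validation of materials), the manuscript should state explicitly what the rows and columns of the computation were.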

Line 348: time,. → time.

Line 378: Maybe explain what API stands for the first time you mention it.

Figure 4: the use of ‘conventional’ to describe pronunciation teaching led by the teacher seems vague. To my knowledge, you should describe this phase with key SAT elements which should be common to both practices: corrective feedback, articulatory detail, etc…

I cannot interpret the last bubbles in Figure 4 well: would they not be part of the previous one? I think this figure is well explained in the text in the Intervention phase section. It could be eliminated.

Line 486: you need to specify how many raters you had and whether they were consistent with one another. If possible, this should be quantified using ICC or Cohen’s kappa.
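For two raters making categorical (e.g., correct/incorrect) judgments, Cohen's kappa corrects raw agreement for chance agreement; ICC would be the analogue for continuous scores or more than two raters. A minimal sketch with hypothetical judgments (not the study's data):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical judgments of the same items."""
    assert len(a) == len(b)
    n = len(a)
    labels = set(a) | set(b)
    # Observed proportion of exact agreement
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical correct (1) / incorrect (0) judgments on ten tokens
rater1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater2 = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
kappa = cohens_kappa(rater1, rater2)
print(round(kappa, 2))
```

Here raw agreement is 80%, but kappa is noticeably lower once chance agreement on a skewed correct/incorrect distribution is removed, which is why a chance-corrected statistic should be reported.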

Line 488: you should say at this point that you will use both automated and human assessment methods. I only learned this later in the text.

Line 492: Table 2 does not display learner performance. Do you mean Table 3?

Table 3 and data analysis: it is not clear to me how the analyses were computed. You need to describe in detail what each human rater did: code each word for correct/incorrect? How did you calculate percentages?

Lines 501-505:  I am curious to know why you only measured consistency between the computer ratings and the human ratings on a subgroup of 5 subjects. Would it not have been interesting to perform a consistency test on the whole sample? Did you do this because human raters did not perform assessment on the whole sample? You should say so if this is the case.

Lines 572-592: for the sake of conciseness, we do not need data replicated in a manuscript. This paragraph can do without the M, SD, t, d, and p figures, given that they appear in Table 6.

Lines 615 and 616: you mention the lack of effects on these 4 fricatives here but do not elaborate. I would eliminate this from here. The reader can wait to read about the case of these 4 fricatives later in the discussion, where you present the phonological-pedagogical details of why they may not have been affected.

Line 622: in your discussion, you interpret your data as evidence that SAT-based practice has pushed the transition from declarative to procedural knowledge. However, you have not tested speech in a typical procedural context such as spontaneous speech. While your data do suggest some improvement, I wonder if they can go as far as suggesting that these learners have automatized production and no longer need to consciously monitor these fricatives. Maybe a more tentative interpretation of SAT is required here.

Lines 637-638: there is no need to highlight that the phonological differences between Thai and English are unique. This statement seems to single out this particular group of L2 speakers, but the truth is that many L2 speakers all over the world struggle with features of English.

Lines 653-687: this is the first time you mention detailed phonetic cross-linguistic differences between Thai and English fricatives. This information should be given earlier in the text. Researchers often refer to L2 sound acquisition models such as the SLM (Flege & Bohn, 2021) or PAM (Best & Tyler, 2007) to predict acquisition success. Maybe you could explore this. The reader is confronted with the cross-linguistic information too late in the manuscript right now. You need to describe it earlier, maybe in section 2.1? You mention some substitutions, but more information should be added here.

Flege, J. E., & Bohn, O.-S. (2021). The revised Speech Learning Model (SLM-r). In R. Wayland (Ed.), Second language speech learning: Theoretical and empirical progress (pp. 3–83). Cambridge University Press. https://doi.org/10.1017/9781108886901

 

Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: Commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: John Benjamins.

 

Line 714: you say in your limitations that you have not explored qualitative data. Are you sure you can claim that AI is sustaining motivation in your study?

Lines 727-733: should this limitation not be more closely linked to the SAT model? Given that you have explored such a progressive account, it seems that you should somehow connect these lines with the framework you have used.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

1. Summary of the Article’s Aims, Contributions, Strengths, and Points for Constructive Feedback

This manuscript investigates the efficacy of a custom-designed, AI-powered pronunciation application tailored to the needs of Thai ESP (English for Specific Purposes) learners, specifically targeting their production of English fricative consonants. Using a quasi-experimental design with intact classes, the study compares AI-mediated pronunciation instruction against traditional teacher-led instruction, grounded in Skill Acquisition Theory (SAT). The authors report statistically significant gains in both groups, with the AI-mediated group showing superior improvements, especially for fricatives absent from the Thai phonemic inventory.

The article makes several noteworthy contributions:

  • It addresses a specific and under-researched learner population: Thai ESP students in performance arts.
  • It applies SAT in a structured, pedagogically relevant way to pronunciation instruction.
  • It introduces and tests a customized AI application that leverages speech recognition for tailored feedback.
  • It includes robust experimental design features such as pre-/post-testing, piloting, expert-reviewed materials, and effect size reporting.

Strengths:

  • The literature review is thorough, providing theoretical and empirical support.
  • The methods section is detailed and replicable.
  • The statistical analysis is appropriate and well reported.
  • The AI application appears thoughtfully aligned with the needs of the learner population.

Constructive feedback:

  • The manuscript contains a significant number of typographical and grammatical errors that need careful proofreading.
  • Some claims regarding AI capabilities require further specification or evidence (e.g., detailed feedback vs. intelligibility judgment).
  • The authors’ reliance on SAT should be balanced with phonological theories (e.g., Speech Learning Model) to explain cross-linguistic segmental acquisition patterns.
  • More detail on learner perception, engagement, and individual variation in gains would enhance the pedagogical implications.

 

2. General Concept Comments

  • The conceptual foundation of applying Skill Acquisition Theory to pronunciation training is strong, but the paper could benefit from integrating perceptual phonology frameworks to interpret differential gains across fricatives.
  • The argument that declarative knowledge must precede procedural development in fricative production needs to be qualified or supported by empirical evidence. Learners may also develop implicit articulatory representations through exposure and usage.
  • While the paper offers a clear description of the AI system's components, there is limited information on its accuracy in detecting mispronunciations. A brief validation study of its output would strengthen confidence in the results.
  • The description of the fricative test and scoring rubric is helpful, but it is unclear how phoneme-specific scores were derived from sentence-embedded and isolated word contexts. More detail on the rater training and inter-rater reliability process would be welcome.

 

3. Specific Comments by Section

Abstract

  • The abstract accurately summarizes the study but could be edited for conciseness. Consider removing redundant phrases and clarifying “connaturalized practices.”

Introduction (Lines 26–81)

  • Strong rationale and context, but the introduction would benefit from slightly more precision in its terminology (e.g., “connaturalized” is unclear).
  • Numerous spacing and formatting inconsistencies throughout this section.

Literature Review (Lines 82–275)

  • The SAT section is well developed but lacks critical engagement with alternative models of L2 segmental learning (e.g., Flege’s SLM).
  • Suggest clarifying which tools provide “fine-tuned phonetic feedback” as not all ASR-based tools do this (comment on line 135).
  • The summary of Table 1 should explicitly state that it synthesizes empirical studies, not just tool features (comment on line 225).

Methodology (Lines 303–487)

  • Well structured and methodologically sound.
  • Provide more detail on the “diverse disciplinary backgrounds” of participants (line 313) – the current examples (dance and music) are not clearly distinct.
  • Specify that the application is iOS-only (line 398), which may affect scalability or equity.

Results (Lines 520–572)

  • The statistical analysis is clear and well contextualized. Great use of effect sizes.
  • Include mention of the greater standard deviation in the control post-test group and its possible implications for individual variation (comment on line 548).

Discussion (Lines 593–end)

  • Insightful analysis but could benefit from incorporating additional theoretical perspectives such as SLM or PAM-L2 to interpret segment-specific outcomes (comment on line 620).
  • Consider further exploring whether improvements were driven more by perceptual salience or articulatory training.
  • The discussion could also reflect more on the potential role of learner motivation and engagement, especially since gamified and self-paced tools may have boosted these factors.

 

4. Overall Recommendation

Recommendation: Minor Revisions

This article presents a well-conceived and executed study that is highly relevant to the growing intersection of language education and AI-mediated instruction. It offers strong empirical support for the benefits of tailored, tech-enhanced pronunciation instruction for Thai ESP learners. However, revisions are necessary to:

  • Improve clarity, grammar, and formatting throughout the text.
  • Add theoretical depth regarding cross-linguistic phonology.
  • Clarify methodological details about the AI system, scoring, and sample characteristics.

With these revisions, the manuscript would make a valuable contribution to the field of L2 pronunciation pedagogy and technology-assisted language learning.

Please see the attached PDF with my specific comments and edit suggestions.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

There are some minor language edits that need to be made and these are indicated in my direct comments within the attached PDF. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for addressing the issues I outlined in the previous review -- the manuscript is much improved. There is now solid theoretical justification, a much more robust methods section, and a comprehensive results section. However, in my version, there are some issues with consistency in Tables 5, 6, and 7 in terms of the presentation of mean differences and CIs. I'm not sure if that is due to the MDPI template, but these need to be clarified before publication. One other consideration that, to me, needs to be mentioned in the limitations section is that the test items seem to have been trained in the intervention. This is not a bad thing -- but it does limit the generalizability of pronunciation learning to other words and contexts, even in scripted speech. This could be connected back to the stages of Skill Acquisition Theory, if desired.

I also noticed a strange issue with the references at the bottom of page 28, and reference numbers 23 and 24 are the same. Other language issues can be addressed at the proofing stage to align with the journal's standards.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
