Search Results (14)

Search Parameters:
Keywords = computer-assisted pronunciation training

27 pages, 6914 KB  
Article
A New Serious Game (e-SoundWay) for Learning English Phonetics
by Alfonso Lago-Ferreiro, María Ángeles Gómez-González and José Carlos López-Ardao
Multimodal Technol. Interact. 2025, 9(6), 54; https://doi.org/10.3390/mti9060054 - 4 Jun 2025
Viewed by 2340
Abstract
This paper presents the design and evaluation of e-SoundWay, a cross-platform serious game developed to improve English phonetic competence through a multimodal and narrative-driven approach. While the platform is specifically tailored to meet the needs of Spanish-speaking learners, it is adaptable for a wider range of English as a Foreign Language (EFL) users. e-SoundWay offers over 600 interactive multimedia minigames that target three core competencies: perception, production, and transcription. Learners progress along a gamified version of the Camino de Santiago, interacting with characters representing diverse English accents. A mixed-methods evaluation combining pre- and post-tests with a user experience questionnaire revealed statistically significant improvements across all domains, particularly in perception. Reduced post-test variability indicated more equitable learning outcomes. User satisfaction was high, with 64% of participants reporting satisfaction with their phonetic progress and 91% stating they would recommend the platform. These findings highlight the educational effectiveness, accessibility, and motivational value of e-SoundWay, reinforcing the role of serious games and multimodal technologies in delivering inclusive and engaging pronunciation instruction.
(This article belongs to the Special Issue Video Games: Learning, Emotions, and Motivation)

28 pages, 1093 KB  
Article
Blended Phonetic Training with HVPT Features for EFL Children: Effects on L2 Perception and Listening Comprehension
by KyungA Lee and Hyunkee Ahn
Languages 2025, 10(6), 122; https://doi.org/10.3390/languages10060122 - 26 May 2025
Viewed by 2201
Abstract
Despite being fundamental to speech processing, L2 perceptual training often receives little attention in L2 classrooms, especially among English as a Foreign Language (EFL) learners navigating complex English phonology. The current study investigates the impact of a blended phonetic training program incorporating high variability phonetic training (HVPT) features on L2 perception and listening comprehension skills in Korean elementary EFL learners. Fifty-seven learners, aged 11 to 12 years, participated in a four-week intervention program. They were trained on 13 consonant phonemes that are challenging for Korean learners, using multimedia tools for practice. Pre- and posttests assessed L2 perception and listening comprehension, and participants were grouped into three proficiency levels based on the listening comprehension tests. The results showed significant improvements in L2 perception (p = 0.01), with a small effect, and in listening comprehension (p < 0.001), with small-to-medium effects. The lower-proficiency students demonstrated the largest gains. A correlation between L2 perception and listening comprehension was observed in both the pretests (r = 0.427, p < 0.01) and the posttests (r = 0.479, p < 0.001). The findings underscore the importance of integrating explicit phonetic instruction with HVPT to enhance L2 listening skills among EFL learners.
(This article belongs to the Special Issue L2 Speech Perception and Production in the Globalized World)

20 pages, 2042 KB  
Article
Second Language (L2) Learners’ Perceptions of Online-Based Pronunciation Instruction
by Mohammadreza Dalman
Languages 2025, 10(4), 62; https://doi.org/10.3390/languages10040062 - 27 Mar 2025
Viewed by 1189
Abstract
The COVID-19 pandemic resulted in the widespread adoption of online instruction around the world. In the post-pandemic era, online teaching and learning continue to proliferate and are considered alternatives to traditional learning. The current study investigated L2 learners’ perceptions of an online pronunciation course. Sixty L2 learners, ranging in age from 18 to 60, were recruited from different intensive English programs (IEPs) across the United States and six other countries: India, Brazil, China, France, Russia, and Canada. The participants received online computer-assisted pronunciation training (CAPT) on Moodle over a period of three weeks and completed an online survey on Qualtrics. The quantitative and qualitative data collected at the end of the course showed that the learners were highly satisfied with their own performance, found the online course highly useful, and preferred it over a face-to-face pronunciation course. The findings provide pronunciation teachers with valuable insights into the design and delivery of online courses, and suggest that CAPT can effectively support asynchronous L2 pronunciation teaching.
(This article belongs to the Special Issue L2 Speech Perception and Production in the Globalized World)

18 pages, 2088 KB  
Article
After Self-Imitation Prosodic Training L2 Learners Converge Prosodically to the Native Speakers
by Elisa Pellegrino
Languages 2024, 9(1), 33; https://doi.org/10.3390/languages9010033 - 22 Jan 2024
Cited by 1 | Viewed by 4034
Abstract
Little attention is paid to prosody in second language (L2) instruction, but computer-assisted pronunciation training (CAPT) offers learners solutions for improving the perception and production of L2 suprasegmentals. In this study, we extend, with acoustic analysis, previous research showing the effectiveness of self-imitation training for the prosodic improvement of Japanese learners of Italian. In light of the increased degree of correct match between intended and perceived pragmatic functions (e.g., speech acts), we aimed to quantify the degree of prosodic convergence towards the L1 Italian speakers used as models for self-imitation training. To measure convergence, we calculated the syllable-wise difference in duration, F0 mean, and F0 max between L1 utterances and the corresponding L2 utterances produced before and after training. The results showed that after self-imitation training, L2 learners converged to the L1 speakers. The extent of the effect, however, varied with the speech act, the acoustic measure, and the distance between L1 and L2 speakers before training. Taken together, the perceptual and acoustic findings show the potential of self-imitation prosodic training as a valuable tool to help L2 learners communicate more effectively.
(This article belongs to the Special Issue Speech Analysis and Tools in L2 Pronunciation Acquisition)
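
As a concrete illustration of the convergence measure described in this abstract, here is a minimal numpy sketch, assuming syllable-aligned values have already been extracted (e.g., in Praat); the data and function are illustrative, not the study's actual code:

```python
import numpy as np

def prosodic_distance(l1_values, l2_values):
    """Mean absolute syllable-wise difference between an L1 model utterance
    and an L2 utterance for one measure (duration, F0 mean, or F0 max)."""
    a = np.asarray(l1_values, dtype=float)
    b = np.asarray(l2_values, dtype=float)
    return float(np.mean(np.abs(a - b)))

# Hypothetical syllable-level F0 means (Hz) for one utterance.
l1_f0      = [210.0, 185.0, 230.0, 170.0]
l2_f0_pre  = [250.0, 240.0, 215.0, 205.0]   # before training
l2_f0_post = [225.0, 200.0, 228.0, 182.0]   # after training

convergence = (prosodic_distance(l1_f0, l2_f0_pre)
               - prosodic_distance(l1_f0, l2_f0_post))
print(f"convergence: {convergence:+.1f} Hz")  # positive = moved closer to the L1 model
```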

24 pages, 2631 KB  
Article
The ProA Online Tool for Prosody Assessment and Its Use for the Definition of Acoustic Models for Prosodic Evaluation of L2 Spanish Learners
by Juan-María Garrido and Daniel Ortega
Languages 2024, 9(1), 28; https://doi.org/10.3390/languages9010028 - 15 Jan 2024
Viewed by 2747
Abstract
Assessment of prosody is not usually included in the evaluation of the oral expression skills of L2 Spanish learners, probably due in part to the lack of adequate materials, correctness models, and tools for carrying out this assessment. This paper describes one of the results of the ProA (Prosody Assessment) project: a web tool for the online assessment of Spanish prosody. The tool allows the online development of evaluation tests and rubrics, the completion of these tests, and their remote scoring. An example of the tool's use for research purposes is also presented. Three prosodic parameters (global energy, speech rate, F0 range) of a set of oral productions by two L2 Spanish learners, collected using the tests developed in the project, were evaluated by three L2 Spanish teachers using the web tool and the rubrics developed in the ProA project. The ratings were then compared with the acoustic analysis of the same parameters in the material, to determine to what extent the evaluators' judgements correlated with the prosodic parameters. The results may be of interest, for example, for the development of future automatic prosody assessment systems.
(This article belongs to the Special Issue Speech Analysis and Tools in L2 Pronunciation Acquisition)
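
For orientation, the three prosodic parameters used in the example could be approximated along these lines with librosa; this is a hedged sketch under assumed settings (16 kHz audio, externally supplied syllable counts), not the ProA pipeline itself:

```python
import numpy as np
import librosa

def prosodic_parameters(wav_path, n_syllables):
    """Rough global energy, speech rate, and F0 range for one recording."""
    y, sr = librosa.load(wav_path, sr=16000)

    energy = float(np.mean(librosa.feature.rms(y=y)))   # global energy (mean RMS)
    rate = n_syllables / (len(y) / sr)                  # syllables per second

    # F0 range: spread between the 5th and 95th percentiles of voiced F0.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    f0_range = float(np.percentile(f0, 95) - np.percentile(f0, 5)) if f0.size else 0.0

    return energy, rate, f0_range
```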

20 pages, 2132 KB  
Article
An Open CAPT System for Prosody Practice: Practical Steps towards Multilingual Setup
by John Blake, Natalia Bogach, Akemi Kusakari, Iurii Lezhenin, Veronica Khaustova, Son Luu Xuan, Van Nhi Nguyen, Nam Ba Pham, Roman Svechnikov, Andrey Ostapchuk, Dmitrei Efimov and Evgeny Pyshkin
Languages 2024, 9(1), 27; https://doi.org/10.3390/languages9010027 - 12 Jan 2024
Cited by 4 | Viewed by 3254
Abstract
This paper discusses the challenges posed in creating a Computer-Assisted Pronunciation Training (CAPT) environment for multiple languages. By selecting one language from each of three different language families, we show that a single environment may be tailored to cater for different target languages. We detail the challenges faced during the development of a multimodal CAPT environment comprising a toolkit that manages mobile applications using speech signal processing, visualization, and estimation algorithms. Since the underlying mathematical and phonological models, as well as the feedback production algorithms, are based on sound signal processing and modeling rather than on particular languages, the system is language-agnostic and serves as an open toolkit for developing phrasal intonation training exercises for an open selection of languages. However, it was necessary to tailor the CAPT environment to language-specific particularities in the multilingual setups, especially the additional requirements for adequate and consistent speech evaluation and feedback production. We describe our response to the challenges of visualizing and segmenting recorded pitch signals and of modeling the language melody and rhythm necessary for such a multilingual adaptation, particularly for tonal syllable-timed and mora-timed languages.
(This article belongs to the Special Issue Speech Analysis and Tools in L2 Pronunciation Acquisition)
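
One of the challenges named here, segmenting recorded pitch signals, is commonly handled by splitting the F0 track on unvoiced gaps. A small language-agnostic sketch of that idea follows; the thresholds are illustrative assumptions, not the toolkit's values:

```python
import numpy as np

def voiced_segments(f0, hop_s=0.01, min_gap_s=0.15, min_len_s=0.08):
    """Split a frame-level F0 track (NaN = unvoiced) into voiced segments,
    bridging short unvoiced gaps and dropping very short runs.
    Returns a list of (start_time, end_time) pairs in seconds."""
    voiced = ~np.isnan(np.asarray(f0, dtype=float))
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap * hop_s >= min_gap_s:        # gap too long: close the segment
                end = i - gap + 1
                if (end - start) * hop_s >= min_len_s:
                    segments.append((start * hop_s, end * hop_s))
                start, gap = None, 0
    if start is not None:                        # close a segment at the end
        end = len(voiced) - gap
        if (end - start) * hop_s >= min_len_s:
            segments.append((start * hop_s, end * hop_s))
    return segments
```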

21 pages, 1349 KB  
Article
A RTL Implementation of Heterogeneous Machine Learning Network for French Computer Assisted Pronunciation Training
by Yanjing Bi, Chao Li, Yannick Benezeth and Fan Yang
Appl. Sci. 2023, 13(10), 5835; https://doi.org/10.3390/app13105835 - 9 May 2023
Cited by 1 | Viewed by 2013
Abstract
Computer-assisted pronunciation training (CAPT) is a helpful method for self-directed or long-distance foreign language learning, and it benefits greatly from progress in acoustic signal processing and artificial intelligence techniques. However, in real-life applications, embedded solutions are usually desired. This paper conceives a register-transfer level (RTL) core to facilitate pronunciation diagnostic tasks by suppressing the multicollinearity of the speech waveforms. A recently proposed heterogeneous machine learning framework is selected as the French phoneme pronunciation diagnostic algorithm. This RTL core is implemented and optimized within a very-high-level synthesis method for fast prototyping. An original French phoneme dataset containing 4830 samples is used for the evaluation experiments. The results demonstrate that the proposed implementation reduces the diagnostic error rate by 0.79–1.33% compared to the state of the art and achieves a speedup of 10.89× relative to its CPU implementation at the same abstraction level of programming languages.
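
For background on the multicollinearity point: one common software-side way to suppress collinearity among acoustic feature dimensions is an orthogonalizing transform such as PCA. The sketch below shows only that generic idea; it is not the paper's heterogeneous framework or its RTL design:

```python
import numpy as np

def decorrelate(features, k):
    """Project frame-level feature vectors (n_frames x n_dims) onto their
    top-k principal components, yielding mutually uncorrelated columns."""
    X = features - features.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)              # ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:k]]     # top-k directions
    return X @ top

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 13)) @ rng.standard_normal((13, 13))  # collinear toy features
Z = decorrelate(X, k=8)
print(np.round(np.corrcoef(Z, rowvar=False), 2))  # off-diagonals ~ 0
```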

18 pages, 972 KB  
Article
Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection
by Md. Anwar Hussen Wadud, Mohammed Alatiyyah and M. F. Mridha
Appl. Sci. 2023, 13(1), 109; https://doi.org/10.3390/app13010109 - 22 Dec 2022
Cited by 12 | Viewed by 3962
Abstract
A crucial element of computer-assisted pronunciation training (CAPT) systems is the mispronunciation detection and diagnosis (MDD) technique. Provided transcriptions can act as a teacher when evaluating the pronunciation quality of finite speech. Conventional approaches, such as forced alignment and extended recognition networks, employ these prior texts for model development or to enhance system performance, and end-to-end (E2E) approaches have recently attempted to incorporate them into model training, with promising preliminary results. However, attention-based end-to-end models have shown lower speech recognition performance because multi-pass left-to-right forward computation constrains their practical applicability in beam search. In addition, end-to-end neural approaches are typically data-hungry, and a lack of non-native training data frequently impairs their effectiveness in MDD. To solve this problem, we provide an MDD technique that uses non-autoregressive (NAR) end-to-end neural models to greatly reduce estimation time while maintaining accuracy levels similar to traditional E2E neural models. NAR models can generate parallel token sequences by accepting parallel inputs instead of left-to-right forward computation. To further enhance the effectiveness of MDD, we develop and construct a pronunciation model superimposed on our approach's NAR end-to-end models. We test the strategy against some of the best end-to-end models on the publicly accessible L2-ARCTIC and SpeechOcean English datasets, where the proposed model shows the best results among the existing models.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
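
Whatever the recognizer architecture, the diagnosis step of MDD typically aligns the canonical phone sequence against the recognized one and labels the mismatches. A self-contained sketch of that alignment step (the phone sequences are hypothetical):

```python
def diagnose(canonical, recognized):
    """Align a canonical phone sequence with the recognized one
    (Levenshtein DP) and return substitution/deletion/insertion labels."""
    m, n = len(canonical), len(recognized)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = canonical[i - 1] != recognized[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    errors, i, j = [], m, n          # trace back to label the errors
    while i > 0 or j > 0:
        if i and j and d[i][j] == d[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1]):
            if canonical[i - 1] != recognized[j - 1]:
                errors.append(("sub", canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            errors.append(("del", canonical[i - 1], None))
            i -= 1
        else:
            errors.append(("ins", None, recognized[j - 1]))
            j -= 1
    return errors[::-1]

# Hypothetical canonical vs. recognized phones for "think":
print(diagnose(["th", "ih", "ng", "k"], ["s", "ih", "n", "k"]))
# -> [('sub', 'th', 's'), ('sub', 'ng', 'n')]
```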

17 pages, 1095 KB  
Article
Video Self-Modeling (VSM) as a Strategy to Instruct CFL Students’ Sentence-Level Stress
by Linghong Li, Martin Valcke, Linda Badan and Christoph Anderl
Sustainability 2022, 14(23), 15509; https://doi.org/10.3390/su142315509 - 22 Nov 2022
Cited by 1 | Viewed by 1949
Abstract
Sentence-level stress is one of the major means of expressing information focus in speech, and it is important for Chinese as a foreign language (CFL) learners to be able to accurately receive and send the right information in conversation. However, research related to teaching stress, especially sentence-level stress, is scarce. In this study, we investigate whether video self-modeling (VSM) is applicable to improving CFL students’ sentence-level stress. VSM, as an innovative strategy, shows only the positive targeted behavior, using videos or audios of oneself, and aims to decrease students’ frustration and the negative influence caused by failed attempts. Twelve beginning-level CFL students, forming the experimental group, received pronunciation audios edited to be correct in their own voice and used these own-voice audios to train their sentence-level stress. Another twelve advanced-level CFL students formed the control group and received traditional instructional strategies from their class teacher. The training ran for ten sessions over a period of two and a half months. Quantitative results show that with the help of VSM, CFL students’ sentence-level stress improved significantly compared to the control group, with increased scores on the pronunciation of sentence-level stress words and in all three parameters: pitch, intensity, and duration. A post-training survey revealed that the participants’ preference for using their own voice as instructional material resulted in a feeling of success and satisfaction. The findings corroborate the importance of computer-assisted language learning in the second language (L2) field and add solid evidence for using VSM in foreign-language training.

24 pages, 3512 KB  
Article
Mispronunciation Detection and Diagnosis with Articulatory-Level Feedback Generation for Non-Native Arabic Speech
by Mohammed Algabri, Hassan Mathkour, Mansour Alsulaiman and Mohamed A. Bencherif
Mathematics 2022, 10(15), 2727; https://doi.org/10.3390/math10152727 - 2 Aug 2022
Cited by 15 | Viewed by 5018
Abstract
A high-performance, versatile computer-assisted pronunciation training (CAPT) system that gives the learner immediate feedback on whether their pronunciation is correct is very helpful for learning correct pronunciation, as it allows learners to practice at any time and with unlimited repetitions, without the presence of an instructor. In this paper, we propose deep learning-based techniques to build such a CAPT system for mispronunciation detection and diagnosis (MDD) and articulatory feedback generation for non-native Arabic learners. The proposed system can locate the error in pronunciation, recognize the mispronounced phonemes, and detect the corresponding articulatory features (AFs), not only in words but also in sentences. We formulate the recognition of phonemes and corresponding AFs as a multi-label object recognition problem, where the objects are the phonemes and their AFs in a spectral image. Moreover, we investigate the use of cutting-edge neural text-to-speech (TTS) technology to generate a new corpus of high-quality speech from predefined text containing the most common substitution errors among Arabic learners. The proposed model and its various enhanced versions achieved excellent results: compared with the state-of-the-art end-to-end MDD technique, our system performed better, and fusing the proposed model with the end-to-end model improved performance further. Our best model achieved a 3.83% phoneme error rate (PER) in the phoneme recognition task, a 70.53% F1-score in the MDD task, and a 2.6% detection error rate (DER) in the AF detection task.
(This article belongs to the Special Issue Recent Advances in Artificial Intelligence and Machine Learning)
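
Casting phonemes and articulatory features as a multi-label recognition problem means each output unit acts as an independent detector trained with a per-label sigmoid loss. A minimal PyTorch sketch of such a head; the layer sizes and label inventory are invented for illustration, not the paper's network:

```python
import torch
import torch.nn as nn

N_LABELS = 40  # hypothetical: phoneme classes + articulatory features

class MultiLabelAFNet(nn.Module):
    """Tiny CNN emitting one independent logit per label for a
    spectrogram patch shaped (batch, 1, mels, frames)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, N_LABELS)

    def forward(self, x):
        return self.head(self.body(x))

model = MultiLabelAFNet()
x = torch.randn(8, 1, 64, 100)                      # batch of spectrogram patches
targets = torch.randint(0, 2, (8, N_LABELS)).float()
loss = nn.BCEWithLogitsLoss()(model(x), targets)    # sigmoid applied per label
loss.backward()
```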

28 pages, 8971 KB  
Article
An Acoustic Feature-Based Deep Learning Model for Automatic Thai Vowel Pronunciation Recognition
by Niyada Rukwong and Sunee Pongpinigpinyo
Appl. Sci. 2022, 12(13), 6595; https://doi.org/10.3390/app12136595 - 29 Jun 2022
Cited by 5 | Viewed by 3142
Abstract
In Thai, vowel mispronunciation can completely change the meaning of a word, so effective and standardized practice is essential for pronouncing words correctly as a native speaker would. Since the COVID-19 pandemic, online learning has become increasingly popular; for example, online pronunciation application systems have been introduced that use virtual teachers and evaluate students in a way similar to standardized training by a teacher in a real classroom. This research presents an online automatic computer-assisted pronunciation training (CAPT) system that uses deep learning to recognize Thai vowels in speech. The automatic CAPT is developed to address the shortage of instruction specialists and the complexity of the vowel teaching process, integrating computer techniques with linguistic theory. The deep learning model is the most significant part of the automatic CAPT, and the major challenge in Thai vowel recognition is the correct identification of Thai vowels spoken in real-world situations. A convolutional neural network (CNN) is applied and developed to classify pronounced Thai vowels. A new Thai vowel dataset was designed, collected, and examined by linguists. The optimal CNN model with Mel spectrogram (MS) input achieves the highest accuracy of 98.61%, compared with 94.44% for Mel frequency cepstral coefficients (MFCC) with the baseline long short-term memory (LSTM) model and 90.00% for MS with the baseline LSTM model.
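
The Mel spectrogram front end in a pipeline like this is standard. A short librosa sketch producing log-Mel input for a CNN classifier; the parameter values are typical defaults, not necessarily the paper's configuration:

```python
import numpy as np
import librosa

def log_mel(wav_path, sr=16000, n_mels=64):
    """Log-Mel spectrogram (n_mels x frames) for a vowel recording,
    using 25 ms windows with a 10 ms hop at 16 kHz."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# A CNN classifier then consumes batches shaped (batch, 1, n_mels, frames),
# with one softmax output per vowel class.
```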

16 pages, 1746 KB  
Article
Automatic Speech Recognition (ASR) Systems Applied to Pronunciation Assessment of L2 Spanish for Japanese Speakers
by Cristian Tejedor-García, Valentín Cardeñoso-Payo and David Escudero-Mancebo
Appl. Sci. 2021, 11(15), 6695; https://doi.org/10.3390/app11156695 - 21 Jul 2021
Cited by 17 | Viewed by 7020
Abstract
General-purpose automatic speech recognition (ASR) systems have improved in quality and are being used for pronunciation assessment. However, the assessment of isolated short utterances, such as words in minimal pairs for segmental approaches, remains an important challenge, even more so for non-native speakers. In this work, we compare the performance of our own tailored ASR system (kASR) with that of Google ASR (gASR) for the assessment of Spanish minimal-pair words produced by 33 native Japanese speakers in a computer-assisted pronunciation training (CAPT) scenario. Participants in a pre/post-test training experiment spanning four weeks were split into three groups: experimental, in-classroom, and placebo. The experimental group used the CAPT tool described in the paper, which we designed specifically for autonomous pronunciation training. A statistically significant improvement was revealed for the experimental and in-classroom groups, and moderate correlations between gASR and kASR results were obtained, in addition to strong correlations between the post-test scores of both ASR systems and the CAPT application scores at the final stages of application use. These results suggest that, in the current configuration, both ASR alternatives are valid for assessing minimal pairs in CAPT tools. Possible ways to improve our system and possibilities for future research are discussed.
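
Assessing a minimal pair with a general-purpose ASR reduces to checking which member of the pair the recognizer heard. A toy sketch of that decision logic; asr_transcribe is a placeholder for any ASR call, not the kASR or gASR API:

```python
def score_minimal_pair(audio_path, target, contrast, asr_transcribe):
    """Return 1 if the ASR hears the intended word, 0 if it hears the
    contrasting member of the minimal pair, None if neither."""
    hyp = asr_transcribe(audio_path).strip().lower()
    if hyp == target:
        return 1
    if hyp == contrast:
        return 0
    return None  # off-pair recognition: exclude or prompt a retry

# e.g., for the Spanish minimal pair pero/perro:
# score_minimal_pair("rec_001.wav", "pero", "perro", my_asr)
```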

22 pages, 1420 KB  
Article
Speech Processing for Language Learning: A Practical Approach to Computer-Assisted Pronunciation Teaching
by Natalia Bogach, Elena Boitsova, Sergey Chernonog, Anton Lamtev, Maria Lesnichaya, Iurii Lezhenin, Andrey Novopashenny, Roman Svechnikov, Daria Tsikach, Konstantin Vasiliev, Evgeny Pyshkin and John Blake
Electronics 2021, 10(3), 235; https://doi.org/10.3390/electronics10030235 - 20 Jan 2021
Cited by 40 | Viewed by 7786
Abstract
This article contributes to the discourse on how contemporary computer and information technology may help improve foreign language learning, not only by supporting a better and more flexible workflow and digitizing study materials, but also by creating completely new use cases made possible by technological improvements in signal processing algorithms. We discuss an approach and propose a holistic solution for teaching the phonological phenomena that are crucial for correct pronunciation: the phonemes; the energy and duration of syllables and pauses, which construct the phrasal rhythm; and the tone movement within an utterance, i.e., the phrasal intonation. The working prototype of the StudyIntonation computer-assisted pronunciation training (CAPT) system is a tool for mobile devices that offers a set of tasks based on a “listen and repeat” approach and gives audio-visual feedback in real time. The present work summarizes the efforts taken to enrich the current version of this CAPT tool with two new functions: phonetic transcription and rhythmic patterns of model and learner speech. Both are built on the third-party automatic speech recognition (ASR) library Kaldi, which was incorporated into the StudyIntonation signal processing core. We also examine the scope of ASR applicability within the CAPT system workflow and evaluate the Levenshtein distance between transcriptions made by human experts and those obtained automatically by our code. We developed an algorithm for rhythm reconstruction using acoustic and language ASR models. It is also shown that even with sufficiently correct production of phonemes, learners often do not produce correct phrasal rhythm and intonation; therefore, the joint training of sounds, rhythm, and intonation within a single learning environment is beneficial. To mitigate recording imperfections, voice activity detection (VAD) is applied to all processed speech recordings. The try-outs showed that StudyIntonation can create transcriptions and process rhythmic patterns, although some specific problems with connected-speech transcription were detected. Learner feedback for pronunciation assessment was also updated: a conventional mechanism based on dynamic time warping (DTW) was combined with a cross-recurrence quantification analysis (CRQA) approach, which resulted in better discriminating ability, and the CRQA metrics combined with those of DTW were shown to add to the accuracy of learner performance estimation. The major implications for computer-assisted English pronunciation teaching are discussed.
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
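
The DTW component of such feedback compares the learner's pitch contour to the model's while tolerating timing differences. A self-contained numpy sketch of a length-normalized DTW distance (the contour values are hypothetical):

```python
import numpy as np

def dtw_distance(model, learner):
    """Dynamic-time-warping distance between two 1-D pitch contours."""
    a, b = np.asarray(model, float), np.asarray(learner, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # normalize by path-length bound

model_f0   = [200, 220, 250, 230, 190]            # model contour (Hz)
learner_f0 = [205, 210, 215, 255, 235, 200, 188]  # slower learner rendition
print(f"DTW score: {dtw_distance(model_f0, learner_f0):.2f}")
```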

24 pages, 1703 KB  
Article
End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture
by Long Zhang, Ziping Zhao, Chunmei Ma, Linlin Shan, Huazhi Sun, Lifen Jiang, Shiwen Deng and Chang Gao
Sensors 2020, 20(7), 1809; https://doi.org/10.3390/s20071809 - 25 Mar 2020
Cited by 44 | Viewed by 8232
Abstract
Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning, end-to-end ASR technology has gradually matured and achieved positive practical results, providing a new opportunity to update APED algorithms. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then used for the APED task in Mandarin, with good results. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as a separate acoustic model or language model. It is convenient and straightforward, and is a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that in terms of accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and performs more strongly on the F-measure metrics, which are especially suited to the requirements of the APED task.
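
The hybrid objective behind such systems interpolates the CTC and attention losses; the paper makes the interpolation weight adaptive, but the fixed-weight baseline form L = λ·L_CTC + (1 − λ)·L_att can be sketched in PyTorch as follows (shapes and λ are illustrative):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths, lam=0.3):
    """L = lam * L_CTC + (1 - lam) * L_attention.

    ctc_log_probs: (T, batch, vocab) log-softmax output of the CTC branch
    att_logits:    (batch, L, vocab) decoder logits of the attention branch
    targets:       (batch, L) token ids, with 0 used as blank/padding
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=0)
    return lam * ctc + (1.0 - lam) * att

# Toy shapes only, to show the call:
T, B, V, L = 120, 4, 50, 12
loss = hybrid_loss(torch.randn(T, B, V).log_softmax(-1),
                   torch.randn(B, L, V),
                   torch.randint(1, V, (B, L)),
                   torch.full((B,), T, dtype=torch.long),
                   torch.randint(5, L, (B,)))
print(loss.item())
```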
