Search Results (30)

Search Parameters:
Keywords = pronunciation error

15 pages, 1359 KiB  
Article
Phoneme-Aware Hierarchical Augmentation and Semantic-Aware SpecAugment for Low-Resource Cantonese Speech Recognition
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(14), 4288; https://doi.org/10.3390/s25144288 - 9 Jul 2025
Viewed by 328
Abstract
Cantonese Automatic Speech Recognition (ASR) is hindered by tonal complexity, acoustic diversity, and a lack of labelled data. This study proposes a phoneme-aware hierarchical augmentation framework that enhances performance without additional annotation. A Phoneme Substitution Matrix (PSM), built from Montreal Forced Aligner alignments and Tacotron-2 synthesis, injects adversarial phoneme variants into both transcripts and their aligned audio segments, enlarging pronunciation diversity. Concurrently, a semantic-aware SpecAugment scheme exploits wav2vec 2.0 attention heat maps and keyword boundaries to adaptively mask informative time–frequency regions; a reinforcement-learning controller tunes the masking schedule online, forcing the model to rely on a wider context. On the Common Voice Cantonese 50 h subset, the combined strategy reduces the character error rate (CER) from 26.17% to 16.88% with wav2vec 2.0 and from 38.83% to 23.55% with Zipformer. At 100 h, the CER further drops to 4.27% and 2.32%, yielding relative gains of 32–44%. Ablation studies confirm that phoneme-level and masking components provide complementary benefits. The framework offers a practical, model-independent path toward accurate ASR for Cantonese and other low-resource tonal languages. This paper presents an intelligent sensing-oriented modeling framework for speech signals, which is suitable for deployment on edge or embedded systems to process input from audio sensors (e.g., microphones) and shows promising potential for voice-interactive terminal applications. Full article
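
For readers who want a concrete picture of what "semantic-aware" masking means in practice, the sketch below applies SpecAugment-style masks to a spectrogram while biasing time masks toward frames flagged as informative by an attention-derived saliency vector. It is a generic illustration under assumed shapes and mask sizes, not the authors' implementation; the reinforcement-learning mask scheduler described in the abstract is omitted.

```python
import numpy as np

def attention_guided_specaugment(spec, saliency, n_time_masks=2, max_t=30,
                                 n_freq_masks=2, max_f=15, rng=None):
    """Mask a log-mel spectrogram (time x freq), biasing time masks toward
    frames the attention model considers informative, in the spirit of the
    semantic-aware SpecAugment described above. Shapes, mask sizes, and the
    saliency source are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape

    # Turn the per-frame saliency (e.g., averaged wav2vec 2.0 attention weights)
    # into a sampling distribution over time frames.
    p = np.asarray(saliency, dtype=float)
    p = p / p.sum() if p.sum() > 0 else np.full(T, 1.0 / T)

    fill = spec.mean()
    for _ in range(n_time_masks):
        center = int(rng.choice(T, p=p))                 # favour informative frames
        width = int(rng.integers(1, max_t + 1))
        spec[max(0, center - width // 2): center + width // 2 + 1, :] = fill

    for _ in range(n_freq_masks):
        f0 = int(rng.integers(0, F))
        width = int(rng.integers(1, max_f + 1))
        spec[:, f0: min(F, f0 + width)] = fill

    return spec
```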

23 pages, 2410 KiB  
Article
A Semi-Automatic Framework for Practical Transcription of Foreign Person Names in Lithuanian
by Gailius Raškinis, Darius Amilevičius, Danguolė Kalinauskaitė, Artūras Mickus, Daiva Vitkutė-Adžgauskienė, Antanas Čenys and Tomas Krilavičius
Mathematics 2025, 13(13), 2107; https://doi.org/10.3390/math13132107 - 27 Jun 2025
Viewed by 268
Abstract
We present a semi-automatic framework for transcribing foreign personal names into Lithuanian, aimed at reducing pronunciation errors in text-to-speech systems. Focusing on noisy, web-crawled data, the pipeline combines rule-based filtering, morphological normalization, and manual stress annotation—the only non-automated step—to generate training data for character-level transcription models. We evaluate three approaches: a weighted finite-state transducer (WFST), an LSTM-based sequence-to-sequence model with attention, and a Transformer model optimized for character transduction. Results show that word-pair models outperform single-word models, with the Transformer achieving the best performance (19.04% WER) on a cleaned and augmented dataset. Data augmentation via word order reversal proved effective, while combining single-word and word-pair training offered limited gains. Despite filtering, residual noise persists, with 54% of outputs showing some error, though only 11% were perceptually significant. Full article
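
Since the transcription models above are compared by word error rate (19.04% WER for the best Transformer), the snippet below shows the standard Levenshtein-based WER computation. It is the generic metric definition, not the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("džonas smitas", "džonas smits"))  # 0.5
```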
(This article belongs to the Section E1: Mathematics and Computer Science)

22 pages, 5083 KiB  
Article
Intelligent Mobile-Assisted Language Learning: A Deep Learning Approach for Pronunciation Analysis and Personalized Feedback
by Fengqin Liu, Korawit Orkphol, Natthapon Pannurat, Thanat Sooknuan, Thanin Muangpool, Sanya Kuankid and Montri Phothisonothai
Inventions 2025, 10(4), 46; https://doi.org/10.3390/inventions10040046 - 24 Jun 2025
Viewed by 523
Abstract
This paper introduces an innovative mobile-assisted language-learning (MALL) system that harnesses deep learning technology to analyze pronunciation patterns and deliver real-time, personalized feedback. Drawing inspiration from how the human brain processes speech through neural pathways, our system analyzes multiple speech features using spectrograms, mel-frequency cepstral coefficients (MFCCs), and formant frequencies in a manner that mirrors the auditory cortex’s interpretation of sound. The core of our approach utilizes a convolutional neural network (CNN) to classify pronunciation patterns from user-recorded speech. To enhance the assessment accuracy and provide nuanced feedback, we integrated a fuzzy inference system (FIS) that helps learners identify and correct specific pronunciation errors. The experimental results demonstrate that our multi-feature model achieved 82.41% to 90.52% accuracies in accent classification across diverse linguistic contexts. The user testing revealed statistically significant improvements in pronunciation skills, where learners showed a 5–20% enhancement in accuracy after using the system. The proposed MALL system offers a portable, accessible solution for language learners while establishing a foundation for future research in multilingual functionality and mobile platform optimization. By combining advanced speech analysis with intuitive feedback mechanisms, this system addresses a critical challenge in language acquisition and promotes more effective self-directed learning. Full article
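
The pipeline described here (MFCC features fed to a CNN classifier) is a common pattern; a minimal sketch is shown below using librosa and Keras. The feature dimensions, layer sizes, and number of classes are placeholder assumptions, not the paper's configuration, and the fuzzy inference feedback stage is not reproduced.

```python
import librosa
import numpy as np
import tensorflow as tf

def mfcc_features(wav_path, sr=16000, n_mfcc=13, max_frames=200):
    """Load a recording and return a fixed-size MFCC matrix with a channel axis."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = mfcc[:, :max_frames]
    if mfcc.shape[1] < max_frames:                          # pad short clips
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[..., np.newaxis]

def build_cnn(n_classes, n_mfcc=13, max_frames=200):
    """Small CNN over the MFCC "image"; sizes are illustrative only."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_mfcc, max_frames, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
```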

13 pages, 1136 KiB  
Article
Machine Learning-Driven Acoustic Feature Classification and Pronunciation Assessment for Mandarin Learners
by Gulnur Arkin, Tangnur Abdukelim, Hankiz Yilahun and Askar Hamdulla
Appl. Sci. 2025, 15(11), 6335; https://doi.org/10.3390/app15116335 - 5 Jun 2025
Viewed by 429
Abstract
Based on acoustic feature analysis, this study systematically examines the differences in vowel pronunciation characteristics among Mandarin learners at various proficiency levels. A speech corpus containing samples from advanced, intermediate, and elementary learners (N = 50) and standard speakers (N = 10) was constructed, with a total of 5880 samples. Support Vector Machine (SVM) and ID3 decision tree algorithms were employed to classify vowel formant parameters (F1-F2) patterns. The results demonstrate that SVM significantly outperforms the ID3 algorithm in vowel classification, with an average accuracy of 92.09% for the three learner groups (92.38% for advanced, 92.25% for intermediate, and 91.63% for elementary), an improvement of 2.05 percentage points compared to ID3 (p < 0.05). Learners’ vowel production exhibits systematic deviations, particularly pronounced in complex vowels for the elementary group. For instance, the apical vowel “ẓ” has a deviation of 2.61 Bark (standard group: F1 = 3.39/F2 = 8.13; elementary group: F1 = 3.42/F2 = 10.74), while the advanced group’s deviations are generally less than 0.5 Bark (e.g., vowel “a” deviation is only 0.09 Bark). The difficulty of tongue position control strongly correlates with the deviation magnitude (r = 0.87, p < 0.001). This study confirms the effectiveness of objective assessment methods based on formant analysis in speech acquisition research, provides a theoretical basis for algorithm optimization in speech evaluation systems, and holds significant application value for the development of Computer-Assisted Language Learning (CALL) systems and the improvement of multi-ethnic Mandarin speech recognition technology. Full article
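
Classifying vowels from their first two formants with an SVM, as this study does, maps onto a very small amount of scikit-learn code. The sketch below uses toy (F1, F2) values loosely echoing the figures quoted in the abstract; the real corpus, labels, and hyperparameters are the paper's, not reproduced here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row per vowel token with its (F1, F2) values (here in Bark); y: vowel labels.
# The numbers below are placeholders for a real formant table.
X = np.array([[3.4, 8.1], [3.4, 10.7], [7.1, 11.2], [7.0, 11.4]])
y = np.array(["z", "z", "a", "a"])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=2)   # tiny toy split, illustration only
print("mean CV accuracy:", scores.mean())
```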
(This article belongs to the Collection Fishery Acoustics)

18 pages, 585 KiB  
Article
Improving Diacritical Arabic Speech Recognition: Transformer-Based Models with Transfer Learning and Hybrid Data Augmentation
by Haifa Alaqel and Khalil El Hindi
Information 2025, 16(3), 161; https://doi.org/10.3390/info16030161 - 20 Feb 2025
Viewed by 1555
Abstract
Diacritical Arabic (DA) refers to Arabic text with diacritical marks that guide pronunciation and clarify meanings, making their recognition crucial for accurate linguistic interpretation. These diacritical marks (short vowels) significantly influence meaning and pronunciation, and their accurate recognition is vital for the effectiveness of automatic speech recognition (ASR) systems, particularly in applications requiring high semantic precision, such as voice-enabled translation services. Despite its importance, leveraging advanced machine learning techniques to enhance ASR for diacritical Arabic has remained underexplored. A key challenge in developing DA ASR is the limited availability of training data. This study introduces a transformer-based approach leveraging transfer learning and data augmentation to address these challenges. Using a cross-lingual speech representation (XLSR) model pretrained on 53 languages, we fine-tune it on DA and integrate connectionist temporal classification (CTC) with transformers for improved performance. Data augmentation techniques, including volume adjustment, pitch shift, speed alteration, and hybrid strategies, further mitigate data limitations, significantly reducing word error rates (WER). Our methods achieve a WER of 12.17%, outperforming traditional ASR systems and setting a new benchmark for DA ASR. These findings demonstrate the potential of advanced machine learning to address longstanding challenges in DA ASR and enhance its accuracy. Full article
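
The augmentation strategies named in the abstract (volume adjustment, pitch shift, speed alteration) can be prototyped with librosa as below. The perturbation ranges are illustrative guesses, not the values used in the paper, and the XLSR fine-tuning itself is not shown.

```python
import numpy as np
import librosa

def augment(y, sr, rng=None):
    """Return volume-, pitch-, and speed-perturbed copies of a waveform,
    mirroring the kinds of augmentation described in the abstract."""
    rng = rng or np.random.default_rng()
    gain = rng.uniform(0.7, 1.3)                          # volume adjustment
    vol = np.clip(y * gain, -1.0, 1.0)
    pitch = librosa.effects.pitch_shift(y, sr=sr,         # pitch shift (semitones)
                                        n_steps=rng.uniform(-2, 2))
    speed = librosa.effects.time_stretch(y,               # speed alteration
                                         rate=rng.uniform(0.9, 1.1))
    return vol, pitch, speed
```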

16 pages, 1512 KiB  
Article
An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
by Yi Qin and Feifan Yu
Sensors 2025, 25(2), 341; https://doi.org/10.3390/s25020341 - 9 Jan 2025
Viewed by 840
Abstract
The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect, leveraging the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus reflecting the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suitable for the North Shaanxi dialect, focusing on the unique pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, enhancing recognition performance. To enhance the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer architecture. Additionally, to mitigate the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. The experimental results on our self-established voice dataset for the Northern Shaanxi coal mining industry demonstrate that the proposed Conformer–Transformer–CTC model achieves a 9.2% and 10.3% reduction in the word error rate compared to the standalone Conformer and Transformer models, respectively, confirming the advancement of our method. The next step will involve researching how to improve the performance of dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects, thereby achieving better recognition results. Full article
(This article belongs to the Section Intelligent Sensors)

20 pages, 4970 KiB  
Article
Revealing the Next Word and Character in Arabic: An Effective Blend of Long Short-Term Memory Networks and ARABERT
by Fawaz S. Al-Anzi and S. T. Bibin Shalini
Appl. Sci. 2024, 14(22), 10498; https://doi.org/10.3390/app142210498 - 14 Nov 2024
Cited by 1 | Viewed by 1342
Abstract
Arabic raw audio datasets were initially gathered to produce a corresponding signal spectrum, which was further used to extract the Mel-Frequency Cepstral Coefficients (MFCCs). The pronunciation dictionary, language model, and acoustic model were further derived from the MFCCs’ features. These output data were processed into Baidu’s Deep Speech model (ASR system) to attain the text corpus. Baidu’s Deep Speech model was implemented to precisely identify the global optimal value rapidly while preserving a low word and character discrepancy rate by attaining an excellent performance in isolated and end-to-end speech recognition. The desired outcome in this work is to forecast the next word and character in a sequential and systematic order that applies under natural language processing (NLP). This work combines the trained Arabic language model ARABERT with the potential of Long Short-Term Memory (LSTM) networks to predict the next word and character in an Arabic text. We used the pre-trained ARABERT embedding to improve the model’s capacity and, to capture semantic relationships within the language, we trained LSTM + CNN and Markov models on Arabic text data to assess the efficacy of this model. Python libraries such as TensorFlow, Pickle, Keras, and NumPy were used to effectively design our development model. We extensively assessed the model’s performance using new Arabic text, focusing on evaluation metrics like accuracy, word error rate, character error rate, BLEU score, and perplexity. The results show how well the combined LSTM + ARABERT and Markov models have outperformed the baseline models in envisaging the next word or character in the Arabic text. The accuracy rates of 64.9% for LSTM, 74.6% for ARABERT + LSTM, and 78% for Markov chain models were achieved in predicting the next word, and the accuracy rates of 72% for LSTM, 72.22% for LSTM + CNN, and 73% for ARABERT + LSTM models were achieved for the next-character prediction. This work unveils a novelty in Arabic natural language processing tasks, estimating a potential future expansion in deriving a precise next-word and next-character forecasting, which can be an efficient utility for text generation and machine translation applications. Full article
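
To make the Markov-chain component of the comparison concrete, the sketch below implements a minimal bigram next-word predictor of the kind the abstract evaluates. It is a generic baseline illustration with a toy example, not the authors' code or data.

```python
from collections import Counter, defaultdict

class BigramMarkov:
    """Minimal bigram Markov model for next-word prediction."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, sentences):
        for sent in sentences:
            tokens = sent.split()
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, word, k=3):
        # Return the k most frequent successors of `word` in the training data.
        return [w for w, _ in self.counts[word].most_common(k)]

model = BigramMarkov()
model.fit(["السلام عليكم ورحمة الله", "عليكم السلام ورحمة الله وبركاته"])
print(model.predict("ورحمة"))   # -> ['الله']
```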

17 pages, 3457 KiB  
Article
Multimodal Information Fusion and Data Generation for Evaluation of Second Language Emotional Expression
by Jun Yang, Liyan Wang, Yong Qi, Haifeng Chen and Jian Li
Appl. Sci. 2024, 14(19), 9121; https://doi.org/10.3390/app14199121 - 9 Oct 2024
Viewed by 1595
Abstract
This study aims to develop an emotion evaluation method for second language learners, utilizing multimodal information to comprehensively evaluate students’ emotional expressions. Addressing the limitations of existing emotion evaluation methods, which primarily focus on the acoustic features of speech (e.g., pronunciation, frequency, and rhythm) and often neglect the emotional expressions conveyed through voice and facial videos, this paper proposes an emotion evaluation method based on multimodal information. The method includes the following three main parts: (1) generating virtual data using a Large Language Model (LLM) and audio-driven facial video synthesis, as well as integrating the IEMOCAP dataset with self-recorded student videos and audios containing teacher ratings to construct a multimodal emotion evaluation dataset; (2) a graph convolution-based emotion feature encoding network to extract emotion features from multimodal information; and (3) an emotion evaluation network based on Kolmogorov–Arnold Networks (KAN) to compare students’ emotion features with standard synthetic data for precise evaluation. The emotion recognition method achieves an unweighted accuracy (UA) of 68.02% and an F1 score of 67.11% in experiments with the IEMOCAP dataset and TTS data. The emotion evaluation model, using the KAN network, outperforms the MLP network, with a mean squared error (MSE) of 0.811 compared to 0.943, providing a reliable tool for evaluating language learners’ emotional expressions. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

11 pages, 1212 KiB  
Article
Building a Speech Dataset and Recognition Model for the Minority Tu Language
by Shasha Kong, Chunmei Li, Chengwu Fang and Peng Yang
Appl. Sci. 2024, 14(15), 6795; https://doi.org/10.3390/app14156795 - 4 Aug 2024
Cited by 1 | Viewed by 1230
Abstract
Speech recognition technology has many applications in our daily life. However, for many low-resource languages without written forms, acquiring sufficient training data remains a significant challenge for building accurate ASR models. The Tu language, spoken by an ethnic minority group in Qinghai Province in China, is one such example. Due to the lack of written records and the great diversity in regional pronunciations, there has been little previous research on Tu-language speech recognition. This work seeks to address this research gap by creating the first speech dataset for the Tu language spoken in Huzhu County, Qinghai. We first formulated the relevant pronunciation rules for the Tu language based on linguistic analysis. Then, we constructed a new speech corpus, named HZ-TuDs, through targeted data collection and annotation. Based on the HZ-TuDs dataset, we designed several baseline sequence-to-sequence deep neural models for end-to-end Tu-language speech recognition. Additionally, we proposed a novel SA-conformer model, which combines convolutional and channel attention modules to better extract speech features. Experiments showed that our proposed SA-conformer model can significantly reduce the character error rate from 23% to 12%, effectively improving the accuracy of Tu language recognition compared to previous approaches. This demonstrates the effectiveness of our dataset construction and model design efforts in advancing speech recognition technology for this low-resource minority language. Full article
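
The SA-conformer's distinguishing ingredient is a channel attention module on top of the convolutional front end. The block below is a common squeeze-and-excitation-style realisation of channel attention over a (batch, channels, time) feature map, offered only as an assumption of what such a module typically looks like; the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention for (B, C, T) features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, T)
        weights = self.fc(x.mean(dim=2))         # squeeze over time -> (B, C)
        return x * weights.unsqueeze(-1)         # re-scale each channel
```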

15 pages, 1575 KiB  
Article
Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition
by Geon Woo Lee and Hong Kook Kim
Sensors 2024, 24(8), 2573; https://doi.org/10.3390/s24082573 - 17 Apr 2024
Viewed by 1633
Abstract
This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to leverage the linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides a pseudo-label through K-means clustering. To transfer the linguistic information, represented by pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed, which is a self-supervised contrastive loss function, and combined with an information noise contrastive estimation (infoNCE) loss function. This combined loss function prevents the SE model from overfitting to outlier samples and represents the pronunciation variability in samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both the speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the described CBPC loss function achieves a lower WER than the conventional joint training approaches. In addition, it is demonstrated that the speech quality scores of the SE model trained using the proposed training approach are higher than those of the standalone-SE model and SE models trained using conventional joint training approaches. An ablation study is also conducted to investigate the effects of different combinations of loss functions on the speech quality scores and WER. Here, it is revealed that the proposed CBPC loss function combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores. Full article
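
The infoNCE term that the proposed CBPC loss is combined with is a standard contrastive ingredient; a minimal PyTorch version is sketched below, where the positive would be a sample sharing the anchor's K-means pseudo-label. The cluster-based pairwise weighting that constitutes the paper's contribution is not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss.
    anchor: (D,), positive: (D,), negatives: (N, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature   # (1,)
    neg_logits = negatives @ anchor / temperature                         # (N,)
    logits = torch.cat([pos_logit, neg_logits])                           # (N+1,)
    # The positive sits at index 0, so the loss is cross-entropy against class 0.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```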
(This article belongs to the Special Issue Feature Papers in Intelligent Sensors 2024)

20 pages, 6480 KiB  
Article
Arabic Mispronunciation Recognition System Using LSTM Network
by Abdelfatah Ahmed, Mohamed Bader, Ismail Shahin, Ali Bou Nassif, Naoufel Werghi and Mohammad Basel
Information 2023, 14(7), 413; https://doi.org/10.3390/info14070413 - 16 Jul 2023
Cited by 7 | Viewed by 2417
Abstract
The Arabic language has always been an immense source of attraction to various people from different ethnicities by virtue of the significant linguistic legacy that it possesses. Consequently, a multitude of people from all over the world are yearning to learn it. However, people from different mother tongues and cultural backgrounds might experience some hardships regarding articulation due to the absence of some particular letters only available in the Arabic language, which could hinder the learning process. As a result, a speaker-independent and text-dependent efficient system that aims to detect articulation disorders was implemented. In the proposed system, we emphasize the prominence of “speech signal processing” in diagnosing Arabic mispronunciation using the Mel-frequency cepstral coefficients (MFCCs) as the optimum extracted features. In addition, long short-term memory (LSTM) was also utilized for the classification process. Furthermore, the analytical framework was incorporated with a gender recognition model to perform two-level classification. Our results show that the LSTM network significantly enhances mispronunciation detection along with gender recognition. The LSTM models attained an average accuracy of 81.52% in the proposed system, reflecting a high performance compared to previous mispronunciation detection systems. Full article

21 pages, 1349 KiB  
Article
A RTL Implementation of Heterogeneous Machine Learning Network for French Computer Assisted Pronunciation Training
by Yanjing Bi, Chao Li, Yannick Benezeth and Fan Yang
Appl. Sci. 2023, 13(10), 5835; https://doi.org/10.3390/app13105835 - 9 May 2023
Cited by 1 | Viewed by 1841
Abstract
Computer-assisted pronunciation training (CAPT) is a helpful method for self-directed or long-distance foreign language learning. It greatly benefits from the progress of acoustic signal processing and artificial intelligence techniques. However, in real-life applications, embedded solutions are usually desired. This paper conceives a register-transfer level (RTL) core to facilitate the pronunciation diagnostic tasks by suppressing the multicollinearity of the speech waveforms. A recently proposed heterogeneous machine learning framework is selected as the French phoneme pronunciation diagnostic algorithm. This RTL core is implemented and optimized within a very-high-level synthesis method for fast prototyping. An original French phoneme data set containing 4830 samples is used for the evaluation experiments. The experiment results demonstrate that the proposed implementation reduces the diagnostic error rate by 0.79–1.33% compared to the state-of-the-art and achieves a speedup of 10.89× relative to its CPU implementation at the same abstract level of programming languages. Full article

18 pages, 972 KiB  
Article
Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection
by Md. Anwar Hussen Wadud, Mohammed Alatiyyah and M. F. Mridha
Appl. Sci. 2023, 13(1), 109; https://doi.org/10.3390/app13010109 - 22 Dec 2022
Cited by 11 | Viewed by 3638
Abstract
A crucial element of computer-assisted pronunciation training systems (CAPT) is the mispronunciation detection and diagnostic (MDD) technique. The provided transcriptions can act as a teacher when evaluating the pronunciation quality of finite speech. The preceding texts have been entirely employed by conventional approaches, such as forced alignment and extended recognition networks, for model development or for enhancing system performance. The incorporation of earlier texts into model training has recently been attempted using end-to-end (E2E)-based approaches, and preliminary results indicate efficacy. Attention-based end-to-end models have shown lower speech recognition performance because multi-pass left-to-right forward computation constrains their practical applicability in beam search. In addition, end-to-end neural approaches are typically data-hungry, and a lack of non-native training data will frequently impair their effectiveness in MDD. To solve this problem, we provide a unique MDD technique that uses non-autoregressive (NAR) end-to-end neural models to greatly reduce estimation time while maintaining accuracy levels similar to traditional E2E neural models. In contrast, NAR models can generate parallel token sequences by accepting parallel inputs instead of left-to-right forward computation. To further enhance the effectiveness of MDD, we develop and construct a pronunciation model superimposed on our approach’s NAR end-to-end models. To test the effectiveness of our strategy against some of the best end-to-end models, we use publicly accessible L2-ARCTIC and SpeechOcean English datasets for training and testing purposes, where the proposed model shows better results than other existing models. Full article
(This article belongs to the Special Issue Deep Learning for Speech Processing)

17 pages, 2364 KiB  
Article
Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
by Zeyu Ren, Nurmemet Yolwas, Wushour Slamu, Ronghe Cao and Huiru Wang
Sensors 2022, 22(19), 7319; https://doi.org/10.3390/s22197319 - 27 Sep 2022
Cited by 12 | Viewed by 3451
Abstract
Unlike the traditional model, the end-to-end (E2E) ASR model does not require speech information such as a pronunciation dictionary, and its system is built through a single neural network and obtains performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely used in Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and using speed perturbation. To improve the performance of an E2E agglutinative language speech recognition system, we propose a new feature extractor, MSPC, which uses different sizes of convolution kernels to extract and fuse features of different scales. The experimental results show that this structure is superior to VGGnet. In addition to this, the attention module is improved. By using the CTC objective function in training and the BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset increase by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice—Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is close to the advanced E2E systems. Full article
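
The hybrid CTC/attention framework this paper builds on trains the encoder with a weighted sum of a CTC loss and the attention decoder's cross-entropy. The sketch below shows that combination in PyTorch; the weight and tensor layouts are illustrative, and the paper's MSPC front end and BERT-initialised language model are not reproduced.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(log_probs, targets, input_lengths, target_lengths,
                              decoder_logits, decoder_targets, lam=0.3):
    """Weighted combination used in hybrid CTC/attention training:
        L = lam * L_CTC + (1 - lam) * L_attention
    log_probs: (T, B, V) log-softmax encoder outputs; decoder_logits: (B, U, V)."""
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    att = F.cross_entropy(decoder_logits.transpose(1, 2),   # (B, V, U)
                          decoder_targets,                    # (B, U)
                          ignore_index=-100)
    return lam * ctc + (1.0 - lam) * att
```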
(This article belongs to the Section Intelligent Sensors)

18 pages, 833 KiB  
Review
Code-Switching in Automatic Speech Recognition: The Issues and Future Directions
by Mumtaz Begum Mustafa, Mansoor Ali Yusoof, Hasan Kahtan Khalaf, Ahmad Abdel Rahman Mahmoud Abushariah, Miss Laiha Mat Kiah, Hua Nong Ting and Saravanan Muthaiyah
Appl. Sci. 2022, 12(19), 9541; https://doi.org/10.3390/app12199541 - 23 Sep 2022
Cited by 13 | Viewed by 7938
Abstract
Code-switching (CS) in spoken language is where the speech has two or more languages within an utterance. It is an unsolved issue in automatic speech recognition (ASR) research as ASR needs to recognise speech in bilingual and multilingual settings, where the accuracy of ASR systems declines with CS due to pronunciation variation. There are very few reviews carried out on CS, with none conducted on bilingual and multilingual CS ASR systems. This study investigates the importance of CS in bilingual and multilingual speech recognition systems. To meet the objective of this study, two research questions were formulated, which cover both the current issues and the direction of the research. Our review focuses on databases, acoustic and language modelling, and evaluation metrics. Using selected keywords, this research has identified 274 papers and selected 42 experimental papers for review, of which 24 (representing 57%) have discussed CS, while the rest look at multilingual ASR research. The selected papers cover many well-resourced and under-resourced languages, and novel techniques to manage CS in ASR systems, which are mapping, combining and merging the phone sets of the languages experimented with in the research. Our review also examines the performance of those methods. This review found a significant variation in the performance of CS in terms of word error rates, indicating an inconsistency in the ability of ASRs to handle CS. In the conclusion, we suggest several future directions that address the issues identified in this review. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
