Article

N-Gram and RNN-LM Language Model Integration for End-to-End Amazigh Speech Recognition

1 The Higher School of Technology in Meknes, Moulay Ismail University, Meknes 50000, Morocco
2 Faculty of Sciences Dhar Mahraz, Sidi Mohamed Ben Abdellah University, Fes 30000, Morocco
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 164; https://doi.org/10.3390/make7040164
Submission received: 23 October 2025 / Revised: 28 November 2025 / Accepted: 3 December 2025 / Published: 10 December 2025
(This article belongs to the Section Learning)

Abstract

This work investigates how different language modeling techniques affect the performance of an end-to-end automatic speech recognition (ASR) system for the Amazigh language. A CNN-BiLSTM-CTC model enhanced with an attention mechanism was used as the baseline. During decoding, two external language models were integrated using shallow fusion: a 4-gram model built with KenLM and a recurrent neural network language model (RNN-LM) trained on the same Tifdigit corpus. Four decoding methods were compared: greedy decoding, beam search, beam search with an N-gram language model, and beam search with a compact recurrent neural network language model. Experimental results on the Tifdigit dataset reveal a clear outcome: the N-gram language model outperforms the RNN-LM, achieving a phonetic error rate (PER) of 0.0268, a relative improvement of 4.0% over the greedy baseline, which translates into an accuracy of 97.32%. This suggests that N-gram models can outperform neural approaches when lexical resources are limited but reliable. The improved N-gram approach notably outperformed both simple beam search and the RNN neural language model. This improvement is due to higher-order context modeling, optimized interpolation weights, and adaptive lexical weighting tailored to the phonotactic structure of the Amazigh language.

1. Introduction

The field of automatic speech recognition (ASR) has seen significant advancements with the emergence of end-to-end (E2E) neural architectures, such as connectionist temporal classification (CTC) and encoder–decoder models with attention mechanisms [1]. Traditional Hidden Markov Model (HMM)-based systems and recent end-to-end (E2E) architectures differ in how they represent linguistic information. HMM systems consist of the acoustic, pronunciation, and language models. In contrast, end-to-end (E2E) systems simplify the ASR pipeline by replacing separate components with a single trainable network that directly maps acoustic signals to text sequences. However, despite this architectural simplicity, the end-to-end models have often been limited in their ability to represent the linguistic richness and syntactic regularity of natural language. This limitation is particularly noticeable in contexts where resources are limited and annotated text and speech data are scarce, like the Amazigh language [2].
The integration of an external language model (LM) has been shown to be an effective way to address this limitation. Linguistic knowledge is injected during decoding through one of the existing fusion techniques, the most straightforward and widely adopted of which is shallow fusion [3]. Two primary categories of language models can be distinguished. Statistical N-gram models, such as trigrams, estimate the probability of a word based on its preceding words. Recurrent neural network language models (RNN-LMs) capture long-term dependencies beyond the fixed context windows of N-gram models, allowing for more accurate contextual modeling. Existing studies of low-resource languages such as Amazigh have focused primarily on the acoustic and architectural aspects of recognition systems, while the integration of external language models during decoding has received little attention [4,5]. Furthermore, the comparative impact of statistical and neural language models has yet to be examined. Against this backdrop, this work conducts a comparative analysis of the influence of these two language modeling paradigms, N-gram and RNN-LM, on the decoding performance of an Amazigh speech recognition system. Our objective is to determine whether the enhanced contextual modeling capability of neural language models leads to measurable improvements in recognition accuracy and whether these improvements justify the additional computational complexity.
This study aims to leverage the strengths of different language modeling approaches, which could be particularly beneficial for recognizing Amazigh speech, a linguistically rich yet underdeveloped language in the field of speech recognition.
The main contributions of this work are as follows:
  • Development of a baseline ASR system for the Amazigh language based on a convolutional neural network, bidirectional long short-term memory, and connectionist temporal classification (CNN-BiLSTM-CTC) model enhanced with an attention mechanism, designed for recognizing the 33-letter Tifinagh alphabet.
  • Compact RNN-LM architecture optimized for low-resource scenarios.
  • Integration of an external N-gram language model via shallow fusion during decoding, effectively combining acoustic and linguistic knowledge and improving recognition accuracy.
  • Integration of an external recurrent neural network language model (RNN-LM) via shallow fusion.
  • Utilization of the Tifdigit corpus, comprising 8940 speech samples recorded by 50 native speakers, as a key resource in the study of low-resource languages such as Amazigh.
  • First comprehensive comparison of four decoding methods for Amazigh ASR, including an in-depth analysis of decoding strategies and error patterns covering word-level improvements, error correction mechanisms (insertion, substitution, and deletion), and syllabic structures.
The paper is organized as follows. Section 1 is the introduction and Section 2 presents the motivation. Section 3 reviews related work on automatic speech recognition (ASR). Section 4 details the proposed CNN-BiLSTM-CTC methodology. Section 5 presents the Amazigh language. Section 6 describes the language model integration for the Amazigh speech recognition system. Section 7 reports and discusses the results, and Section 8 concludes the study and suggests future research directions.

2. Motivation

In automatic speech recognition (ASR) for low-resource languages such as Amazigh, most previous work has focused on acoustic modeling, while the integration of external language models during decoding remains limited, even though these languages are morphologically rich. Comparative evaluations of language models for such languages are also scarce.
To address this gap, we propose an in-depth analysis of statistical N-gram and neural RNN-LM language models. Our objective is to analyze the impact of these models, taken separately, on decoding accuracy, efficiency, and robustness, in order to improve the performance of ASR systems for low-resource languages.

3. Related Works

Amazigh automatic speech recognition (ASR) has evolved from conventional statistical HMM-based models to modern deep learning and transformer-based architectures.
Barkani et al. [6] developed an Amazigh speech recognition (ASR) system that recognizes isolated words using the Kaldi toolkit and HMM statistical models; this system achieved good results with small vocabularies. Subsequently, several studies have turned to deep learning architectures, such as convolutional neural networks (CNNs), which learn feature representations directly from the signal. Boulal et al. [7] constructed a Mel-spectrogram-CNN system for the Amazigh language with data augmentation to improve the robustness of the CNN models.
Boulal et al. [8] developed hybrid CNN-LSTM and CNN-CTC models that showed better performance in recognizing Amazigh numbers and commands.
Telmem et al. [9] conducted comparative studies on Amazigh speech recognition, evaluating CNN, LSTM, and hybrid CNN-LSTM architectures. The results on the Tifdigit corpus in terms of accuracy highlight the superiority of CNNs in Amazigh speech recognition, with an accuracy rate of 88% when CNNs are trained with the female category subset of data.
Another work by Telmem et al. [10] presents a comparative study with different feature extraction techniques, such as MFCCs, spectrograms, and Mel-spectrograms. Their results showed that CNN and Bi-LSTM models offer the best performance.
Most state-of-the-art studies in automatic speech recognition (ASR) for low-resource languages, such as Amazigh, have mainly focused on improving acoustic modeling, while the integration of external language models remains rare despite the morphological richness of these languages.
Guan et al. [11] proposed an integrated approach combining the deep learning models DNN, CNN, LSTM, and a transformer language model. The evaluation was carried out on three corpora (TIMIT, LibriSpeech, and Common Voice) with the aim of improving accuracy.
Anh and Sy [12] integrated a language model based on LLM approaches. The results show that the system achieved a relative reduction in WER of up to 33%, particularly in noisy contexts.
Similarly, Zhang et al. [13] integrated an LLM-type language model directly using deep fusion during training, reducing the CER from 51.1% to 17.65% with only 10 h of data.

4. Methodology

4.1. Principle of Speech Recognition (ASR)

Automatic speech recognition (ASR) aims to convert an acoustic signal (X) into a text sequence (Y) according to [14]:
\hat{Y} = \arg\max_{Y} P(Y \mid X)
Applying Bayes’ theorem, this expression can be rewritten as follows:
\hat{Y} = \arg\max_{Y} P(X \mid Y) \cdot P(Y)
where
  • P(X|Y): describes the probability of an audio signal given a sequence of words (acoustic model).
  • P(Y): expresses the probability of a sequence of words in the language (language model).

4.2. Language Models Used in This Work

The language model (LM) is an essential component of an ASR system. It plays an important role in constraining the recognition process by estimating the probability of a sequence of words in the language. It captures grammatical and syntactic regularities that help resolve acoustic ambiguities [15]. The probability of a sentence composed of M words is defined as follows:
P(Y) = \prod_{i=1}^{M} P(y_i \mid y_1, y_2, \ldots, y_{i-1})
This formulation estimates the likelihood of each word based on the words that precede it, thus ensuring linguistic consistency during decoding.

N-Gram Model

The N-gram model estimates the probability of a word using the (n − 1) preceding words. Thus, the probability of a sequence Y = y_1, \ldots, y_N is approximated by [15,16]:
P(Y) = P(y_1, y_2, \ldots, y_N) \approx \prod_{i=1}^{N} P(y_i \mid y_{i-n+1}, \ldots, y_{i-1})
where
  • n = 1: unigram model.
  • n = 2: bigram model.
  • n = 3: trigram model
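For illustration (not taken from the paper), the maximum-likelihood estimate of a trigram probability can be computed directly from corpus counts; the toy phoneme corpus and symbols below are invented for the example.

from collections import Counter

# Toy phoneme corpus; sequences are padded so that every trigram has two predecessors.
corpus = [["<s>", "<s>", "y", "ae", "ch", "</s>"],
          ["<s>", "<s>", "y", "ae", "k", "</s>"]]

trigrams = Counter(tuple(seq[i:i + 3]) for seq in corpus for i in range(len(seq) - 2))
bigrams = Counter(tuple(seq[i:i + 2]) for seq in corpus for i in range(len(seq) - 1))

def p_trigram(w, h1, h2):
    """Maximum-likelihood estimate P(w | h1 h2) = count(h1 h2 w) / count(h1 h2)."""
    denom = bigrams[(h1, h2)]
    return trigrams[(h1, h2, w)] / denom if denom else 0.0

print(p_trigram("ch", "y", "ae"))  # 0.5: 'ch' follows the context 'y ae' in one of two cases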

4.3. Connectionist Temporal Classification (CTC)

The CTC method automatically aligns audio sequences X = [x_1, x_2, \ldots, x_T] and text sequences Y = [y_1, y_2, \ldots, y_U] without manual segmentation. It introduces an empty symbol (ε) to handle silences. The probability of a transcription y resulting from an input X is [17,18]:
P(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} P(\pi \mid X)
where β is a function that merges repeated labels and removes empty symbols [19]. The CTC loss function is defined by
\mathcal{L}_{\mathrm{CTC}} = -\ln P(y \mid X)
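For readers implementing this objective, the following minimal PyTorch sketch shows how the CTC loss is typically computed; the tensor shapes and label indices are illustrative assumptions, with index 0 reserved for the blank symbol ε.

import torch
import torch.nn as nn

T, N, C = 50, 4, 34  # time steps, batch size, classes (33 letters + blank), all illustrative
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # acoustic outputs
targets = torch.randint(1, C, (N, 10))                 # label indices; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)  # -ln P(y|X), batch-averaged
loss.backward()  # gradients for the acoustic model parameters in a real training loop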

4.4. Shallow Fusion Strategy

Shallow fusion is a strategy that combines the acoustic scores of the recognition model and the probabilities of the language model during decoding [20]. The final decision is given by
\hat{Y} = \arg\max_{Y} \left[ \log P_{\mathrm{AM}}(X \mid Y) + \lambda \log P_{\mathrm{LM}}(Y) \right]
where λ is a weighting parameter that balances the contribution of the language model. This technique improves decoding reliability and reduces transcription errors.
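The sketch below (our own illustration, not the authors' implementation) shows how this criterion is typically applied when a beam-search prefix is expanded: the acoustic log-probability of each candidate phoneme is combined with a λ-weighted log-probability from the language model; the scores and the λ value are arbitrary.

import math

def fused_score(acoustic_logp, lm_prob, lam=0.25):
    """Shallow fusion score: acoustic log-probability plus a lambda-weighted LM log-probability."""
    return acoustic_logp + lam * math.log(max(lm_prob, 1e-12))

# Hypothetical expansion of the prefix 'y ae' with two competing phonemes:
# (acoustic log-probability, LM probability of the phoneme given the prefix).
candidates = {"ch": (-1.1, 0.50),
              "k":  (-0.9, 0.05)}
best = max(candidates, key=lambda ph: fused_score(*candidates[ph]))
print(best)  # 'ch': the language model outweighs the slightly better acoustic score of 'k'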

4.5. CNN-BiLSTM-CTC Architecture

A CNN-BiLSTM-CTC model enhanced with an attention mechanism designed for recognizing the 33-letter Tifinagh alphabet was used as the baseline in this work.
CNNs are particularly effective at extracting local and global features from acoustic signals, thanks to their convolutions and pooling layers, which reduce dimensionality while preserving essential spectro-temporal information. ReLU activation functions introduce the necessary non-linearity, while normalization and subsampling improve the robustness and generalization of the model [21].
The Bi-LSTM network consists of two LSTM layers oriented in opposite directions in order to model temporal dependencies: one processes the data from left to right (forward direction) and the other from right to left (backward direction). This double propagation makes it possible to better capture the relationships between successive phonemes and to reinforce the temporal consistency of the prediction [22,23].
Connectionist temporal classification (CTC) enables the model to learn phoneme–time correspondences by calculating the probability of all possible alignments and selecting the most probable sequence.
Finally, the addition of the attention mechanism allows processing to focus on the most relevant parts of a sequence and assembles relevant information to improve the overall representation [24,25,26]. In automatic speech recognition (ASR), it calculates normalized similarity scores using a SoftMax function, which highlights the most informative acoustic segments. This operation improves audio–text alignment and the robustness of the system in the face of noise and linguistic variations.

5. Amazigh Language

IRCAM proposes a writing system consisting of the following [27]:
  • 27 consonants: labial (ⴼ, ⴱ, ⵎ), dental (ⵜ, ⴷ, ⵟ, ⴹ, ⵏ, ⵔ, ⵕ, ⵍ), alveolar (ⵙ, ⵣ, ⵚ, ⵥ), palatal (ⵛ, ⵊ), velar (ⴽ, ⴳ), labiovelar (ⴽⵯ, ⴳⵯ), uvular (ⵇ, ⵅ, ⵖ), pharyngeal (ⵃ, ⵄ), and laryngeal (ⵀ);
  • 2 semi-consonants: ⵢ and ⵡ;
  • 4 vowels: the full vowels ⴰ, ⵉ, and ⵓ, and the neutral vowel (schwa) ⴻ.
Syllabic structure plays a crucial role in the phonological organization of Amazigh, which helps to maintain the regularity and fluidity of the phonetic system. A fundamental phonotactic principle underlies the segmentation of syllables in Amazigh: each syllable must contain a vowel, which forms its nucleus, and may be preceded and/or followed by one or more consonants [28,29]. V means ‘vowel alone’, CV ‘consonant + vowel’, VC ‘vowel + consonant’, CVC ‘consonant + vowel + consonant’, and CCV ‘complex onset followed by a vowel’, the latter being more marginal (Table 1).
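To make these pattern labels concrete, the following short sketch (our own simplified illustration, using a basic Latin vowel set for the romanized forms) maps a word to its CV skeleton.

VOWELS = set("aeiou")  # simplified vowel set for the romanized forms used in Table 1

def cv_pattern(word: str) -> str:
    """Map each letter to V (vowel) or C (consonant), e.g. 'yach' -> 'CVCC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

print(cv_pattern("yach"), cv_pattern("yaa"), cv_pattern("yak"))  # CVCC CVV CVC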

6. Language Model Integration for Amazigh Speech Recognition System

6.1. System Description

The proposed Amazigh speech recognition system, illustrated in Figure 1, adopts an end-to-end CNN-BiLSTM-CTC architecture enhanced with an attention mechanism and trained with the CTC loss. The input audio is converted into 80-dimensional Mel-spectrograms. The model consists of three convolutional layers (80→320 channels), a three-layer BiLSTM (320 units per direction), and a sixteen-head self-attention module, followed by a fully connected classifier (640→26).
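A minimal PyTorch sketch of this configuration is given below; hyperparameters that are not stated in the text, such as convolution kernel sizes and padding, are assumptions.

import torch
import torch.nn as nn

class CNNBiLSTMAttnCTC(nn.Module):
    def __init__(self, n_mels=80, conv_ch=320, lstm_hidden=320, n_heads=16, n_classes=26):
        super().__init__()
        # Three 1-D convolutions over the Mel-frequency axis (80 -> 320 channels).
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, conv_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Three-layer bidirectional LSTM, 320 units per direction -> 640-dimensional states.
        self.bilstm = nn.LSTM(conv_ch, lstm_hidden, num_layers=3,
                              batch_first=True, bidirectional=True, dropout=0.15)
        # Sixteen-head self-attention over the 640-dimensional BiLSTM states.
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, mel):                  # mel: (batch, n_mels, time)
        x = self.conv(mel).transpose(1, 2)   # -> (batch, time, conv_ch)
        x, _ = self.bilstm(x)                # -> (batch, time, 640)
        x, _ = self.attn(x, x, x)            # self-attention, same shape
        return self.classifier(x).log_softmax(dim=-1)  # CTC-ready log-probabilities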
Training uses the AdamW optimizer with cosine annealing, dropout = 0.15, and speed perturbation for augmentation. Two language models guide decoding: a 4-gram KenLM with Kneser–Ney smoothing and an LSTM-RNN LM (embedding = 64, hidden = 128) trained with the AdamW optimizer and label smoothing = 0.1.
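The optimization setup can be sketched as follows; the numerical values are borrowed from the RNN-LM settings in Table 3 purely for illustration, since the exact values used for the acoustic model are not reported.

import torch
import torch.nn as nn

model = nn.Linear(80, 26)  # stand-in for the CNN-BiLSTM-attention network sketched above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=5e-6)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 80)).pow(2).mean()  # placeholder loss; CTC loss in practice
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-annealed learning rate, one step per epoch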
Final decoding employs CTC beam search with shallow fusion (λ_LM = 0.25–0.30, beam = 5–8), combining acoustic and linguistic probabilities for optimal phoneme sequence prediction.

6.2. Corpus

In these experiments, we trained and tested an enhanced CNN-BiLSTM-CTC model with an attention mechanism using the Tifdigit dataset [29], which was designed to recognize the 33 letters of the Amazigh alphabet. The Tifdigit corpus includes the Amazigh alphabet (33 letters) and the first 10 Amazigh digits. Approximately fifty native Tarifit speakers from Morocco were invited to pronounce each letter and digit ten times. Consequently, the corpus contains ten repetitions of each character per speaker, yielding a total of 21,500 audio files ((33 letters + 10 digits) × 10 repetitions × 50 speakers).
The recordings were made in a noise-free environment using the same microphone, as well as the open-source software Audacity and Wavesurfer, on a Linux system. Table 2 presents the recording parameters and dataset configuration used for the preparation of the Tifdigit corpus.
To evaluate our proposed models, we used a subset of the corpus consisting of 8942 audio files representing the 33 Amazigh letters, divided into training and testing sets, comprising 82% and 18% of the data, respectively, using a balanced, stratified approach. To ensure speaker independence and prevent data leakage, all recordings from the same speaker were assigned to the same partition (either the training or testing set).
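One way to obtain such a speaker-independent split (an illustration only; the authors' exact procedure is not described beyond this paragraph) is scikit-learn's GroupShuffleSplit, grouping files by speaker identifier; the file names below are hypothetical.

from sklearn.model_selection import GroupShuffleSplit

# Hypothetical file names and labels; the speaker identifier is taken from the file-name prefix.
files = ["spk01_yach_01.wav", "spk01_yak_01.wav", "spk02_yach_01.wav", "spk02_yak_01.wav"]
labels = ["yach", "yak", "yach", "yak"]
speakers = [f.split("_")[0] for f in files]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.18, random_state=42)
train_idx, test_idx = next(splitter.split(files, labels, groups=speakers))
# All recordings of a given speaker end up on the same side of the split.
print(sorted({speakers[i] for i in train_idx} & {speakers[i] for i in test_idx}))  # []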
All experiments were conducted on a CUDA-enabled GPU using PyTorch 2.0.1, in the Google Colaboratory Jupyter notebook service with Python 3, on an Ubuntu 18.04 Lenovo laptop with a 13th Gen Intel(R) Core(TM) i7-13620H (2.40 GHz) and 16 GB of RAM.

6.3. Language Model

The language model is a key element in our strategy. It plays an essential role in improving the linguistic consistency of the output sequences generated by the acoustic model. It expresses the probability of a sequence of words in the language. In this work, we conduct a comparative analysis of the influence of two language modeling paradigms, N-gram and RNN-LM, on the decoding performance of an Amazigh speech recognition system.
The N-gram language model was trained over the 33-letter Amazigh alphabet proposed by IRCAM and employs 4-gram interpolation with smoothing α = 0.1 and weights λ4 = 0.40, λ3 = 0.30, λ2 = 0.20, and λ1 = 0.10.
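The interpolation described here can be written out as follows; this is a sketch that assumes add-α smoothing over raw count tables, which are represented by hypothetical dictionaries.

# Interpolated 4-gram probability with the weights and smoothing reported above.
LAMBDAS = {4: 0.40, 3: 0.30, 2: 0.20, 1: 0.10}
ALPHA = 0.1
VOCAB_SIZE = 33

def interpolated_prob(word, history, counts, context_counts):
    """P(word | history) as a weighted sum of add-alpha smoothed 1- to 4-gram estimates.

    counts[n] maps an n-gram tuple to its count; context_counts[n] maps the
    (n-1)-gram context to its count. Both are hypothetical lookup tables.
    """
    p = 0.0
    for n, lam in LAMBDAS.items():
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        num = counts[n].get(context + (word,), 0) + ALPHA
        den = context_counts[n].get(context, 0) + ALPHA * VOCAB_SIZE
        p += lam * (num / den)
    return p

empty = {n: {} for n in LAMBDAS}
print(interpolated_prob("ch", ["y", "ae"], empty, empty))  # uniform fallback, about 1/33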
The RNN language model, built with an LSTM layer (128 hidden units) and 64-dimensional embedding, is trained with AdamW, dropout = 0.3, and early stopping at 100 epochs.
The RNN language model has a compact architecture that is suited to low-resource environments. It consists of a single-layer, unidirectional long short-term memory (LSTM) network, an encoding module, and a fully connected classifier. Training uses the AdamW optimizer with cosine scheduling, as well as several regularization strategies such as label smoothing, dropout, and an early stopping mechanism. Training the model on heavily augmented phonemic sequences from a 33-word Amazigh lexicon resulted in stable convergence. The detailed architectural and training hyperparameters of the RNN language model are summarized in Table 3.
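A minimal PyTorch sketch consistent with Table 3 (64-dimensional embedding, a single 128-unit unidirectional LSTM, dropout 0.3, cross-entropy with label smoothing 0.1) is given below; the vocabulary size and batch dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class CompactRNNLM(nn.Module):
    def __init__(self, vocab_size=40, emb_dim=64, hidden=128, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1, batch_first=True)  # unidirectional
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):               # tokens: (batch, sequence_length)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(self.drop(h))        # logits for the next token at every position

model = CompactRNNLM()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing as in Table 3
tokens = torch.randint(0, 40, (8, 6))                  # toy batch of phoneme-index sequences
logits = model(tokens[:, :-1])
loss = criterion(logits.reshape(-1, 40), tokens[:, 1:].reshape(-1))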
A dedicated Amazigh lexicon was created manually to facilitate the decoding process presented in Table 4. It provides standardized phonetic transcriptions based on the 33-letter Amazigh alphabet defined by IRCAM.

6.4. Decoding Strategies

Four decoding strategies were used to evaluate the performance: (1) greedy decoding, which selects the most probable phoneme at each time step and provides a fast baseline; (2) beam search decoding, which maintains multiple hypotheses to improve accuracy; (3) beam search with a 4-gram language model; and (4) beam search with an RNN-LM, the latter two using optimized weights (beam width = 8 and λ = 0.25) to combine acoustic and linguistic information for enhanced recognition performance.
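Greedy decoding, the first of these strategies, can be sketched as follows: take the most probable class at each frame, collapse consecutive repetitions, and remove the blank symbol (an illustration, with random log-probabilities standing in for the model output).

import torch

def greedy_ctc_decode(log_probs, blank=0):
    """log_probs: (time, classes) CTC outputs. Collapse repeated labels, then remove blanks."""
    best = torch.argmax(log_probs, dim=-1).tolist()
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

frames = torch.log_softmax(torch.randn(50, 34), dim=-1)  # random stand-in for model output
print(greedy_ctc_decode(frames))                         # phoneme indices for one utterance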

6.5. Evaluation Metrics

The performance of the CNN-BiLSTM-CTC model was evaluated using the phone error rate (PER) and exact match accuracy, which were computed using the JiWER package [30]. PER is defined as follows:
PER = (S + D + I)/N
This measures the proportion of incorrectly recognized phonemes, where S: substitutions; D: deletions; I: insertions; and N: total phonemes. The exact match accuracy represents the percentage of sequences recognized perfectly and provides a stringent measure of transcription precision.
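This definition can be computed from a standard Levenshtein alignment; the sketch below is our own illustration, independent of the JiWER package, and counts substitutions, deletions, and insertions over phoneme lists.

def per(reference, hypothesis):
    """Phoneme error rate (S + D + I) / N via a dynamic-programming edit distance."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                            # deletions
    for j in range(m + 1):
        d[0][j] = j                            # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

print(per(["y", "ae", "ch"], ["y", "ae", "k"]))  # one substitution over three phonemes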

7. Results and Discussion

To investigate the influence of external language models on the performance of our end-to-end Amazigh ASR system, we integrated two types of language models, a 4-gram statistical model and a recurrent neural network language model (RNN-LM), via shallow fusion into the CNN-BiLSTM-CTC acoustic model during the decoding process. The training results of the acoustic model are illustrated in Figure 2. Four decoding approaches were evaluated on 1584 test utterances: greedy decoding, beam search, beam search with a 4-gram LM, and beam search with an RNN-LM.
The results in Table 5 show that the improved 4-gram language model achieved the best overall performance, with a phonetic error rate (PER) of 0.0268, representing a relative improvement of 4.0% over the greedy reference model. This translates into an accuracy of 97.32%, and an exact word match rate of 93.4%. This gain is due to higher-order context modeling, optimized interpolation weights, and adaptive lexical weighting tailored to the phonotactic structure of Amazigh. The improved N-gram approach notably outperformed simple beam search and the RNN neural language model.
The RNN-LM achieved only modest improvements, with a phonetic error rate (PER) of 0.0277, a relative improvement of +0.9% over the greedy baseline, compared with +4.0% for the N-gram model. The underperformance of the RNN-LM is primarily the result of data limitations rather than inherent algorithmic limitations. The evolution of the RNN-LM training loss is presented in Figure 3, highlighting the model's convergence behavior under limited data conditions. The 33 words in the lexicon generate only 1447 training sequences, which is insufficient for a neural model to learn robust phonotactic regularities.
Furthermore, there are certain limitations to our augmentation strategy. The 33 words in the lexicon were expanded through 15 repetitions of each sequence and by extracting the prefixes and suffixes of two to five phonemes, which were then divided into sliding windows. This approach generates a total of 1447 sequences, which is insufficient to provide the model with exploitable phonotactic diversity. Conversely, the N-gram model makes better use of an extremely limited context (less than 50 words and less than 2000 sequences), which explains its significant advantage.

7.1. Word-Level Performance Analysis

A detailed analysis by word, summarized in Table 6, reveals the heterogeneous effects of decoding strategies on the Amazigh vocabulary. Notable improvements were observed particularly for five words: yach, yak, yas, yazz, and yey. The word yach showed the most marked improvement, achieving complete correction with the N-gram model (a PER of 0, representing a 100% improvement over the greedy reference).
The RNN-LM also had a positive effect on yak and yas, with relative reductions in phonetic error rate (PER) of 57.1% and 40.0%, respectively. Similarly, yey and yazz saw moderate gains of 50.0% and 33.3%, confirming the effectiveness of the recurrent model in capturing longer temporal dependencies.
On the other hand, a large set of words, such as ya, yaa, yab, yan, and yam, showed stable performance regardless of the decoding type. Four words, yakh, yef, yi, and you, achieved perfect recognition (PER = 0.0000) with all decoding strategies, indicating that the acoustic model alone was sufficiently robust for these sequences. This robustness can be explained by three main factors:
  • A high phonemic distinctiveness: the vowels /i/ and /u/ in these words are at opposite ends of the vowel spectrum.
  • Simple syllabic structures (CV/CVV): ‘yi’, ‘you’, and ‘yef’ have limited coarticulation and long, stable vowels, which facilitates the extraction of acoustic features.
  • Low lexical confusability: their closest phonetic neighbors are at an edit distance of at least 2, unlike ambiguous pairs such as 'yas' and 'yaz' (distance 1, PER = 0.0347).
The CNN-BiLSTM-CTC model therefore effectively captures the spectro-temporal characteristics of these distinctive, simple words; linguistic strategies become particularly useful in more complex phonetic contexts.
Detailed word-by-word analysis confirms that N-gram statistical models can outperform neural approaches in low-resource conditions, especially when based on a limited but high-quality lexicon. Indeed, N-gram language models mainly improve the recognition of ambiguous or phonetically similar words, while the acoustic model is sufficient for simple words.

7.2. Error Correction Mechanisms

Phoneme Insertion Correction: Insertions occur when the acoustic model predicts spurious phonemes. The N-gram model is more effective at correcting insertions because it penalizes phonotactically improbable sequences through its explicit lexical constraints and high-probability sequences. It removes insertions in short words such as yach and yal, achieving up to 100% correction. The RNN-LM partially improves results, by approximately 16%, in cases such as yahh, by exploiting distributed representations of phonetic features.
Phoneme Substitution Correction: Substitutions correspond to confusions between acoustically similar phonemes. They are better corrected by the RNN-LM, thanks to its ability to model longer-range dependencies and capture implicit phonetic features such as voicing and vowel height; it achieves gains of 40 to 57% on words such as yak, yas, and yey, while the N-gram model yields smaller gains (e.g., on yagw and yag).
Recovery of Phoneme Deletions: Deletions remain the most difficult to correct, as the language model cannot reintroduce phonemes that are absent from the acoustic signal. Only the word ‘yad’ shows a slight improvement (14%) with the RNN-LM. The persistent cases of yan and yam illustrate nasal confusion, /m/ and /n/, and require improvement in the acoustic model rather than linguistic correction.
As summarized in Table 7, the error-type analysis reveals differences in the ability of the N-gram and RNN-LM models to correct insertion, substitution, and deletion errors.

7.3. Influence of Language Models Across Amazigh Syllabic Structures

A syllabic structure analysis was performed to assess the impact of the language model on phonemic recognition performance. We studied the performance of different decoding strategies according to Amazigh syllabic structures. The results presented in Table 8 show that the contribution of the language model varies according to the phonotactic structure of Amazigh words.
Indeed, the simplest syllables (CVV) have a PER of around 0.016 and do not benefit significantly from language models. CVC syllables, which represent the most frequent syllabic structure with 22 words, benefit from a moderate gain of 3.6% with the N-gram language model, particularly for ambiguous consonants (/k/, /g/, /s/, /z/), while the RNN-LM improves context-sensitive cases, such as yak, yas, and yey, by correcting vowel or voicing confusion.
Finally, CVCC syllables, which are the most complex structures, record the maximum relative gain of 7.7% with the N-gram, particularly on the consonantal syllables yach and yagw, as the explicit modeling of four-phoneme dependencies fits perfectly with the maximum syllabic length of Amazigh words.
This analysis highlights an inverse relationship between syllabic simplicity and linguistic modeling efficiency. While simple syllables (CVV) are acoustically sufficient, linguistically complex structures (CVCC) benefit greatly from the explicit constraints of the linguistic model. The 4-gram model thus proves to be structurally optimal for Amazigh, corresponding precisely to the dominant syllabic window of four phonemes in this language.

7.4. Comparative Analysis with State-of-the-Art

Table 9 compares the performance of the proposed Amazigh ASR system with state-of-the-art studies.

8. Conclusions

This study presented a comprehensive comparative analysis of statistical and neural language model integration using the shallow fusion technique in an end-to-end Amazigh speech recognition framework. For a baseline ASR system for the Amazigh language based on a CNN-BiLSTM-CTC enhanced with an attention mechanism and designed for recognizing the 33-letter Tifinagh alphabet, we demonstrated that shallow fusion with a 4-gram language model achieved the most consistent improvements, reducing the phoneme error rate by 4% relative to the greedy baseline. The RNN-based model showed moderate gains, limited by the small corpus size.
At the linguistic level, the N-gram model proved better suited to the complex syllabic structures (CVCC) of the Amazigh language, while simple forms (CVV) were already handled adequately by the acoustic model. This leads to the conclusion that N-gram models can outperform neural approaches when lexical resources are limited but reliable. In the future, we plan to develop a novel hybrid decoding framework that combines the strengths of both N-gram and RNN-based models through shallow fusion and deep fusion.

Author Contributions

Conceptualization, M.T. and N.L.; methodology, M.T. and N.L.; software and implementation, M.T. and N.L.; validation and experiments, M.T., N.L., Y.G. and H.S.; writing—original draft, M.T.; formal analysis, M.T., N.L., Y.G. and H.S.; visualization, M.T. and N.L.; supervision, Y.G. and H.S.; project administration, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Tifdigit datasets used in this research are publicly available at https://github.com/mtel-dotcom (accessed on 2 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR | Automatic Speech Recognition
HMM | Hidden Markov Model
E2E | End-to-End
CNN | Convolutional Neural Network
BiLSTM | Bidirectional Long Short-Term Memory
LSTM | Long Short-Term Memory
CTC | Connectionist Temporal Classification
LM | Language Model
RNN | Recurrent Neural Network

References

  1. Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11. [Google Scholar] [CrossRef]
  2. Kandji, A.K.; Ba, C.; Ndiaye, S. State-of-the-Art Review on Recent Trends in Automatic Speech Recognition. In International Conference on Emerging Technologies for Developing Countries; Springer Nature: Cham, Switzerland, 2023; pp. 185–203. [Google Scholar]
  3. Prabhavalkar, R.; Hori, T.; Sainath, T.N.; Schlüter, R.; Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 325–351. [Google Scholar] [CrossRef]
  4. Slam, W.; Li, Y.; Urouvas, N. Frontier research on low-resource speech recognition technology. Sensors 2023, 23, 9096. [Google Scholar] [CrossRef] [PubMed]
  5. Abdou Mohamed, N.; Allak, A.; Gaanoun, K.; Benelallam, I.; Erraji, Z.; Bahafid, A. Multilingual speech recognition initiative for African languages. Int. J. Data Sci. Anal. 2024, 20, 3513–3528. [Google Scholar] [CrossRef]
  6. Barkani, F.; Hamidi, M.; Laaidi, N.; Zealouk, O.; Satori, H.; Satori, K. Amazigh speech recognition based on the Kaldi ASR toolkit. Int. J. Inf. Technol. 2023, 15, 3533–3540. [Google Scholar] [CrossRef]
  7. Boulal, H.; Hamidi, M.; Abarkan, M.; Barkani, J. Amazigh CNN speech recognition system based on Mel spectrogram feature extraction method. Int. J. Speech Technol. 2024, 27, 287–296. [Google Scholar] [CrossRef]
  8. Boulal, H.; Bouroumane, F.; Hamidi, M.; Barkani, J.; Abarkan, M. Exploring data augmentation for Amazigh speech recognition with convolutional neural networks. Int. J. Speech Technol. 2025, 28, 53–65. [Google Scholar] [CrossRef]
  9. Telmem, M.; Laaidi, N.; Ghanou, Y.; Hamiane, S.; Satori, H. Comparative study of CNN, LSTM and hybrid CNN-LSTM model in Amazigh speech recognition using spectrogram feature extraction and different gender and age dataset. Int. J. Speech Technol. 2024, 27, 1121–1133. [Google Scholar] [CrossRef]
  10. Telmem, M.; Laaidi, N.; Satori, H. The impact of MFCC, spectrogram, and Mel-Spectrogram on deep learning models for Amazigh speech recognition system. Int. J. Speech Technol. 2025, 1–14. [Google Scholar] [CrossRef]
  11. Guan, B.; Cao, J.; Wang, X.; Wang, Z.; Sui, M.; Wang, Z. Integrated method of deep learning and large language model in speech recognition. In Proceedings of the 2024 IEEE 7th International Conference on Electronic Information and Communication Technology (ICEICT), Xi’an, China, 31 July 2024–2 August 2024; IEEE: New York, NY, USA, 2024; pp. 487–490. [Google Scholar]
  12. Anh, N.M.T.; Sy, T.H. Improving speech recognition with prompt-based contextualized asr and llm-based re-predictor. In Interspeech; International Speech Communication Association (ISCA): Kos, Greece, 2024; Volume 2024, pp. 737–741. [Google Scholar]
  13. Zhou, S.; Xu, Z.; Zhang, M.; Xu, C.; Guo, Y.; Zhan, Z.; Zhang, R. Large language models for disease diagnosis: A scoping review. Npj Artif. Intell. 2025, 1, 9. [Google Scholar] [CrossRef]
  14. Telmem, M.; Ghanou, Y. The convolutional neural networks for Amazigh speech recognition system. TELKOMNIKA 2021, 19, 515–522. [Google Scholar] [CrossRef]
  15. Mukhamadiyev, A.; Mukhiddinov, M.; Khujayarov, I.; Ochilov, M.; Cho, J. Development of language models for continuous Uzbek speech recognition system. Sensors 2023, 23, 1145. [Google Scholar] [CrossRef]
  16. Liu, Z.; Venkateswaran, N.; Le Ferrand, É.; Prud’hommeaux, E. How important is a language model for low-resource ASR. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 206–213. [Google Scholar]
  17. Anoop, C.S.; Ramakrishnan, A.G. CTC-based end-to-end ASR for the low resource Sanskrit language with spectrogram augmentation. In Proceedings of the 2021 National Conference on Communications (NCC), Kanpur, India, 27–30 July 2021; IEEE: New York, NY, USA; pp. 1–6. [Google Scholar]
  18. Mamyrbayev, O.Z.; Oralbekova, D.O.; Alimhan, K.; Nuranbayeva, B.M. Hybrid end-to-end model for Kazakh speech recognition. Int. J. Speech Technol. 2023, 26, 261–270. [Google Scholar] [CrossRef]
  19. Labied, M.; Belangour, A.; Banane, M. Delve deep into End-To-End Automatic Speech Recognition Models. In Proceedings of the 2023 International Seminar on Application for Technology of Information and Communication (iSemantic), Semarang, Indonesia, 16–17 September 2023; IEEE: New York, NY, USA, 2023; pp. 164–169. [Google Scholar]
  20. Mori, D.; Ohta, K.; Nishimura, R.; Ogawa, A.; Kitaoka, N. Advanced language model fusion method for encoder-decoder model in Japanese speech recognition. In Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; IEEE: New York, NY, USA, 2021; pp. 503–510. [Google Scholar]
  21. El Ouahabi, S.; El Ouahabi, S.; Atounti, M. Comparative Study of Amazigh Speech Recognition Systems Based on Different Toolkits and Approaches. In E3S Web of Conferences; EDP Sciences: Les Ulis, France, 2023; Volume 412, p. 01064. [Google Scholar]
  22. Jorge, J.; Gimenez, A.; Silvestre-Cerda, J.A.; Civera, J.; Sanchis, A.; Juan, A. Live streaming speech recognition using deep bidirectional LSTM acoustic models and interpolated language models. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 148–161. [Google Scholar] [CrossRef]
  23. Paul, B.; Phadikar, S.; Bera, S.; Dey, T.; Nandi, U. Isolated word recognition based on a hyper-tuned cross-validated CNN-BiLSTM from Mel Frequency Cepstral Coefficients. Multimed. Tools Appl. 2025, 84, 17309–17328. [Google Scholar] [CrossRef]
  24. Ismael, M.K.; Hock, G.C.; Abdulrazzak, H.N. Mathematical Modelling of Engineering Problems. Int. Inf. Eng. Assoc. 2025, 12, 1893–1910. Available online: http://iieta.org/journals/mmep (accessed on 12 January 2020).
  25. Xue, J.; Zheng, T.; Han, J. Exploring attention mechanisms based on summary information for end-to-end automatic speech recognition. Neurocomputing 2021, 465, 514–524. [Google Scholar] [CrossRef]
  26. Alawdi, A. MultiheadSelfAttention vs, Traditional Encoders: A Benchmark Study on Precision and Recall in Tajweed Recognition. In Proceedings of the 2025 5th International Conference on Emerging Smart Technologies and Applications (eSmarTA), Ibb, Yemen, 5–6 August 2025; IEEE: New York, NY, USA, 2025; pp. 1–8. [Google Scholar]
  27. Addarrazi, I.; Zealouk, O.; Satori, H.; Satori, K. The Hmm Based Amazigh Digits Audiovisual Speech Recognition System. Math. Stat. Eng. Appl. 2022, 71, 2261–2278. [Google Scholar]
  28. Ouhnini, A.; Aksasse, B.; Ouanan, M. Towards an automatic speech-to-text transcription system: Amazigh language. Int. J. Adv. Comput. Sci. Appl. 2023, 14. [Google Scholar] [CrossRef]
  29. Satori, H.; ElHaoussi, F. Investigation Amazigh speech recognition using CMU tools. Int. J. Speech Technol. 2014, 17, 235–243. [Google Scholar] [CrossRef]
  30. Amin, N.A.M. Low-Resource Automatic Speech Recognition Domain Adaptation: A Case-Study in Aviation Maintenance. Doctoral Dissertation, Purdue University Graduate School, West Lafayette, IN, USA, 2023. [Google Scholar]
  31. Samin, A.M.; Kobir, M.H.; Kibria, S.; Rahman, M.S. Deep learning based large vocabulary continuous speech recognition of an under-resourced language Bangladeshi Bangla. Acoust. Sci. Technol. 2021, 42, 252–260. [Google Scholar] [CrossRef]
  32. Oh, J.; Cho, E.; Kim, J.H. Integration of WFST Language Model in Pre-trained Korean E2E ASR Model. KSII Trans. Internet Inf. Syst. 2024, 18, 1692–1705. [Google Scholar]
Figure 1. The proposed CNN-BiLSTM-CTC with attention architecture for Amazigh speech recognition.
Figure 2. Training results of the CNN-BiLSTM-CTC acoustic model for Amazigh ASR.
Figure 3. Evolution of training loss for the RNN-LM.
Table 1. Amazigh alphabet and the corresponding Tifinagh letters with typical syllable patterns used in this study.
Tifinagh | Latin | Word | Syllable pattern
ⴰ | a | yaa | CVV
ⴱ | b | yab | CVC
ⴳ | g | yag | CVC
ⴳⵯ | gw | yagw | CVCC
ⴷ | d | yad | CVC
ⴹ | ḍ | yadd | CVCC
ⴻ | e | yey | CVC
ⴼ | f | yaf | CVC
ⴽ | k | yak | CVC
ⴽⵯ | kw | yakw | CVCC
ⵀ | h | yah | CVC
ⵃ | ḥ | yahh | CVCC
ⵄ | ɛ | yaε | CVC
ⵅ | x | yax | CVCC
ⵇ | q | yaq | CVC
ⵉ | i | yi | CVC
ⵊ | j | yaj | CVC
ⵍ | l | yal | CVC
ⵎ | m | yam | CVC
ⵏ | n | yan | CVC
ⵓ | u | you | CVV
ⵔ | r | yar | CVC
ⵕ | ṛ | yarr | CVCC
ⵖ | γ | yiγ | CVC
ⵙ | s | yas | CVC
ⵚ | ṣ | yass | CVCC
ⵛ | c | yach | CVCC
ⵜ | t | yat | CVC
ⵟ | ṭ | yatt | CVCC
ⵡ | w | yaw | CVC
ⵢ | y | yay | CVC
ⵣ | z | yaz | CVC
ⵥ | ẓ | yazz | CVCC
Table 2. Recording parameters used for the preparation of the corpus.
Parameter | Value
Total number of audio files | 21,500
Base duration | 2 h 18 min
Sampling | 16 kHz, 16 bits
Wave format | Mono, WAV
Corpus | 33-letter Amazigh alphabet
Speakers | 50 (50% male and 50% female)
Table 3. RNN-LM and training configuration.
Parameter | Value
Number of LSTM layers | 1
Embedding dimension | 64
Classifier | Fully connected layer
Dropout | p = 0.3 (after LSTM)
Optimizer | AdamW
Initial learning rate | 0.0005
Weight decay | 0.01
β1, β2 | 0.9/0.98
Scheduler | Cosine annealing
T_max | 100 epochs
Minimum learning rate | η_min = 5 × 10−6
Loss function | Cross-entropy with label smoothing
Label smoothing | α = 0.1
Early stopping | Patience = 20 epochs
Training data | The Amazigh lexicon (33 words)
Data augmentation | Truncation, subsequence extraction, repetition
Augmentation factor | ≈×15
Convergence | 67 epochs
Final validation loss | 0.0234
Table 4. Amazigh lexicon with transcriptions based on the 33-letter Amazigh alphabet proposed by IRCAM.
Word | Transcription | Word | Transcription | Word | Transcription
yaa | y-aa | yagh | y-ae | yar | y-ae-r
yab | y-ae-b | yagw | y-ae-g-w | yarr | y-ae-r
yach | y-ae-ch | yah | y-aa | yas | y-ae-z
yad | y-ae-d | yahh | y-ae-hh | yass | y-ae-s
yadd | y-ae-d | yaj | y-ae-jh | yat | y-ae-t
yag | y-ae-g | yak | y-ae-k | yatt | y-ae-t
yakh | y-ae-k | yal | y-ae-l | yaw | y-ao
yakw | y-ae-k-w | yam | y-ae-m | yay | y-ey
yan | y-ae-n | yaq | y-ae-k | yaz | y-ae-z
yazz | y-ae-z | yef | y-eh-f | yey | y-iy
yi | y-iy | you | y-uw | ya | y-aa
Table 5. Comparative performance results across different decoding techniques.
Decoding Method | Greedy | Beam Search | Beam + 4-Gram LM | Beam + RNN-LM
PER | 0.0279 | 0.0273 | 0.0268 | 0.0277
Relative Improvement | Baseline | 2.30% | 4.00% | 0.9%
Table 6. Word-level recognition results using different decoding strategies.
Word | Greedy | Beam | N-Gram | RNN-LM
ya | 0.0208 | 0.0208 | 0.0208 | 0.0208
yaa | 0.0312 | 0.0312 | 0.0417 | 0.0417
yab | 0.0417 | 0.0417 | 0.0417 | 0.0417
yach | 0.0069 | 0.0069 | 0 | 0.0069
yad | 0.0486 | 0.0486 | 0.0486 | 0.0417
yadd | 0.0069 | 0.0069 | 0.0069 | 0.0069
yag | 0.0417 | 0.0347 | 0.0347 | 0.0417
yagh | 0.0104 | 0.0104 | 0.0104 | 0.0104
yagw | 0.0365 | 0.0365 | 0.0312 | 0.0469
yah | 0.0417 | 0.0417 | 0.0417 | 0.0521
yahh | 0.0417 | 0.0417 | 0.0347 | 0.0347
yaj | 0.0417 | 0.0417 | 0.0417 | 0.0417
yak | 0.0486 | 0.0347 | 0.0347 | 0.0208
yakh | 0 | 0 | 0 | 0
yakw | 0.0312 | 0.0312 | 0.0312 | 0.0365
yal | 0.0417 | 0.0417 | 0.0347 | 0.0347
yam | 0.0486 | 0.0486 | 0.0486 | 0.0486
yan | 0.0972 | 0.0972 | 0.0972 | 0.0972
yaq | 0.0278 | 0.0278 | 0.0278 | 0.0278
yar | 0.0347 | 0.0347 | 0.0347 | 0.0347
yarr | 0.0069 | 0.0069 | 0.0069 | 0.0069
yas | 0.0347 | 0.0347 | 0.0347 | 0.0208
yass | 0.0278 | 0.0278 | 0.0278 | 0.0417
yat | 0.0069 | 0.0069 | 0.0069 | 0.0069
yatt | 0.0278 | 0.0278 | 0.0278 | 0.0278
yaw | 0.0312 | 0.0312 | 0.0312 | 0.0312
yay | 0.0312 | 0.0312 | 0.0312 | 0.0521
yaz | 0.0139 | 0.0139 | 0.0139 | 0.0139
yazz | 0.0208 | 0.0208 | 0.0208 | 0.0139
yef | 0 | 0 | 0 | 0
yey | 0.0208 | 0.0208 | 0.0208 | 0.0104
yi | 0 | 0 | 0 | 0
you | 0 | 0 | 0 | 0
Table 7. Error type analysis by language model.
Error Type | N-gram (Improvement %) | RNN-LM (Improvement %)
Insertion | yach (100%), yahh (16.7%), yal (16.7%) | yahh (16.7%), yal (16.7%)
Substitution | yagw (14.3%), yag (16.7%) | yak (57.1%), yas (40.0%), yey (50.0%), yazz (33.3%)
Deletion | - | yad (14.3%)
Table 8. Average PER performance for syllabic structure.
Structure | Words | PER Greedy | PER Beam | PER N-Gram | PER RNN-LM | Error Reduction Range
CVV | 2 | 0.0156 | 0.0156 | 0.0208 | 0.0208 | ~0% (no gain)
CVC | 22 | 0.0303 | 0.0297 | 0.0292 | 0.0303 | 3.6%
CVCC | 9 | 0.0247 | 0.0244 | 0.0228 | 0.0236 | 7.7%
Table 9. Comparative analysis with State-of-the-Art Studies.
Reference | Studied Language | Models Used | Accuracy %
Samin et al. [31] | Bangladeshi Bangla | CNN-CTC + 5-gram LM | 86.07%
Oh et al. [32] | Korean speech | E2E model | 99.34%
Proposed work | 33-letter Amazigh | CNN-BiLSTM-CTC model | 97.32%
