Article

N-Gram and RNN-LM Language Model Integration for End-to-End Amazigh Speech Recognition

1 The Higher School of Technology in Meknes, Moulay Ismail University, Meknes 50000, Morocco
2 Faculty of Sciences Dhar Mahraz, Sidi Mohamed Ben Abdellah University, Fes 30000, Morocco
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 164; https://doi.org/10.3390/make7040164
Submission received: 23 October 2025 / Revised: 28 November 2025 / Accepted: 3 December 2025 / Published: 10 December 2025
(This article belongs to the Section Learning)

Abstract

This work investigates how different language modeling techniques affect the performance of an end-to-end automatic speech recognition (ASR) system for the Amazigh language. A CNN-BiLSTM-CTC model enhanced with an attention mechanism was used as the baseline. During decoding, two external language models were integrated using shallow fusion: a 4-gram model built with KenLM and a recurrent neural network language model (RNN-LM) trained on the same Tifdigit corpus. Four decoding methods were compared: greedy decoding, beam search, beam search with an N-gram language model, and beam search with a compact recurrent neural network language model. Experimental results on the Tifdigit dataset reveal a clear outcome: the N-gram language model outperforms the RNN-LM, achieving a phonetic error rate (PER) of 0.0268, a relative improvement of 4.0% over the greedy baseline, which translates into an accuracy of 97.32%. This suggests that N-gram models can outperform neural approaches when lexical resources are limited but reliable. The improved N-gram approach notably outperformed both simple beam search and the RNN neural language model. This improvement is due to higher-order context modeling, optimized interpolation weights, and adaptive lexical weighting tailored to the phonotactic structure of the Amazigh language.

1. Introduction

The field of automatic speech recognition (ASR) has seen significant advancements with the emergence of end-to-end (E2E) neural architectures, such as connectionist temporal classification (CTC) and encoder–decoder models with attention mechanisms [1]. Traditional Hidden Markov Model (HMM)-based systems and recent end-to-end (E2E) architectures differ in how they represent linguistic information. HMM systems consist of the acoustic, pronunciation, and language models. In contrast, end-to-end (E2E) systems simplify the ASR pipeline by replacing separate components with a single trainable network that directly maps acoustic signals to text sequences. However, despite this architectural simplicity, the end-to-end models have often been limited in their ability to represent the linguistic richness and syntactic regularity of natural language. This limitation is particularly noticeable in contexts where resources are limited and annotated text and speech data are scarce, like the Amazigh language [2].
The integration of an external language model (LM) has been shown to be an effective way to address this limitation. Linguistic knowledge is injected during decoding through one of the existing fusion techniques, the most straightforward and widely adopted of which is shallow fusion [3]. Two primary categories of language models can be distinguished. Statistical N-gram models, such as trigrams, estimate the probability of a word based on its preceding words. Recurrent neural network language models (RNN-LMs) capture long-term dependencies beyond the fixed context windows of N-gram models, allowing for more accurate contextual modeling. Existing studies of low-resource languages such as Amazigh have focused primarily on the acoustic and architectural aspects of recognition systems, while the integration of external language models during decoding has received little attention [4,5]. Furthermore, the comparative impact of statistical and neural language models has yet to be examined. Against this backdrop, this work conducts a comparative analysis of the influence of these two language modeling paradigms, N-gram and RNN-LM, on the decoding performance of an Amazigh speech recognition system. Our objective is to determine whether the enhanced contextual modeling capability of neural language models leads to measurable improvements in recognition accuracy and whether these improvements justify the additional computational complexity.
This study aims to leverage the strengths of different language modeling approaches, which could be particularly beneficial for recognizing Amazigh speech, a linguistically rich yet underdeveloped language in the field of speech recognition.
The main contributions of this work are as follows:
  • Development of a baseline ASR system for the Amazigh language based on a convolutional neural network, bidirectional long short-term memory, and connectionist temporal classification (CNN-BiLSTM-CTC) model enhanced with an attention mechanism, designed for recognizing the 33-letter Tifinagh alphabet.
  • Compact RNN-LM architecture optimized for low-resource scenarios.
  • Integration of an external N-gram language model via shallow fusion during decoding, effectively combining acoustic and linguistic knowledge and improving recognition accuracy.
  • Integration of an external recurrent neural network language model (RNN-LM) via shallow fusion.
  • Utilization of the Tifdigit corpus, comprising 8940 speech samples recorded by 50 native speakers, as a key resource in the study of low-resource languages such as Amazigh.
  • First comprehensive comparison of four decoding methods for Amazigh ASR, including an in-depth analysis of decoding strategies and error patterns covering word-level improvements, error correction mechanisms (insertion, substitution, and deletion), and syllabic structures.
The paper is organized as follows. Section 1 is the introduction and Section 2 presents the motivation. Section 3 reviews related work on automatic speech recognition (ASR). Section 4 details the proposed CNN-BiLSTM-CTC methodology. Section 5 presents the Amazigh language. Section 6 describes the language model integration for the Amazigh speech recognition system. Section 7 reports and discusses the results, and Section 8 concludes the study and suggests future research directions.

2. Motivation

In automatic speech recognition (ASR) for low-resource languages such as Amazigh, most previous work has focused on acoustic modeling, while the integration of external language models during decoding remains limited, even though these languages are morphologically rich. Comparative evaluations of language models for such languages are also scarce.
To address this gap, we propose an in-depth analysis of statistical N-gram and neural RNN-LM language models. Our objective is to analyze the impact of these models, taken separately, on decoding accuracy, efficiency, and robustness, in order to improve the performance of ASR systems for low-resource languages.

3. Related Works

Amazigh automatic speech recognition (ASR) has evolved from conventional statistical HMM-based models to modern deep learning and transformer-based architectures.
Barkani et al. [6] developed an Amazigh speech recognition (ASR) system that recognizes isolated words using the Kaldi toolkit and HMM statistical models; this system achieved good results with small vocabularies. Subsequently, several studies have turned to deep learning architectures, such as convolutional neural networks (CNNs), which learn feature representations directly from the signal. Boulal et al. [7] constructed a Mel-spectrogram-CNN system for the Amazigh language with data augmentation to improve the robustness of the CNN models.
Boulal et al. [8] developed hybrid CNN-LSTM and CNN-CTC models that showed better performance in recognizing Amazigh numbers and commands.
Telmem et al. [9] conducted comparative studies on Amazigh speech recognition, evaluating CNN, LSTM, and hybrid CNN-LSTM architectures. The results on the Tifdigit corpus in terms of accuracy highlight the superiority of CNNs in Amazigh speech recognition, with an accuracy rate of 88% when CNNs are trained with the female category subset of data.
Another work by Telmem et al. [10] presents a comparative study with different feature extraction techniques, such as MFCCs, spectrograms, and Mel-spectrograms. Their results showed that CNN and Bi-LSTM models offer the best performance.
Most state-of-the-art studies in automatic speech recognition (ASR) for low-resource languages, such as Amazigh, have mainly focused on improving acoustic modeling, while the integration of external language models remains rare despite the morphological richness of these languages.
Guan et al. [11] proposed an integrated approach combining the deep learning models DNN, CNN, LSTM, and a transformer language model. The evaluation was carried out on three corpora (TIMIT, LibriSpeech, and Common Voice) with the aim of improving accuracy.
Anh and Sy [12] integrated a language model based on LLM approaches. The results show that the system achieved a relative reduction in WER of up to 33%, particularly in noisy contexts.
Similarly, Zhang et al. [13] integrated an LLM-type language model directly using deep fusion during training, reducing the CER from 51.1% to 17.65% with only 10 h of data.

4. Methodology

4.1. Principle of Speech Recognition (ASR)

Automatic speech recognition (ASR) aims to convert an acoustic signal (X) into a text sequence (Y) according to [14]:
\hat{Y} = \arg\max_{Y} P(Y \mid X)
Applying Bayes’ theorem, this expression can be rewritten as follows:
\hat{Y} = \arg\max_{Y} P(X \mid Y) \cdot P(Y)
where
  • P(X|Y): describes the probability of an audio signal given a sequence of words (acoustic model).
  • P(Y): expresses the probability of a sequence of words in the language (language model).

4.2. Language Models Used in This Work

The language model (LM) is an essential component of an ASR system. It plays an important role in constraining the recognition process by estimating the probability of a sequence of words in the language. It captures grammatical and syntactic regularities that help resolve acoustic ambiguities [15]. The probability of a sentence composed of M words is defined as follows:
P(Y) = \prod_{i=1}^{M} P(y_i \mid y_1, y_2, \ldots, y_{i-1})
This formulation estimates the likelihood of each word based on the words that precede it, thus ensuring linguistic consistency during decoding.

N-Gram Model

The N-gram model estimates the probability of a word using the (n − 1) preceding words. Thus, the probability of a sequence Y = y_1, \ldots, y_N is approximated by [15,16]:
P(Y) = P(y_1, y_2, \ldots, y_N) \approx \prod_{i=1}^{N} P(y_i \mid y_{i-n+1}, \ldots, y_{i-1})
where
  • n = 1: unigram model.
  • n = 2: bigram model.
  • n = 3: trigram model
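For illustration (not taken from the paper), the maximum-likelihood estimate of a trigram probability can be computed directly from corpus counts; the toy phoneme corpus and symbols below are invented for the example.

from collections import Counter

# Toy phoneme corpus; sequences are padded so that every trigram has two predecessors.
corpus = [["<s>", "<s>", "y", "ae", "ch", "</s>"],
          ["<s>", "<s>", "y", "ae", "k", "</s>"]]

trigrams = Counter(tuple(seq[i:i + 3]) for seq in corpus for i in range(len(seq) - 2))
bigrams = Counter(tuple(seq[i:i + 2]) for seq in corpus for i in range(len(seq) - 1))

def p_trigram(w, h1, h2):
    """Maximum-likelihood estimate P(w | h1 h2) = count(h1 h2 w) / count(h1 h2)."""
    denom = bigrams[(h1, h2)]
    return trigrams[(h1, h2, w)] / denom if denom else 0.0

print(p_trigram("ch", "y", "ae"))  # 0.5: 'ch' follows the context 'y ae' in one of two cases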

4.3. Connectionist Temporal Classification (CTC)

The CTC method automatically aligns audio sequences X = [x_1, x_2, \ldots, x_T] and text sequences Y = [y_1, y_2, \ldots, y_U] without manual segmentation. It introduces an empty symbol (ε) to handle silences. The probability of a transcription y resulting from an input X is [17,18]:
P(y \mid X) = \sum_{\pi \in \beta^{-1}(y)} P(\pi \mid X)
where β is a function that merges repeated labels and removes empty symbols [19]. The CTC loss function is defined by
\mathcal{L}_{\mathrm{CTC}} = -\ln P(y \mid X)
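For readers implementing this objective, the following minimal PyTorch sketch shows how the CTC loss is typically computed; the tensor shapes and label indices are illustrative assumptions, with index 0 reserved for the blank symbol ε.

import torch
import torch.nn as nn

T, N, C = 50, 4, 34  # time steps, batch size, classes (33 letters + blank), all illustrative
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # acoustic outputs
targets = torch.randint(1, C, (N, 10))                 # label indices; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)  # -ln P(y|X), batch-averaged
loss.backward()  # gradients for the acoustic model parameters in a real training loop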

4.4. Shallow Fusion Strategy

Shallow fusion is a strategy that combines the acoustic scores of the recognition model and the probabilities of the language model during decoding [20]. The final decision is given by
\hat{Y} = \arg\max_{Y} \left[ \log P_{\mathrm{AM}}(X \mid Y) + \lambda \log P_{\mathrm{LM}}(Y) \right]
where λ is a weighting parameter that balances the contribution of the language model. This technique improves decoding reliability and reduces transcription errors.
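The sketch below (our own illustration, not the authors' implementation) shows how this criterion is typically applied when a beam-search prefix is expanded: the acoustic log-probability of each candidate phoneme is combined with a λ-weighted log-probability from the language model; the scores and the λ value are arbitrary.

import math

def fused_score(acoustic_logp, lm_prob, lam=0.25):
    """Shallow fusion score: acoustic log-probability plus a lambda-weighted LM log-probability."""
    return acoustic_logp + lam * math.log(max(lm_prob, 1e-12))

# Hypothetical expansion of the prefix 'y ae' with two competing phonemes:
# (acoustic log-probability, LM probability of the phoneme given the prefix).
candidates = {"ch": (-1.1, 0.50),
              "k":  (-0.9, 0.05)}
best = max(candidates, key=lambda ph: fused_score(*candidates[ph]))
print(best)  # 'ch': the language model outweighs the slightly better acoustic score of 'k'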

4.5. CNN-BiLSTM-CTC Architecture

A CNN-BiLSTM-CTC model enhanced with an attention mechanism designed for recognizing the 33-letter Tifinagh alphabet was used as the baseline in this work.
CNNs are particularly effective at extracting local and global features from acoustic signals, thanks to their convolutions and pooling layers, which reduce dimensionality while preserving essential spectro-temporal information. ReLU activation functions introduce the necessary non-linearity, while normalization and subsampling improve the robustness and generalization of the model [21].
The Bi-LSTM network consists of two LSTM layers oriented in opposite directions in order to model temporal dependencies: one processes the data from left to right (forward direction) and the other from right to left (backward direction). This double propagation makes it possible to better capture the relationships between successive phonemes and to reinforce the temporal consistency of the prediction [22,23].
Connectionist temporal classification (CTC) enables the model to learn phoneme–time correspondences by calculating the probability of all possible alignments and selecting the most probable sequence.
Finally, the addition of the attention mechanism allows processing to focus on the most relevant parts of a sequence and assembles relevant information to improve the overall representation [24,25,26]. In automatic speech recognition (ASR), it calculates normalized similarity scores using a SoftMax function, which highlights the most informative acoustic segments. This operation improves audio–text alignment and the robustness of the system in the face of noise and linguistic variations.

5. Amazigh Language

IRCAM proposes a writing system consisting of the following [27]:
  • 27 consonants: labial (ⴼ, ⴱ, ⵎ), dental (ⵜ, ⴷ, ⵟ, ⴹ, ⵏ, ⵔ, ⵕ, ⵍ), alveolar (ⵙ, ⵣ, ⵚ, ⵥ), palatal (ⵛ, ⵊ), velar (ⴽ, ⴳ), labiovelar (ⴽⵯ, ⴳⵯ), uvular (ⵇ, ⵅ, ⵖ), pharyngeal (ⵃ, ⵄ), and laryngeal (ⵀ);
  • 2 semi-consonants: ⵢ and ⵡ;
  • 4 vowels: the full vowels ⴰ, ⵉ, and ⵓ, and the neutral vowel (schwa) ⴻ.
Syllabic structure plays a crucial role in the phonological organization of Amazigh, which helps to maintain the regularity and fluidity of the phonetic system. A fundamental phonotactic principle underlies the segmentation of syllables in Amazigh: each syllable must contain a vowel, which forms its nucleus, and may be preceded and/or followed by one or more consonants [28,29]. V means ‘vowel alone’, CV ‘consonant + vowel’, VC ‘vowel + consonant’, CVC ‘consonant + vowel + consonant’, and CCV ‘complex onset followed by a vowel’, the latter being more marginal (Table 1).
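To make these pattern labels concrete, the following short sketch (our own simplified illustration, using a basic Latin vowel set for the romanized forms) maps a word to its CV skeleton.

VOWELS = set("aeiou")  # simplified vowel set for the romanized forms used in Table 1

def cv_pattern(word: str) -> str:
    """Map each letter to V (vowel) or C (consonant), e.g. 'yach' -> 'CVCC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

print(cv_pattern("yach"), cv_pattern("yaa"), cv_pattern("yak"))  # CVCC CVV CVC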

6. Language Model Integration for Amazigh Speech Recognition System

6.1. System Description

The proposed Amazigh speech recognition system, illustrated in Figure 1, adopts an end-to-end CNN-BiLSTM-CTC architecture enhanced with an attention mechanism and trained with the CTC loss. The input audio is converted into 80-dimensional Mel-spectrograms. The model consists of three convolutional layers (80→320 channels), a three-layer BiLSTM (320 units per direction), and a sixteen-head self-attention module, followed by a fully connected classifier (640→26).
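A minimal PyTorch sketch of this configuration is given below; hyperparameters that are not stated in the text, such as convolution kernel sizes and padding, are assumptions.

import torch
import torch.nn as nn

class CNNBiLSTMAttnCTC(nn.Module):
    def __init__(self, n_mels=80, conv_ch=320, lstm_hidden=320, n_heads=16, n_classes=26):
        super().__init__()
        # Three 1-D convolutions over the Mel-frequency axis (80 -> 320 channels).
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, conv_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Three-layer bidirectional LSTM, 320 units per direction -> 640-dimensional states.
        self.bilstm = nn.LSTM(conv_ch, lstm_hidden, num_layers=3,
                              batch_first=True, bidirectional=True, dropout=0.15)
        # Sixteen-head self-attention over the 640-dimensional BiLSTM states.
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, mel):                  # mel: (batch, n_mels, time)
        x = self.conv(mel).transpose(1, 2)   # -> (batch, time, conv_ch)
        x, _ = self.bilstm(x)                # -> (batch, time, 640)
        x, _ = self.attn(x, x, x)            # self-attention, same shape
        return self.classifier(x).log_softmax(dim=-1)  # CTC-ready log-probabilities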
Training uses the AdamW optimizer with cosine annealing, dropout = 0.15, and speed perturbation for augmentation. Two language models guide decoding: a 4-gram KenLM with Kneser–Ney smoothing and an LSTM-RNN LM (embedding = 64, hidden = 128) trained with the AdamW optimizer and label smoothing = 0.1.
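The optimization setup can be sketched as follows; the numerical values are borrowed from the RNN-LM settings in Table 3 purely for illustration, since the exact values used for the acoustic model are not reported.

import torch
import torch.nn as nn

model = nn.Linear(80, 26)  # stand-in for the CNN-BiLSTM-attention network sketched above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=5e-6)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 80)).pow(2).mean()  # placeholder loss; CTC loss in practice
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-annealed learning rate, one step per epoch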
Final decoding employs CTC beam search with shallow fusion (λ_LM = 0.25–0.30, beam = 5–8), combining acoustic and linguistic probabilities for optimal phoneme sequence prediction.

6.2. Corpus

In these experiments, we trained and tested an enhanced CNN-BiLSTM-CTC model with an attention mechanism using the Tifdigit dataset [29], which was designed to recognize the 33 letters of the Amazigh alphabet. The Tifdigit corpus includes the Amazigh alphabet (33 letters) and the first 10 Amazigh digits. Approximately fifty native Tarifit speakers from Morocco were invited to pronounce each letter and digit ten times. Consequently, the corpus contains ten repetitions of each character per speaker, yielding a total of 21,500 audio files ((33 letters + 10 digits) × 10 repetitions × 50 speakers).
The recordings were made in a noise-free environment using the same microphone, as well as the open-source software Audacity and Wavesurfer, on a Linux system. Table 2 presents the recording parameters and dataset configuration used for the preparation of the Tifdigit corpus.
To evaluate our proposed models, we used a subset of the corpus consisting of 8942 audio files representing the 33 Amazigh letters, divided into training and testing sets, comprising 82% and 18% of the data, respectively, using a balanced, stratified approach. To ensure speaker independence and prevent data leakage, all recordings from the same speaker were assigned to the same partition (either the training or testing set).
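One way to obtain such a speaker-independent split (an illustration only; the authors' exact procedure is not described beyond this paragraph) is scikit-learn's GroupShuffleSplit, grouping files by speaker identifier; the file names below are hypothetical.

from sklearn.model_selection import GroupShuffleSplit

# Hypothetical file names and labels; the speaker identifier is taken from the file-name prefix.
files = ["spk01_yach_01.wav", "spk01_yak_01.wav", "spk02_yach_01.wav", "spk02_yak_01.wav"]
labels = ["yach", "yak", "yach", "yak"]
speakers = [f.split("_")[0] for f in files]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.18, random_state=42)
train_idx, test_idx = next(splitter.split(files, labels, groups=speakers))
# All recordings of a given speaker end up on the same side of the split.
print(sorted({speakers[i] for i in train_idx} & {speakers[i] for i in test_idx}))  # []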
All experiments were conducted on a CUDA-enabled GPU using PyTorch 2.0.1, in the Google Colaboratory Jupyter notebook service with Python 3, on an Ubuntu 18.04 Lenovo laptop with a 13th Gen Intel(R) Core(TM) i7-13620H (2.40 GHz) and 16 GB of RAM.

6.3. Language Model

The language model is a key element in our strategy. It plays an essential role in improving the linguistic consistency of the output sequences generated by the acoustic model. It expresses the probability of a sequence of words in the language. In this work, we conduct a comparative analysis of the influence of two language modeling paradigms, N-gram and RNN-LM, on the decoding performance of an Amazigh speech recognition system.
The N-gram language model was trained over the 33-letter Amazigh alphabet proposed by IRCAM and employs 4-gram interpolation with smoothing α = 0.1 and weights λ4 = 0.40, λ3 = 0.30, λ2 = 0.20, and λ1 = 0.10.
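The interpolation described here can be written out as follows; this is a sketch that assumes add-α smoothing over raw count tables, which are represented by hypothetical dictionaries.

# Interpolated 4-gram probability with the weights and smoothing reported above.
LAMBDAS = {4: 0.40, 3: 0.30, 2: 0.20, 1: 0.10}
ALPHA = 0.1
VOCAB_SIZE = 33

def interpolated_prob(word, history, counts, context_counts):
    """P(word | history) as a weighted sum of add-alpha smoothed 1- to 4-gram estimates.

    counts[n] maps an n-gram tuple to its count; context_counts[n] maps the
    (n-1)-gram context to its count. Both are hypothetical lookup tables.
    """
    p = 0.0
    for n, lam in LAMBDAS.items():
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        num = counts[n].get(context + (word,), 0) + ALPHA
        den = context_counts[n].get(context, 0) + ALPHA * VOCAB_SIZE
        p += lam * (num / den)
    return p

empty = {n: {} for n in LAMBDAS}
print(interpolated_prob("ch", ["y", "ae"], empty, empty))  # uniform fallback, about 1/33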
The RNN language model, built with an LSTM layer (128 hidden units) and 64-dimensional embedding, is trained with AdamW, dropout = 0.3, and early stopping at 100 epochs.
The RNN language model has a compact architecture that is suited to low-resource environments. It consists of a single-layer, unidirectional long short-term memory (LSTM) network, an encoding module, and a fully connected classifier. Training uses the AdamW optimizer with cosine scheduling, as well as several regularization strategies such as label smoothing, dropout, and an early stopping mechanism. Training the model on heavily augmented phonemic sequences from a 33-word Amazigh lexicon resulted in stable convergence. The detailed architectural and training hyperparameters of the RNN language model are summarized in Table 3.
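A minimal PyTorch sketch consistent with Table 3 (64-dimensional embedding, a single 128-unit unidirectional LSTM, dropout 0.3, cross-entropy with label smoothing 0.1) is given below; the vocabulary size and batch dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class CompactRNNLM(nn.Module):
    def __init__(self, vocab_size=40, emb_dim=64, hidden=128, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1, batch_first=True)  # unidirectional
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):               # tokens: (batch, sequence_length)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(self.drop(h))        # logits for the next token at every position

model = CompactRNNLM()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing as in Table 3
tokens = torch.randint(0, 40, (8, 6))                  # toy batch of phoneme-index sequences
logits = model(tokens[:, :-1])
loss = criterion(logits.reshape(-1, 40), tokens[:, 1:].reshape(-1))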
A dedicated Amazigh lexicon was created manually to facilitate the decoding process presented in Table 4. It provides standardized phonetic transcriptions based on the 33-letter Amazigh alphabet defined by IRCAM.

6.4. Decoding Strategies

Four decoding strategies were used to evaluate the performance: (1) greedy decoding, which selects the most probable phoneme at each time step and provides a fast baseline; (2) beam search decoding, which maintains multiple hypotheses to improve accuracy; (3) beam search with a 4-gram language model; and (4) beam search with an RNN-LM, the latter two using optimized weights (beam width = 8 and λ = 0.25) to combine acoustic and linguistic information for enhanced recognition performance.
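Greedy decoding, the first of these strategies, can be sketched as follows: take the most probable class at each frame, collapse consecutive repetitions, and remove the blank symbol (an illustration, with random log-probabilities standing in for the model output).

import torch

def greedy_ctc_decode(log_probs, blank=0):
    """log_probs: (time, classes) CTC outputs. Collapse repeated labels, then remove blanks."""
    best = torch.argmax(log_probs, dim=-1).tolist()
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

frames = torch.log_softmax(torch.randn(50, 34), dim=-1)  # random stand-in for model output
print(greedy_ctc_decode(frames))                         # phoneme indices for one utterance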

6.5. Evaluation Metrics

The performance of the CNN-BiLSTM-CTC model was evaluated using the phone error rate (PER) and exact match accuracy, which were computed using the JiWER package [30]. PER is defined as follows:
PER = (S + D + I)/N
This measures the proportion of incorrectly recognized phonemes, where S: substitutions; D: deletions; I: insertions; and N: total phonemes. The exact match accuracy represents the percentage of sequences recognized perfectly and provides a stringent measure of transcription precision.
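This definition can be computed from a standard Levenshtein alignment; the sketch below is our own illustration, independent of the JiWER package, and counts substitutions, deletions, and insertions over phoneme lists.

def per(reference, hypothesis):
    """Phoneme error rate (S + D + I) / N via a dynamic-programming edit distance."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                            # deletions
    for j in range(m + 1):
        d[0][j] = j                            # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

print(per(["y", "ae", "ch"], ["y", "ae", "k"]))  # one substitution over three phonemes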

7. Results and Discussion

To investigate the influence of external language models on the performance of our end-to-end Amazigh ASR system, we integrated two types of language models, a 4-gram statistical model and a recurrent neural network language model (RNN-LM), via shallow fusion into the CNN-BiLSTM-CTC acoustic model during the decoding process. The training results of the acoustic model are illustrated in Figure 2. Four decoding approaches were evaluated on 1584 test utterances: greedy decoding, beam search, beam search with a 4-gram LM, and beam search with an RNN-LM.
The results in Table 5 show that the improved 4-gram language model achieved the best overall performance, with a phonetic error rate (PER) of 0.0268, representing a relative improvement of 4.0% over the greedy reference model. This translates into an accuracy of 97.32%, and an exact word match rate of 93.4%. This gain is due to higher-order context modeling, optimized interpolation weights, and adaptive lexical weighting tailored to the phonotactic structure of Amazigh. The improved N-gram approach notably outperformed simple beam search and the RNN neural language model.
The RNN-LM achieved only modest improvements, with a phonetic error rate (PER) of 0.0277, a relative improvement of +0.9% over the greedy baseline, compared with +4.0% for the N-gram model. The underperformance of the RNN-LM is primarily the result of data limitations rather than inherent algorithmic limitations. The evolution of the RNN-LM training loss is presented in Figure 3, highlighting the model's convergence behavior under limited data conditions. The 33 words in the lexicon generate only 1447 training sequences, which is insufficient for a neural model to learn robust phonotactic regularities.
Furthermore, there are certain limitations to our augmentation strategy. The 33 words in the lexicon were expanded through 15 repetitions of each sequence and by extracting the prefixes and suffixes of two to five phonemes, which were then divided into sliding windows. This approach generates a total of 1447 sequences, which is insufficient to provide the model with exploitable phonotactic diversity. Conversely, the N-gram model makes better use of an extremely limited context (less than 50 words and less than 2000 sequences), which explains its significant advantage.

7.1. Word-Level Performance Analysis

A detailed analysis by word, summarized in Table 6, reveals the heterogeneous effects of decoding strategies on the Amazigh vocabulary. Notable improvements were observed particularly for five words: yach, yak, yas, yazz, and yey. The word yach showed the most marked improvement, achieving complete correction with the N-gram model (a PER of 0, representing a 100% improvement over the greedy reference).
The RNN-LM also had a positive effect on yak and yas, with relative reductions in phonetic error rate (PER) of 57.1% and 40.0%, respectively. Similarly, yey and yazz saw moderate gains of 50.0% and 33.3%, confirming the effectiveness of the recurrent model in capturing longer temporal dependencies.
On the other hand, a large set of words, such as ya, yaa, yab, yan, and yam, showed stable performance regardless of the decoding type. Four words, yakh, yef, yi, and you, achieved perfect recognition (PER = 0.0000) with all decoding strategies, indicating that the acoustic model alone was sufficiently robust for these sequences. This robustness can be explained by three main factors:
  • A high phonemic distinctiveness: the vowels /i/ and /u/ in these words are at opposite ends of the vowel spectrum.
  • Simple syllabic structures (CV/CVV): ‘yi’, ‘you’, and ‘yef’ have limited coarticulation and long, stable vowels, which facilitates the extraction of acoustic features.
  • Low lexical confusability: their closest phonetic neighbors are at an edit distance of at least 2, unlike ambiguous pairs such as 'yas' and 'yaz' (distance 1, PER = 0.0347).
The CNN-BiLSTM-CTC model therefore effectively captures the spectro-temporal characteristics of these distinctive, simple words; linguistic strategies become particularly useful in more complex phonetic contexts.
Detailed word-by-word analysis confirms that N-gram statistical models can outperform neural approaches in low-resource conditions, especially when based on a limited but high-quality lexicon. Indeed, N-gram language models mainly improve the recognition of ambiguous or phonetically similar words, while the acoustic model is sufficient for simple words.

7.2. Error Correction Mechanisms

Phoneme Insertion Correction: Insertions occur when the acoustic model predicts spurious phonemes. The N-gram model is more effective at correcting insertions because it penalizes phonotactically improbable sequences through its explicit lexical constraints and high-probability sequences. It removes insertions in short words such as yach and yal, achieving up to 100% correction. The RNN-LM partially improves results, by approximately 16%, in cases such as yahh, by exploiting distributed representations of phonetic features.
Phoneme Substitution Correction: Substitutions correspond to confusions between acoustically similar phonemes. They are better corrected by the RNN-LM, thanks to its ability to model longer-range dependencies and capture implicit phonetic features such as voicing and vowel height; it achieves gains of 40 to 57% on words such as yak, yas, and yey, while the N-gram model yields smaller gains (e.g., on yagw and yag).
Recovery of Phoneme Deletions: Deletions remain the most difficult to correct, as the language model cannot reintroduce phonemes that are absent from the acoustic signal. Only the word ‘yad’ shows a slight improvement (14%) with the RNN-LM. The persistent cases of yan and yam illustrate nasal confusion, /m/ and /n/, and require improvement in the acoustic model rather than linguistic correction.
As summarized in Table 7, the error-type analysis reveals differences in the ability of the N-gram and RNN-LM models to correct insertion, substitution, and deletion errors.

7.3. Influence of Language Models Across Amazigh Syllabic Structures

A syllabic structure analysis was performed to assess the impact of the language model on phonemic recognition performance. We studied the performance of different decoding strategies according to Amazigh syllabic structures. The results presented in Table 8 show that the contribution of the language model varies according to the phonotactic structure of Amazigh words.
Indeed, the simplest syllables (CVV) have a PER of around 0.016 and do not benefit significantly from language models. CVC syllables, which represent the most frequent syllabic structure with 22 words, benefit from a moderate gain of 3.6% with the N-gram language model, particularly for ambiguous consonants (/k/, /g/, /s/, /z/), while the RNN-LM improves context-sensitive cases, such as yak, yas, and yey, by correcting vowel or voicing confusion.
Finally, CVCC syllables, which are the most complex structures, record the maximum relative gain of 7.7% with the N-gram, particularly on the consonantal syllables yach and yagw, as the explicit modeling of four-phoneme dependencies fits perfectly with the maximum syllabic length of Amazigh words.
This analysis highlights an inverse relationship between syllabic simplicity and linguistic modeling efficiency. While simple syllables (CVV) are acoustically sufficient, linguistically complex structures (CVCC) benefit greatly from the explicit constraints of the linguistic model. The 4-gram model thus proves to be structurally optimal for Amazigh, corresponding precisely to the dominant syllabic window of four phonemes in this language.

7.4. Comparative Analysis with State-of-the-Art

Table 9 compares the performance of the proposed Amazigh ASR system with state-of-the-art studies.

8. Conclusions

This study presented a comprehensive comparative analysis of statistical and neural language model integration using the shallow fusion technique in an end-to-end Amazigh speech recognition framework. For a baseline ASR system for the Amazigh language based on a CNN-BiLSTM-CTC enhanced with an attention mechanism and designed for recognizing the 33-letter Tifinagh alphabet, we demonstrated that shallow fusion with a 4-gram language model achieved the most consistent improvements, reducing the phoneme error rate by 4% relative to the greedy baseline. The RNN-based model showed moderate gains, limited by the small corpus size.
At the linguistic level, the N-gram model proved better suited to the complex syllabic structures (CVCC) of the Amazigh language, while simple forms (CVV) were already handled adequately by the acoustic model. This leads to the conclusion that N-gram models can outperform neural approaches when lexical resources are limited but reliable. In the future, we plan to develop a novel hybrid decoding framework that combines the strengths of both N-gram and RNN-based models through shallow fusion and deep fusion.

Author Contributions

Conceptualization, M.T. and N.L.; methodology, M.T. and N.L.; software and implementation, M.T. and N.L.; validation and experiments, M.T., N.L., Y.G. and H.S.; writing—original draft, M.T.; formal analysis, M.T., N.L., Y.G. and H.S.; visualization, M.T. and N.L.; supervision, Y.G. and H.S.; project administration, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Tifdigit datasets used in this research are publicly available at https://github.com/mtel-dotcom (accessed on 2 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASR | Automatic Speech Recognition
HMM | Hidden Markov Model
E2E | End-to-End
CNN | Convolutional Neural Network
BiLSTM | Bidirectional Long Short-Term Memory
LSTM | Long Short-Term Memory
CTC | Connectionist Temporal Classification
LM | Language Model
RNN | Recurrent Neural Network

References

  1. Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11. [Google Scholar] [CrossRef]
  2. Kandji, A.K.; Ba, C.; Ndiaye, S. State-of-the-Art Review on Recent Trends in Automatic Speech Recognition. In International Conference on Emerging Technologies for Developing Countries; Springer Nature: Cham, Switzerland, 2023; pp. 185–203. [Google Scholar]
  3. Prabhavalkar, R.; Hori, T.; Sainath, T.N.; Schlüter, R.; Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 325–351. [Google Scholar] [CrossRef]
  4. Slam, W.; Li, Y.; Urouvas, N. Frontier research on low-resource speech recognition technology. Sensors 2023, 23, 9096. [Google Scholar] [CrossRef] [PubMed]
  5. Abdou Mohamed, N.; Allak, A.; Gaanoun, K.; Benelallam, I.; Erraji, Z.; Bahafid, A. Multilingual speech recognition initiative for African languages. Int. J. Data Sci. Anal. 2024, 20, 3513–3528. [Google Scholar] [CrossRef]
  6. Barkani, F.; Hamidi, M.; Laaidi, N.; Zealouk, O.; Satori, H.; Satori, K. Amazigh speech recognition based on the Kaldi ASR toolkit. Int. J. Inf. Technol. 2023, 15, 3533–3540. [Google Scholar] [CrossRef]
  7. Boulal, H.; Hamidi, M.; Abarkan, M.; Barkani, J. Amazigh CNN speech recognition system based on Mel spectrogram feature extraction method. Int. J. Speech Technol. 2024, 27, 287–296. [Google Scholar] [CrossRef]
  8. Boulal, H.; Bouroumane, F.; Hamidi, M.; Barkani, J.; Abarkan, M. Exploring data augmentation for Amazigh speech recognition with convolutional neural networks. Int. J. Speech Technol. 2025, 28, 53–65. [Google Scholar] [CrossRef]
  9. Telmem, M.; Laaidi, N.; Ghanou, Y.; Hamiane, S.; Satori, H. Comparative study of CNN, LSTM and hybrid CNN-LSTM model in Amazigh speech recognition using spectrogram feature extraction and different gender and age dataset. Int. J. Speech Technol. 2024, 27, 1121–1133. [Google Scholar] [CrossRef]
  10. Telmem, M.; Laaidi, N.; Satori, H. The impact of MFCC, spectrogram, and Mel-Spectrogram on deep learning models for Amazigh speech recognition system. Int. J. Speech Technol. 2025, 1–14. [Google Scholar] [CrossRef]
  11. Guan, B.; Cao, J.; Wang, X.; Wang, Z.; Sui, M.; Wang, Z. Integrated method of deep learning and large language model in speech recognition. In Proceedings of the 2024 IEEE 7th International Conference on Electronic Information and Communication Technology (ICEICT), Xi’an, China, 31 July 2024–2 August 2024; IEEE: New York, NY, USA, 2024; pp. 487–490. [Google Scholar]
  12. Anh, N.M.T.; Sy, T.H. Improving speech recognition with prompt-based contextualized asr and llm-based re-predictor. In Interspeech; International Speech Communication Association (ISCA): Kos, Greece, 2024; Volume 2024, pp. 737–741. [Google Scholar]
  13. Zhou, S.; Xu, Z.; Zhang, M.; Xu, C.; Guo, Y.; Zhan, Z.; Zhang, R. Large language models for disease diagnosis: A scoping review. Npj Artif. Intell. 2025, 1, 9. [Google Scholar] [CrossRef]
  14. Telmem, M.; Ghanou, Y. The convolutional neural networks for Amazigh speech recognition system. TELKOMNIKA 2021, 19, 515–522. [Google Scholar] [CrossRef]
  15. Mukhamadiyev, A.; Mukhiddinov, M.; Khujayarov, I.; Ochilov, M.; Cho, J. Development of language models for continuous Uzbek speech recognition system. Sensors 2023, 23, 1145. [Google Scholar] [CrossRef]
  16. Liu, Z.; Venkateswaran, N.; Le Ferrand, É.; Prud’hommeaux, E. How important is a language model for low-resource ASR. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 206–213. [Google Scholar]
  17. Anoop, C.S.; Ramakrishnan, A.G. CTC-based end-to-end ASR for the low resource Sanskrit language with spectrogram augmentation. In Proceedings of the 2021 National Conference on Communications (NCC), Kanpur, India, 27–30 July 2021; IEEE: New York, NY, USA; pp. 1–6. [Google Scholar]
  18. Mamyrbayev, O.Z.; Oralbekova, D.O.; Alimhan, K.; Nuranbayeva, B.M. Hybrid end-to-end model for Kazakh speech recognition. Int. J. Speech Technol. 2023, 26, 261–270. [Google Scholar] [CrossRef]
  19. Labied, M.; Belangour, A.; Banane, M. Delve deep into End-To-End Automatic Speech Recognition Models. In Proceedings of the 2023 International Seminar on Application for Technology of Information and Communication (iSemantic), Semarang, Indonesia, 16–17 September 2023; IEEE: New York, NY, USA, 2023; pp. 164–169. [Google Scholar]
  20. Mori, D.; Ohta, K.; Nishimura, R.; Ogawa, A.; Kitaoka, N. Advanced language model fusion method for encoder-decoder model in Japanese speech recognition. In Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; IEEE: New York, NY, USA, 2021; pp. 503–510. [Google Scholar]
  21. El Ouahabi, S.; El Ouahabi, S.; Atounti, M. Comparative Study of Amazigh Speech Recognition Systems Based on Different Toolkits and Approaches. In E3S Web of Conferences; EDP Sciences: Les Ulis, France, 2023; Volume 412, p. 01064. [Google Scholar]
  22. Jorge, J.; Gimenez, A.; Silvestre-Cerda, J.A.; Civera, J.; Sanchis, A.; Juan, A. Live streaming speech recognition using deep bidirectional LSTM acoustic models and interpolated language models. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 148–161. [Google Scholar] [CrossRef]
  23. Paul, B.; Phadikar, S.; Bera, S.; Dey, T.; Nandi, U. Isolated word recognition based on a hyper-tuned cross-validated CNN-BiLSTM from Mel Frequency Cepstral Coefficients. Multimed. Tools Appl. 2025, 84, 17309–17328. [Google Scholar] [CrossRef]
  24. Ismael, M.K.; Hock, G.C.; Abdulrazzak, H.N. Mathematical Modelling of Engineering Problems. Int. Inf. Eng. Assoc. 2025, 12, 1893–1910. Available online: http://iieta.org/journals/mmep (accessed on 12 January 2020).
  25. Xue, J.; Zheng, T.; Han, J. Exploring attention mechanisms based on summary information for end-to-end automatic speech recognition. Neurocomputing 2021, 465, 514–524. [Google Scholar] [CrossRef]
  26. Alawdi, A. MultiheadSelfAttention vs, Traditional Encoders: A Benchmark Study on Precision and Recall in Tajweed Recognition. In Proceedings of the 2025 5th International Conference on Emerging Smart Technologies and Applications (eSmarTA), Ibb, Yemen, 5–6 August 2025; IEEE: New York, NY, USA, 2025; pp. 1–8. [Google Scholar]
  27. Addarrazi, I.; Zealouk, O.; Satori, H.; Satori, K. The Hmm Based Amazigh Digits Audiovisual Speech Recognition System. Math. Stat. Eng. Appl. 2022, 71, 2261–2278. [Google Scholar]
  28. Ouhnini, A.; Aksasse, B.; Ouanan, M. Towards an automatic speech-to-text transcription system: Amazigh language. Int. J. Adv. Comput. Sci. Appl. 2023, 14. [Google Scholar] [CrossRef]
  29. Satori, H.; ElHaoussi, F. Investigation Amazigh speech recognition using CMU tools. Int. J. Speech Technol. 2014, 17, 235–243. [Google Scholar] [CrossRef]
  30. Amin, N.A.M. Low-Resource Automatic Speech Recognition Domain Adaptation: A Case-Study in Aviation Maintenance. Doctoral Dissertation, Purdue University Graduate School, West Lafayette, IN, USA, 2023. [Google Scholar]
  31. Samin, A.M.; Kobir, M.H.; Kibria, S.; Rahman, M.S. Deep learning based large vocabulary continuous speech recognition of an under-resourced language Bangladeshi Bangla. Acoust. Sci. Technol. 2021, 42, 252–260. [Google Scholar] [CrossRef]
  32. Oh, J.; Cho, E.; Kim, J.H. Integration of WFST Language Model in Pre-trained Korean E2E ASR Model. KSII Trans. Internet Inf. Syst. 2024, 18, 1692–1705. [Google Scholar]
Figure 1. The proposed CNN-BiLSTM-CTC with attention architecture for Amazigh speech recognition.
Figure 2. Training results of the CNN-BiLSTM-CTC acoustic model for Amazigh ASR.
Figure 3. Evolution of training loss for the RNN-LM.
Table 1. Amazigh alphabet and the corresponding Tifinagh letters with typical syllable patterns used in this study.
Tifinagh | Latin | Word | Syllable pattern
ⴰ | a | yaa | CVV
ⴱ | b | yab | CVC
ⴳ | g | yag | CVC
ⴳⵯ | gw | yagw | CVCC
ⴷ | d | yad | CVC
ⴹ | ḍ | yadd | CVCC
ⴻ | e | yey | CVC
ⴼ | f | yaf | CVC
ⴽ | k | yak | CVC
ⴽⵯ | kw | yakw | CVCC
ⵀ | h | yah | CVC
ⵃ | ḥ | yahh | CVCC
ⵄ | ɛ | yaε | CVC
ⵅ | x | yax | CVCC
ⵇ | q | yaq | CVC
ⵉ | i | yi | CVC
ⵊ | j | yaj | CVC
ⵍ | l | yal | CVC
ⵎ | m | yam | CVC
ⵏ | n | yan | CVC
ⵓ | u | you | CVV
ⵔ | r | yar | CVC
ⵕ | ṛ | yarr | CVCC
ⵖ | γ | yiγ | CVC
ⵙ | s | yas | CVC
ⵚ | ṣ | yass | CVCC
ⵛ | c | yach | CVCC
ⵜ | t | yat | CVC
ⵟ | ṭ | yatt | CVCC
ⵡ | w | yaw | CVC
ⵢ | y | yay | CVC
ⵣ | z | yaz | CVC
ⵥ | ẓ | yazz | CVCC
Table 2. Recording parameters used for the preparation of the corpus.
Parameter | Value
Total number of audio files | 21,500
Base duration | 2 h 18 min
Sampling | 16 kHz, 16 bits
Wave format | Mono, WAV
Corpus | 33-letter Amazigh alphabet
Speakers | 50 (50% male and 50% female)
Table 3. RNN-LM and training configuration.
Parameter | Value
Number of LSTM layers | 1
Embedding dimension | 64
Classifier | Fully connected layer
Dropout | p = 0.3 (after LSTM)
Optimizer | AdamW
Initial learning rate | 0.0005
Weight decay | 0.01
β1, β2 | 0.9/0.98
Scheduler | Cosine annealing
T_max | 100 epochs
Minimum learning rate | η_min = 5 × 10−6
Loss function | Cross-entropy with label smoothing
Label smoothing | α = 0.1
Early stopping | Patience = 20 epochs
Training data | The Amazigh lexicon (33 words)
Data augmentation | Truncation, subsequence extraction, repetition
Augmentation factor | ≈×15
Convergence | 67 epochs
Final validation loss | 0.0234
Table 4. Amazigh lexicon with transcriptions based on the 33-letter Amazigh alphabet proposed by IRCAM.
Word | Transcription | Word | Transcription | Word | Transcription
yaa | y-aa | yagh | y-ae | yar | y-ae-r
yab | y-ae-b | yagw | y-ae-g-w | yarr | y-ae-r
yach | y-ae-ch | yah | y-aa | yas | y-ae-z
yad | y-ae-d | yahh | y-ae-hh | yass | y-ae-s
yadd | y-ae-d | yaj | y-ae-jh | yat | y-ae-t
yag | y-ae-g | yak | y-ae-k | yatt | y-ae-t
yakh | y-ae-k | yal | y-ae-l | yaw | y-ao
yakw | y-ae-k-w | yam | y-ae-m | yay | y-ey
yan | y-ae-n | yaq | y-ae-k | yaz | y-ae-z
yazz | y-ae-z | yef | y-eh-f | yey | y-iy
yi | y-iy | you | y-uw | ya | y-aa
Table 5. Comparative performance results across different decoding techniques.
Decoding Method | Greedy | Beam Search | Beam + 4-Gram LM | Beam + RNN-LM
PER | 0.0279 | 0.0273 | 0.0268 | 0.0277
Relative Improvement | Baseline | 2.30% | 4.00% | 0.9%
Table 6. Word-level recognition results using different decoding strategies.
Word | Greedy | Beam | N-Gram | RNN-LM
ya | 0.0208 | 0.0208 | 0.0208 | 0.0208
yaa | 0.0312 | 0.0312 | 0.0417 | 0.0417
yab | 0.0417 | 0.0417 | 0.0417 | 0.0417
yach | 0.0069 | 0.0069 | 0 | 0.0069
yad | 0.0486 | 0.0486 | 0.0486 | 0.0417
yadd | 0.0069 | 0.0069 | 0.0069 | 0.0069
yag | 0.0417 | 0.0347 | 0.0347 | 0.0417
yagh | 0.0104 | 0.0104 | 0.0104 | 0.0104
yagw | 0.0365 | 0.0365 | 0.0312 | 0.0469
yah | 0.0417 | 0.0417 | 0.0417 | 0.0521
yahh | 0.0417 | 0.0417 | 0.0347 | 0.0347
yaj | 0.0417 | 0.0417 | 0.0417 | 0.0417
yak | 0.0486 | 0.0347 | 0.0347 | 0.0208
yakh | 0 | 0 | 0 | 0
yakw | 0.0312 | 0.0312 | 0.0312 | 0.0365
yal | 0.0417 | 0.0417 | 0.0347 | 0.0347
yam | 0.0486 | 0.0486 | 0.0486 | 0.0486
yan | 0.0972 | 0.0972 | 0.0972 | 0.0972
yaq | 0.0278 | 0.0278 | 0.0278 | 0.0278
yar | 0.0347 | 0.0347 | 0.0347 | 0.0347
yarr | 0.0069 | 0.0069 | 0.0069 | 0.0069
yas | 0.0347 | 0.0347 | 0.0347 | 0.0208
yass | 0.0278 | 0.0278 | 0.0278 | 0.0417
yat | 0.0069 | 0.0069 | 0.0069 | 0.0069
yatt | 0.0278 | 0.0278 | 0.0278 | 0.0278
yaw | 0.0312 | 0.0312 | 0.0312 | 0.0312
yay | 0.0312 | 0.0312 | 0.0312 | 0.0521
yaz | 0.0139 | 0.0139 | 0.0139 | 0.0139
yazz | 0.0208 | 0.0208 | 0.0208 | 0.0139
yef | 0 | 0 | 0 | 0
yey | 0.0208 | 0.0208 | 0.0208 | 0.0104
yi | 0 | 0 | 0 | 0
you | 0 | 0 | 0 | 0
Table 7. Error type analysis by language model.
Error Type | N-gram (Improvement %) | RNN-LM (Improvement %)
Insertion | yach (100%), yahh (16.7%), yal (16.7%) | yahh (16.7%), yal (16.7%)
Substitution | yagw (14.3%), yag (16.7%) | yak (57.1%), yas (40.0%), yey (50.0%), yazz (33.3%)
Deletion | - | yad (14.3%)
Table 8. Average PER performance for syllabic structure.
Structure | Words | PER Greedy | PER Beam | PER N-Gram | PER RNN-LM | Error Reduction Range
CVV | 2 | 0.0156 | 0.0156 | 0.0208 | 0.0208 | ~0% (no gain)
CVC | 22 | 0.0303 | 0.0297 | 0.0292 | 0.0303 | 3.6%
CVCC | 9 | 0.0247 | 0.0244 | 0.0228 | 0.0236 | 7.7%
Table 9. Comparative analysis with State-of-the-Art Studies.
Reference | Studied Language | Models Used | Accuracy %
Samin et al. [31] | Bangladeshi Bangla | CNN-CTC + 5-gram LM | 86.07%
Oh et al. [32] | Korean speech | E2E model | 99.34%
Proposed work | 33-letter Amazigh | CNN-BiLSTM-CTC model | 97.32%
