1. Introduction
Automatic Speech Recognition (ASR) has achieved remarkable progress in recent years for well-resourced languages like English. However, developing high-accuracy speech recognition systems for morphologically complex and under-resourced languages remains a significant challenge. Arabic, one of the world's most widely spoken languages, exemplifies these difficulties: its complex morphology, flexible word order, and dependence on diacritical marks create major obstacles for ASR development. Diacritical marks are particularly challenging because they are essential for resolving ambiguity and indicating grammatical relationships [1,2].
The Arabic language is spoken by over 400 million individuals in more than 20 nations [3]. It has three main varieties: Classical Arabic, used in the Quran and Islamic literature; Modern Standard Arabic (MSA), which is the formal written and spoken language throughout the Arab world; and Dialectal Arabic, consisting of various regional dialects used for everyday conversation. The Arabic writing system is largely consonantal, with short vowels generally absent in standard text, which leads to significant ambiguity that native speakers interpret through contextual understanding [4].
MSA refers to Arabic text without diacritical marks, as used in newspapers, books, and digital content. Diacritical Arabic, in contrast, represents the same MSA language but enhanced with explicit diacritical marks that indicate short vowels and other phonetic features [2,5].
The diacritical marks used in MSA are listed in Table 1 [5]. For instance, the non-diacritical word كتب in MSA can be read as كَتَبَ (kataba—he wrote), كُتُب (kutub—books), or كُتِبَ (kutiba—it was written). Diacritics can therefore change both the pronunciation and the meaning of Arabic words. In non-diacritical MSA, a single written form can have multiple interpretations, which introduces ambiguity and increases the computational effort required to process the text automatically. Native Arabic speakers use context to resolve these ambiguities, whereas ASR systems depend on diacritics to transcribe speech accurately. Diacritics are important in applications that require high accuracy, such as educational tools, language learning platforms, and voice-controlled systems [6,7,8].
Several factors make diacritical Arabic ASR challenging, including the scarcity of diacritical Arabic ASR datasets, the computational complexity, and the need for strategies that capture both linguistic structures and vowelization patterns efficiently [1,2].
Research efforts in Arabic ASR have traditionally relied on conventional feature extraction techniques such as log-mel spectrograms, combined with statistical models like Hidden Markov Models (HMMs) or, more recently, deep neural networks [9,10,11,12,13]. In recent years, with the advancements in deep learning, end-to-end (E2E) architectures based on Recurrent Neural Networks [14,15,16,17,18,19,20,21,22,23] and Transformers [24] have significantly improved recognition accuracy. Notably, models like XLSR [25] and HuBERT [26] leverage self-supervised pretraining on vast amounts of unlabeled speech data, enabling the extraction of highly rich and contextualized speech representations [5].
The study in [5] fine-tuned a pretrained model, XLSR, and made a significant advancement in diacritical Arabic ASR. Utilizing transfer learning and data augmentation, XLSR showed remarkable results. Although these results demonstrate that Transformers can be successfully adapted for this purpose, many methods depend on either extensive pretraining datasets or direct E2E training with diacritical data, which may not fully exploit the hierarchical connection between MSA and diacritical Arabic.
Transfer learning has become a key paradigm in modern ASR, especially given its success in enabling models trained on high-resource languages or massive unlabeled corpora to generalize to under-resourced languages with only limited fine-tuning. A recent survey [27] highlights how such techniques help models generalize better by initializing lower layers with rich acoustic-phonetic representations and then adapting higher layers to language-specific phenomena. Meanwhile, the rise of Transformer-based and large language model (LLM) architectures in ASR has shifted the field toward encoder–decoder frameworks and massive self-supervised pretraining. However, these models often demand heavy computational resources. Our work seeks to bring the power of transfer learning and Transformer efficiency into reach for diacritical Arabic ASR by designing a lightweight, effective architecture that performs well without the cost of full-scale pretraining [27,28].
This paper presents a lightweight Arabic Automatic Speech Recognition system that predicts diacritical Arabic using a two-stage training approach: first, training a Transformer-based architecture enhanced with RPE [29,30] within a CTC [31] framework on an MSA dataset, followed by a fine-tuning stage that uses a diacritical Arabic dataset. This methodology leverages the foundational knowledge learned from MSA to enhance diacritical prediction performance.
While both this work and XLSR [5] employ a similar two-stage fine-tuning approach (MSA pre-training followed by diacritical Arabic fine-tuning), the key differences lie in computational scale and architecture. XLSR uses a massive 300 M+ parameter model with extensive multilingual pre-training on 53 languages, requiring weeks of training on GPU clusters. Our approach employs a lightweight 14 M parameter transformer encoder with relative positional encoding (RPE) and CTC loss, eliminating the need for a decoder component. This targeted architecture focuses specifically on Arabic without multilingual pre-training, requiring only 15 h of training on a single GPU. The encoder-only design with CTC achieves competitive performance while dramatically reducing computational requirements and improving accessibility for resource-constrained researchers.
Our training approach moves from undiacritized MSA to fully diacritized Arabic that effectively captures both the underlying linguistic structure and the specific vowelization patterns. RPE enhances the model’s ability to capture long-range temporal dependencies in speech sequences, while the CTC framework enables training without explicit alignment data.
The limitations of large pretrained models highlight an important research gap: Can effective diacritical Arabic ASR systems be built from scratch, without relying on extensive pretraining? Addressing this question is motivated by the realities of resource-limited contexts, initiatives for language preservation, and the need for transparent, lightweight models that can be easily adapted and deployed on devices with limited capabilities.
In summary, the key contributions of this work are:
Novel Architectural Design: Propose an encoder-only Transformer architecture with RPE and CTC designed for diacritical ASR that captures long-range temporal dependencies in speech sequences and enables efficient sequence alignment without explicit alignment data.
Efficient Two-Stage Transfer Learning Implementation: Demonstrate how MSA-to-diacritical transfer learning can be effectively implemented in a lightweight architecture, achieving competitive performance with significantly reduced computational requirements.
Efficiency–Performance Balance: We introduce a lightweight model with only 14 M parameters (vs. 300 M reported in [5]) that achieves a WER of 22.01%. While this is higher than the 12.17% WER reported in [5], it substantially outperforms traditional approaches, highlighting a practical trade-off between efficiency and accuracy.
2. Diacritical Arabic ASR
Despite the importance of diacritical Arabic in various applications, research in diacritical Arabic ASR remains underexplored, with only a limited number of studies addressing this challenging domain. This section provides a comprehensive review of the existing literature, which encompasses most of the research conducted in diacritical Arabic ASR to date. The scarcity of work in this area highlights both the complexity of the problem and the significant research opportunities that remain largely untapped.
Among the few research groups that have ventured into this domain, several have conducted comparative analyses between diacritized and non-diacritized Arabic speech recognition systems. In one of these studies, Abed et al. [32] developed five distinct models using identical corpora with and without diacritical marks, and found that incorporating diacritical information increased the WER by 0.59% to 3.29%. In contrast, Al-Anzi et al. [33,34] presented evidence supporting the beneficial effects of diacritics on recognition performance.
The CMU Sphinx toolkit [35] has been extensively utilized in diacritical Arabic ASR research. Al-Anzi et al. [33] achieved 63.8% accuracy when processing diacritized text, while non-diacritized versions reached 76.4% accuracy. In a follow-up investigation, Al-Anzi et al. [32] observed that although diacritics enhanced phoneme-level recognition capabilities, the overall system accuracy for diacritical processing (69.1%) remained below that of non-diacritical systems (81.2%). This performance gap was attributed to the increased computational complexity introduced by diacritic variability.
Given the limited research landscape, innovative approaches to phonetic representation and linguistic modeling represent particularly valuable contributions to this nascent field. Among the few studies exploring novel methodologies, AbuZeina et al. [36] incorporated sophisticated linguistic frameworks, including parts-of-speech tagging methodologies, to generate contextually appropriate word representations. This strategy resulted in a notable 2.39% reduction in WER when evaluated on Modern Standard Arabic datasets. In another investigation of alternative representations, Boumehdi et al. [37] proposed a novel semi-syllabic representation scheme that integrated diacritical information by treating Sukun and Shadda as distinct phonemic units. Their configuration-based approach demonstrated superior recognition performance, particularly when multiple linguistic representations were combined.
Advanced feature extraction methodologies have been explored by only a handful of studies. Ding et al. [38] implemented MFCC [39] and HMM-based [40] processing techniques on a limited diacritical Arabic vocabulary, achieving impressive results with 7.08% WER and 92.92% recognition accuracy. Among the few researchers pursuing more sophisticated feature extraction approaches, Abed et al. [32] and Alsayadi et al. [41] employed Linear Discriminant Analysis (LDA) [42], Maximum Likelihood Linear Transform (MLLT) [43], and feature-space Maximum Likelihood Linear Regression (fMLLR) [44]. These techniques proved particularly effective for Gaussian Mixture Model (GMM) [45]-based architectures, representing some of the most advanced work in this limited domain.
The exploration of E2E learning paradigms for diacritical Arabic ASR has been extremely limited, with only one notable study pioneering this approach. Alsayadi et al. [41] investigated the combination of MFCC and filter bank (FBANK) features [46] with advanced neural architectures, including CNN-LSTM [47] and CTC-attention [48] models. Their experimental results demonstrated that CNN-LSTM architectures with attention mechanisms surpassed traditional ASR systems by 5.24% and outperformed joint CTC-attention approaches by 2.62%, representing the sole investigation of modern neural approaches in this underexplored domain.
The review of existing studies underscores a critical research gap in diacritical Arabic ASR, indicating the need for further investigation. The complexity inherent in Arabic speech recognition stems from multiple linguistic factors, including rich morphological structures, extensive lexical variation, and constrained availability of diacritized training materials [49,50]. Computational linguistics research focused on Arabic remains limited compared to other languages [9,10,11,12]. The scarcity of studies specifically addressing diacritical Arabic ASR is particularly striking—while substantial advances have been made in general Arabic ASR technology, diacritical Arabic recognition has received minimal attention from the research community.
Given this significant research gap and the promising results of transformer architectures in related domains, recent work has begun exploring transformer models for diacritical Arabic ASR. The emergence of transformer-based approaches represents a paradigm shift from traditional neural architectures toward more sophisticated attention mechanisms capable of capturing long-range dependencies in speech sequences. Initial investigations in this direction leveraged XLSR-based approaches that achieved promising recognition performance, representing the first steps in applying transformer technology to this challenging domain.
The study in [5] introduced two novel transformer-based models employing strategic architectural modifications and transfer learning techniques. The DAASR 1 model implemented direct fine-tuning of the XLSR-53 foundation model on diacritical Arabic data, while DAASR 2 adopted a two-stage transfer learning approach, initially fine-tuning on MSA before adapting to diacritical Arabic recognition tasks. A key innovation in that work involved modifying the XLSR tokenizer to fully support diacritical mark generation, enabling properly diacritized Arabic transcriptions rather than standard MSA outputs. The evaluation on the SASSC dataset demonstrated substantial performance improvements, with DAASR 1 and DAASR 2 achieving WERs of 12.17% and 14%, respectively, representing gains of more than 7% over previous state-of-the-art methods. Furthermore, the investigation of hybrid data augmentation techniques—combining speed adjustment, pitch shifting, and volume modification—yielded additional performance gains, with the parallel hybrid approach achieving an optimal WER of 12.17%. While XLSR-based models achieve high recognition accuracy and set a new performance benchmark, they also reveal the growing need for computationally efficient solutions in transformer-based diacritical Arabic ASR. In this work, we introduce a novel transformer-based approach that integrates relative positional encoding (RPE) and replaces the traditional decoder with connectionist temporal classification (CTC). Our design reduces complexity by using fewer parameters, enables faster training, and requires less memory. As a result, the proposed model is more practical for resource-constrained environments while maintaining competitive performance.
We propose a lightweight transformer for diacritical Arabic ASR that provides a good balance between accuracy and computational cost. In this study, computational cost is measured in terms of training time and the number of GPUs used, without considering detailed hardware-level metrics such as memory usage or floating-point operations per second. Additionally, inference efficiency is reported separately in terms of latency and real-time factor. This represents an important advancement in making transformer-based diacritical Arabic ASR attainable where computational resources are limited.
3. Proposed Speech Transformer
3.1. Overall Design
The proposed architecture for Arabic ASR with diacritics is illustrated in Figure 1. The model follows an encoder-only Transformer architecture that processes log mel-spectrograms and outputs character-level predictions through CTC. A linear projection layer maps the log mel-spectrograms to the model's hidden dimension, followed by layer normalization and dropout for regularization. RPE replaces absolute positional encoding to better capture long-range dependencies in speech sequences, and a CTC head replaces the traditional Transformer decoder for efficient sequence-to-sequence alignment.
Each Transformer encoder block is composed of two main sub-layers: a multi-head attention (MHA) layer and a feedforward layer, connected via residual connections. RPE replaces the absolute positional embedding (APE) to capture long-range dependencies. Layer normalization and dropout are added for regularization to avoid overfitting.
3.2. Input Processing and Feature Projection
The input log mel-spectrogram X ∈ R^(T × 80) is first projected through a linear transformation to the hidden dimension of the model:
H_proj = X W_in + b_in,
where W_in ∈ R^(80 × d_model) and b_in ∈ R^(d_model). Layer normalization and dropout (p = 0.2) are applied for regularization:
H_0 = Dropout(LayerNorm(H_proj)).
A slightly higher dropout rate (p = 0.2) is used only at the input projection layer to mitigate overfitting on raw acoustic features, which typically contain more noise variability.
Within the Transformer encoder blocks, a lower dropout (p = 0.1) is applied, as confirmed by the ablation study, since moderate regularization preserves representational capacity while ensuring stable convergence.
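For concreteness, the following PyTorch sketch mirrors this input stage; the module name `InputProjection` and the default dimensions are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class InputProjection(nn.Module):
    """Projects 80-dim log mel frames to d_model, then applies LayerNorm and dropout."""
    def __init__(self, n_mels: int = 80, d_model: int = 318, dropout: float = 0.2):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)   # X W_in + b_in
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)          # higher dropout only at the input stage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, 80) log mel-spectrogram frames
        return self.drop(self.norm(self.proj(x)))

# Example: a 3 s clip at a 10 ms hop gives roughly 300 frames.
feats = torch.randn(2, 300, 80)
h0 = InputProjection()(feats)                    # (2, 300, 318)
```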
3.3. Encoder Layer Structure
The encoder architecture consists of L stacked transformer layers, each containing the key components detailed below. The encoder processes log mel-spectrogram inputs through a linear projection layer, followed by L identical transformer encoder blocks. Each block comprises MHA with relative positional encoding and a position-wise feedforward network, with residual connections, layer normalization applied to each sub-layer, and dropout for regularization.
3.3.1. Relative Positional Encoding (RPE)
The APE used by conventional Transformer architectures assigns fixed position embeddings according to the absolute location of each element in the sequence. RPE, however, works better for ASR tasks, particularly in Arabic, because diacritical Arabic depends on contextual relationships rather than absolute placement. RPE captures the relative distance between elements and allows the model to understand temporal relationships regardless of absolute sequence position [27,28].
The standard self-attention mechanism computes the attention logits as:
e_ij^h = (q_i^h · k_j^h) / √d_k.
RPE modifies this by incorporating a relative position bias:
e_ij^h = (q_i^h · k_j^h + q_i^h · r_(i−j)) / √d_k,
where e_ij^h is the attention logit between positions i and j for head h, q_i^h is the query vector for head h, k_j^h is the key vector for head h, and r_(i−j) is a learnable relative position embedding for keys, representing the distance between positions i and j.
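A minimal single-head PyTorch sketch of this relative-position attention is given below, assuming Shaw-style learnable key embeddings clipped to a maximum relative distance; `max_rel_dist` and all class/variable names are illustrative, since the paper does not specify a clipping window.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    """Single-head self-attention with learnable relative key embeddings r_(i-j)."""
    def __init__(self, d_model: int, max_rel_dist: int = 64):
        super().__init__()
        self.d_model = d_model
        self.max_rel_dist = max_rel_dist
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One embedding per clipped relative distance in [-max_rel_dist, +max_rel_dist].
        self.rel_k = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model)
        T = x.size(1)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Content term: q_i . k_j
        content = torch.matmul(q, k.transpose(1, 2))                     # (B, T, T)
        # Relative term: q_i . r_(i-j), with distances clipped to the window.
        pos = torch.arange(T, device=x.device)
        rel = (pos[:, None] - pos[None, :]).clamp(
            -self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist   # (T, T)
        r = self.rel_k(rel)                                              # (T, T, d_model)
        rel_logits = torch.einsum("btd,tsd->bts", q, r)                  # (B, T, T)
        attn = F.softmax((content + rel_logits) / self.d_model ** 0.5, dim=-1)
        return torch.matmul(attn, v)

attn = RelPosSelfAttention(d_model=318)
y = attn(torch.randn(2, 300, 318))                                      # (2, 300, 318)
```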
3.3.2. Multi-Head Self-Attention (MHA) with RPE
The MHA mechanism forms the core of our encoder, enhanced with RPE to improve temporal modeling. Each attention head computes attention weights between all pairs of positions in the input sequence, modified by relative position bias terms. Building upon the RPE formulation from Section 3.3.1, the MHA mechanism computes multiple attention heads in parallel, applying the same relative position embeddings in each head:
head_h = softmax(E^h) V^h,
where E^h = [e_ij^h] is the matrix of RPE-modified attention logits for head h and V^h is the corresponding value projection. Each encoder layer then combines the heads:
MHA(X) = Concat(head_1, …, head_H) W^O,
where W^O ∈ R^(d_model × d_model) is a learned output projection.
3.3.3. Feedforward Network
Each encoder layer includes a position-wise feedforward network that consists of two linear transformations with Swish activation [51]. The feedforward network expands the representation to a higher dimension (6× the model dimension, based on our ablation study), applies a non-linear transformation, and projects back to the original dimension:
FFN(x) = Swish(x W_1 + b_1) W_2 + b_2,
where W_1 ∈ R^(d_model × 6d_model), W_2 ∈ R^(6d_model × d_model), and Swish(x) = x · σ(x).
Based on our ablation study (Section 4.2), both the 6× expansion ratio and Swish activation consistently outperformed alternative configurations for diacritical Arabic speech recognition. This component provides the model with the capacity to learn complex non-linear mappings between acoustic features and linguistic representations. The larger expansion ratio (6× vs. the typical 4×) and the smooth, differentiable nature of Swish activation prove particularly effective for Arabic ASR, enabling better gradient flow when learning the intricate relationships between spectral features and diacritical patterns.
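A compact sketch of this feedforward block, using PyTorch's SiLU (equivalent to Swish) activation, could look as follows; the class name and defaults are illustrative.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN with a 6x expansion and Swish (SiLU) activation."""
    def __init__(self, d_model: int = 318, expansion: int = 6, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # x W1 + b1
            nn.SiLU(),                                # Swish(x) = x * sigmoid(x)
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),  # project back to d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ffn = FeedForward()
out = ffn(torch.randn(2, 300, 318))                   # (2, 300, 318)
```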
3.3.4. Residual Connection and Layer Normalization
Each sub-layer (attention and feedforward) employs residual connections with post-layer normalization:
Output = LayerNorm(x + Sublayer(x)).
Each sub-layer is followed by layer normalization. For diacritical Arabic speech recognition, this post-norm setup performs better than the more popular pre-norm technique, according to our ablation study (Section 4.2). Layer normalization is better suited than batch normalization for sequence modeling tasks where batch sizes may change, as it computes statistics across the feature dimension for each individual sample.
In the context of Arabic speech recognition, the post-norm configuration helps maintain stable activations across different speakers, dialects, and acoustic conditions while providing superior gradient flow during training. This is useful for diacritical Arabic ASR, where the model learns to distinguish phonetic variations that indicate different vowelization patterns.
Finally, the complete encoder layer l processes its input H_(l−1) as:
A_l = LayerNorm(H_(l−1) + MHA(H_(l−1))),
H_l = LayerNorm(A_l + FFN(A_l)).
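The sketch below illustrates one post-norm encoder block in PyTorch; for brevity it uses the standard `nn.MultiheadAttention` in place of the RPE attention of Section 3.3.1, so it is a structural illustration under that simplification rather than the exact model.

```python
import torch
import torch.nn as nn

class PostNormEncoderLayer(nn.Module):
    """One encoder block: residual connections with post-LayerNorm around MHA and FFN."""
    def __init__(self, d_model: int = 318, n_heads: int = 6,
                 expansion: int = 6, dropout: float = 0.1):
        super().__init__()
        # Standard multi-head attention stands in for the RPE attention of Section 3.3.1.
        self.mha = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                         batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, expansion * d_model), nn.SiLU(),
            nn.Dropout(dropout), nn.Linear(expansion * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.mha(x, x, x, need_weights=False)
        x = self.norm1(x + self.drop(attn_out))      # A_l = LayerNorm(H + MHA(H))
        x = self.norm2(x + self.drop(self.ffn(x)))   # H_l = LayerNorm(A_l + FFN(A_l))
        return x

layer = PostNormEncoderLayer()
out = layer(torch.randn(2, 300, 318))                # (2, 300, 318)
```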
3.4. Connectionist Temporal Classification (CTC) Head and Output Layer
A CTC [29] head is used to align the sequence of spectrograms with its corresponding character sequence instead of a decoder. The CTC head consists of a linear projection, which translates the hidden representations from the encoder into a vocabulary-sized output that includes all Arabic characters and their variations, diacritical marks, and a special blank symbol. Without the need for explicit alignment between audio frames and characters, CTC enables the model to predict character sequences of any length from input sequences. Since it does not require frame-level alignment but only sentence-level transcriptions, this is beneficial for Arabic ASR, where precisely aligned annotated data is limited. The model can automatically learn the best character-to-frame mappings since the CTC loss function marginalizes over all potential alignments between the input and target sequences. During inference, the most likely character sequence is decoded from the CTC output probabilities using a greedy decoding method. After decoding, a post-processing step ensures valid Arabic orthography. Each predicted diacritic is appended to the immediately preceding base character, maintaining proper character–diacritic alignment. Consecutive diacritics without an intervening base letter are filtered to retain only the last valid one, and any isolated or non-Arabic symbols are removed. This step guarantees that the final output text follows correct Arabic writing conventions and avoids invalid sequences.
The CTC head converts encoder outputs to character-level predictions as:
P = softmax(H_L W_out + b_out),
where W_out ∈ R^(d_model × (|V|+1)), |V| = 48 (vocabulary size), and the +1 accounts for the CTC blank token.
Vocabulary composition: Arabic letters: 28 basic letters + 8 variations (أ، إ، آ، ة، ى، ء، ؤ، ئ), representing phonemic and orthographic variants rather than positional letter shapes; diacritical marks: 8 marks; and special tokens: <blank>, <pad>, <unk>, <space>.
CTC loss: The CTC loss marginalizes over all valid alignments π between the input sequence X and the target sequence Y:
L_CTC = −log P(Y|X) = −log Σ_(π ∈ ϕ(Y)) ∏_(t=1)^T P(π_t | X),
where ϕ(Y) represents all valid CTC paths that decode to Y, π is an alignment path (one possible character sequence), t is the time-step index, and T is the total number of time steps (the input sequence length).
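The following sketch shows how such a CTC head, loss, and greedy decoder fit together in PyTorch; the tensor shapes, the blank index, and the dummy targets are assumptions for illustration, and the Arabic-specific post-processing step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BLANK = 0                                     # assumed index of the CTC blank token

class CTCHead(nn.Module):
    """Linear projection from d_model to |V|+1 classes (Arabic chars, diacritics, blank)."""
    def __init__(self, d_model: int = 318, vocab_size: int = 48):
        super().__init__()
        self.out = nn.Linear(d_model, vocab_size + 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (B, T, d_model)
        return F.log_softmax(self.out(h), dim=-1)

# Training: nn.CTCLoss expects (T, B, C) log-probs plus sequence lengths.
head = CTCHead()
enc = torch.randn(2, 300, 318)                             # encoder outputs
log_probs = head(enc).transpose(0, 1)                      # (T, B, C)
targets = torch.randint(1, 49, (2, 40))                    # dummy character indices
loss = nn.CTCLoss(blank=BLANK)(log_probs, targets,
                               input_lengths=torch.tensor([300, 300]),
                               target_lengths=torch.tensor([40, 40]))

# Inference: greedy decoding = argmax per frame, collapse repeats, drop blanks.
def greedy_decode(log_probs_bt: torch.Tensor):             # (B, T, C)
    ids = log_probs_bt.argmax(dim=-1)                       # (B, T)
    seqs = []
    for row in ids:
        out, prev = [], BLANK
        for t in row.tolist():
            if t != prev and t != BLANK:
                out.append(t)
            prev = t
        seqs.append(out)
    return seqs

hyps = greedy_decode(head(enc))                             # list of index sequences
```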
3.5. Training Hyperparameters and Implementation Details
All key hyperparameters and preprocessing settings were defined explicitly in the training script and are listed below (a minimal configuration sketch follows the list):
Optimizer: AdamW with β1 = 0.9, β2 = 0.999, ε = 1 × 10−8
Weight decay: 0.01
Gradient clipping: applied at max norm = 1.0
Learning-rate schedule: cosine annealing with Tmax equal to the number of epochs per phase; initial LR: 1 × 10−3 (Phase 1), 3 × 10−4 (Phase 2), 1 × 10−4 (Phase 3)
Batch size: 6
Sampling rate: 16 kHz
Spectrogram parameters: 80 Mel bins, FFT = 512, window length = 400 (25 ms), hop length = 160 (10 ms), fmin = 0, fmax = 8 kHz
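A minimal configuration sketch matching these settings is shown below; the placeholder `model` stands in for the full ASR network, and the per-phase epoch count and dummy loss are illustrative.

```python
import torch
import torchaudio

# Log mel front-end matching the listed settings (16 kHz, 80 bins, 25 ms window, 10 ms hop).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, win_length=400, hop_length=160,
    f_min=0.0, f_max=8_000.0, n_mels=80)

def log_mel(wav: torch.Tensor) -> torch.Tensor:
    return torch.log(mel(wav) + 1e-6)                  # (..., 80, frames)

frames = log_mel(torch.randn(1, 16_000 * 3))           # 3 s clip -> (1, 80, ~301)

# Optimizer and schedule for one training phase (epochs and lr vary per phase).
model = torch.nn.Linear(80, 49)                        # placeholder for the ASR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)
epochs_in_phase = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs_in_phase)

for epoch in range(epochs_in_phase):
    # ... forward pass and CTC loss computation would go here ...
    loss = model(torch.randn(6, 80)).sum()             # dummy loss, batch size 6
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                                   # cosine annealing per epoch
```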
3.6. Transfer Learning
Transfer learning is a core technique in machine learning that allows models to leverage knowledge gained from large-scale pretraining to improve performance on tasks with limited data. It is significant for low-resource languages and specialized domains, where collecting large training datasets is challenging or expensive. Transfer learning means that the knowledge gained from one task is reused effectively for similar tasks. In our case, transfer learning refers to pretraining the model on a large non-diacritical MSA dataset and then fine-tuning it on a diacritical Arabic dataset. Thus, the low-level acoustic features and patterns learned during the pretraining phase are transferred across different languages, speakers, and domains [52].
The effectiveness of transfer learning in ASR stems from the hierarchical nature of speech representations. Lower layers of neural networks learn general acoustic features such as spectral patterns, while higher layers capture more linguistic information. This hierarchical organization enables effective knowledge transfer, where pre-trained lower layers provide a strong foundation for learning task-specific upper layers. Transfer learning has proven especially valuable for cross-lingual speech recognition, where models trained on high-resource languages are adapted for low-resource ones. Research has shown that acoustic-phonetic similarities between languages influence transfer effectiveness, with closely related languages benefiting more than distant language pairs [52,53].
4. Ablation Study
We conducted a comprehensive ablation study using the CommonVoice 6.1 Arabic dataset [54] to systematically assess the effects of various architectural decisions and hyperparameters on the performance of our Transformer-based Arabic ASR system. The experiments required more than 110 GPU hours of training time, covering 30 different tests spread over seven progressive phases. Each phase built upon the best configuration from the previous phase, ensuring that improvements are cumulative and that interactions between architectural components are properly evaluated.
Our progressive methodology began with a baseline configuration of an 8-layer, 6-head, 192-dimensional architecture using ReLU activation, Post-LN normalization, a 0.1 dropout rate, a 4× feedforward expansion ratio, and a 1.5 × 10−4 learning rate. For efficient exploration, each configuration was trained for 40 epochs during the initial evaluation; the best-performing configuration from each phase was carried forward, and the final model was then fully trained for 200 epochs with an early-stopping patience of 20 epochs (for the ablation study only). All architectural configurations and hyperparameter variations examined during the seven ablation study phases are reported in Table 2, and the results are comprehensively visualized in Figure 2.
4.1. Architecture Ablation
In Phase 1, we systematically investigated the transformer architecture by varying the number of layers (L) and attention heads (H), while keeping the model dimension fixed at 192. This phase included evaluation of 4-, 6-, and 8-layer architectures with attention head counts of 4, 6, and 8.
Depth vs. Width Analysis: Our experiments revealed that deeper architectures with moderate width provided optimal performance for ASR, with clear limits to depth scaling (Figure 2a,b). The best balance between model depth and attention width was achieved with the 8L6H configuration, yielding a WER of 86.67%. While 10L6H192D and 10L8H192D both failed completely with 100.00% WER, 10L4H192D achieved 99.78% WER, indicating considerable degradation in deeper designs (Figure 2a). This configuration serves as the baseline setup for later stages.
Parameter Efficiency: We analyzed the relationship between parameter count and performance across all architectural variants. The findings demonstrate distinct failure points and diminishing returns as depth increases. With a reasonable training period of 5.5 h and optimal performance at 3.98 M parameters, 8L6H192D was the most parameter-efficient configuration. In contrast, the deeper 10 L configurations with 4.87 M parameters showed a dramatic performance reduction despite their higher computational cost.
Scaling Patterns: We observed that deeper architectures failed to converge effectively. The 10L4H192D configuration achieved a WER of 99.78% with 4.87 M parameters and 5.8 h of training time, while 10L6H192D and 10L8H192D both failed completely with WER scores of 100.00% (6.9 and 7.6 h of training time, respectively). This implies that the 8L6H layout offers the best architectural balance, as shown in Figure 2a,b.
In Phase 2, we explored the effect of model dimensionality by testing values of 132, 252, and 318, chosen to align with the optimal head count. The results indicate that higher dimensionality improves performance (Figure 2c). The optimal model dimension of 318 achieved a WER of 81.03%, a 6.51% improvement over the baseline 192-dimensional model. While the 132D model performed poorly with a WER of 99.54%, the 252D model showed moderate performance with a WER of 90.63%. This indicates that adequate model capacity is crucial to capture the complexity of Arabic speech.
We compared three activation functions within the feed-forward layers: ReLU (baseline), GELU, and Swish. The experimental results indicate that Swish activation performed best with a WER of 79.78%, followed by the ReLU baseline (Figure 2d). GELU activation failed to converge (WER: 100.00%). The improved performance of Swish can be attributed to its smooth functional form, which facilitates more effective optimization in transformer-based Arabic ASR. The unstable convergence of GELU was likely caused by gradient fluctuations under limited data, as its input-dependent curvature can amplify small training noise. Additionally, the deeper model variants combined with GELU tended to overfit quickly, suggesting that the model capacity exceeded what the low-resource dataset could effectively support.
We evaluated the impact of normalization placement by comparing Post-LayerNorm (baseline) with Pre-LayerNorm configurations. Post-LayerNorm remained optimal with a WER of 79.78% compared to 88.44% for Pre-LayerNorm, showing that Pre-LayerNorm significantly degraded performance (Figure 2e).
4.2. Hyperparameter Ablation
To find the ideal regularization strength, we systematically varied the dropout rate across 0.05, 0.10 (baseline), 0.15, and 0.25.
The experimental results confirmed that the baseline dropout rate remained appropriate (Figure 2f). A dropout rate of 0.1 achieved the best performance with a WER of 79.78%, while higher rates degraded performance: 0.15 yielded a WER of 99.77% and 0.25 a WER of 83.75%. A 0.05 dropout rate resulted in complete training failure. This implies that moderate regularization at 0.1 prevents overfitting while preserving model capacity.
We investigated the impact of feed-forward layer dimensionality by testing ratios of 2×, 3×, 4× (baseline), and 6× relative to the model dimension.
Our findings indicate that higher feed-forward capacity improves performance (Figure 2g). The 6× ratio achieved the best performance with a WER of 79.11%, a 0.84% improvement over the 4× baseline at the cost of an additional 3.24 M parameters (13.68 M in total). The 2× ratio failed to converge, while the 3× ratio showed competitive performance with a WER of 79.64%. These findings underscore the importance of sufficient feed-forward capacity for handling task complexity.
We explored different learning rates (1.4 × 10−4, 1.5 × 10−4 baseline, 2.9 × 10−4, 4.9 × 10−4) to optimize the training dynamics.
The results show that fine-tuning the learning rate provided substantial final improvements (Figure 2h). A learning rate of 1.4 × 10−4 achieved the best convergence, reaching a WER of 73.92%, a 6.56% relative improvement over the previous best. The higher learning rates (2.9 × 10−4, 4.9 × 10−4) led to training instability and convergence failure, while the baseline of 1.5 × 10−4 remained the second-best setting.
4.3. Overall Results
Our progressive ablation study systematically optimized each component of the initial baseline configuration (Figure 2i). We started from 8L6H192D + ReLU + Post-LN + Dropout = 0.1 + FF = 4× + LR = 1.5 × 10−4 (WER: 86.67%) and arrived at the final optimized configuration of 8L6H318D + Swish + Post-LN + Dropout = 0.1 + FF = 6× + LR = 1.4 × 10−4 (WER: 73.92%), a substantial 14.71% relative improvement.
The systematic optimization revealed the relative contribution of each component: the largest improvements came from learning rate adjustment (6.56%) and model dimension scaling (6.51%), followed by activation function optimization (1.54%) and feedforward expansion (0.84%). Normalization and regularization settings remained optimal at their baseline values (Figure 2a–h).
Table 3 summarizes the progressive optimization across all phases and presents the cumulative improvements achieved through systematic hyperparameter tuning.
The study reveals several key insights: moderate depth with sufficient attention heads (8 layers, 6 heads) provides the best performance (Figure 2a,b); expanding the model dimensionality to 318 improved performance despite higher computational costs (Figure 2c); Swish activation outperforms traditional ReLU (Figure 2d); Post-LN normalization and moderate dropout (0.1) are optimal for regularization (Figure 2e,f); expanded feedforward capacity (6×) further enhances performance (Figure 2g); and careful learning rate tuning provides substantial final improvements (Figure 2h). Figure 2i summarizes the progressive improvement.
Computational Efficiency Analysis: We analyzed the trade-offs between model performance and computational requirements across all experimental configurations. The parameter efficiency frontier shows that the final optimized model achieved the best performance-efficiency balance. The architecture (8L6H with 13.68 M parameters in the final configuration) achieved superior performance while requiring reasonable computational resources.
Training time analysis reveals that the best configuration required 10.2 h of training time and achieved convergence with early stopping patience of 10 epochs. Larger dimensionalities showed good returns in performance per additional training hour, which demonstrated practical feasibility for this task and dataset size.
Interpretation of the Configuration
The final configuration (8 layers, 6 heads, 318 model dimensions, and 6× feed-forward ratio) achieved superior accuracy since it effectively models the acoustic and linguistic structure of Arabic. The moderate depth and number of heads (8L6H) give enough context without causing overfitting or instability. The larger model dimension (318) helps the network capture small sound differences such as short vowels and consonant emphasis that define diacritics. The 6× feed-forward ratio improves how the model learns short, detailed sound patterns. These settings work together to find a balance between expressiveness, stability, and efficiency. The result is a lightweight Transformer that is suitable for diacritical Arabic ASR.
5. Training and Results
This section presents the training methodology and experimental evaluation of our proposed diacritical Arabic ASR system. We employ a two-stage training strategy that leverages the linguistic relationship between MSA and diacritical Arabic, followed by the specific optimization configuration and architectural advantages that make our approach particularly effective for this challenging task.
The evaluation involved extensive 10-fold cross-validation experiments. In this setup, the data were partitioned differently in each fold, with separate training, validation, and test splits to ensure a robust assessment of model stability and generalization. No fixed independent test set was used; instead, testing was performed within each fold to obtain an averaged performance across all folds. Three distinct training approaches were compared: direct training on diacritical Arabic datasets, traditional full fine-tuning, and three-phase progressive fine-tuning. In addition, the effectiveness of various data augmentation techniques was examined, and the results were compared with state-of-the-art diacritical Arabic ASR systems. The section concludes with an analysis of computational efficiency, practical deployment implications, and future research directions.
Our findings demonstrate that the three-phase progressive fine-tuning approach combined with data augmentation achieves the best Word Error Rate (WER) of 22.01% on the Diacritical Arabic dataset. This represents a substantial improvement over traditional training methods while maintaining computational efficiency and practical viability for real-world deployment scenarios.
5.1. The Training Data
The limited availability of annotated datasets is one of the main obstacles to developing diacritical Arabic ASR systems. To mitigate this limitation, we employed transfer learning to enhance recognition performance. Specifically, we first trained the model on non-diacritical datasets and then fine-tuned it on a diacritical Arabic dataset. This approach helped compensate for the scarcity of annotated diacritical data.
Non-Diacritical Dataset: Common Voice 6.1 [54] is an MSA corpus containing 128 h of speech.
Diacritical Dataset: SASSC [55] is a diacritical Arabic single-speaker corpus comprising 51,432 diacritized words and more than seven hours of audio recordings.
Dataset Splits and Preprocessing
For Common Voice 6.1, an 80%/20% random split was applied for training–validation and testing. Because the corpus is multi-speaker, speaker overlap may occur between subsets. However, this dataset was used primarily for pretraining, so overlap does not affect the final evaluation.
For the SASSC corpus, which is a single-speaker dataset, the data were divided into 80% training–validation and 20% testing in each fold, ensuring that different test splits were used across folds to evaluate the model’s robustness and generalization performance.
After splitting:
Transcripts were normalized by removing special characters and non-Arabic symbols while retaining diacritical marks.
Audio recordings were used without additional preprocessing (e.g., silence trimming, noise reduction, or amplitude normalization) to preserve natural acoustic variability.
5.2. Training Strategy
We employed a Two-Stage Training Process:
Stage 1—MSA Pre-training: The model was first trained on a Modern Standard Arabic dataset that does not contain diacritics, CommonVoice 6.1.
Stage 2—Diacritical Fine-tuning: The pretrained model was then fine-tuned on the diacritical Arabic dataset (SASSC). Two different fine-tuning strategies were applied:
Full Fine-Tuning: the model obtained from Stage 1 was fully trained on the diacritical Arabic dataset.
Three-phase fine-tuning: To avoid catastrophic forgetting during adaptation, we implemented a gradual three-phase fine-tuning process with systematic parameter unfreezing. This approach preserved acoustic-phonetic representations while enabling vocabulary expansion. In the first phase (10 epochs), only the CTC head was trained, while all encoder layers were frozen using a learning rate of 1 × 10−3. This allowed the CTC head to learn diacritic mappings without disrupting the established feature extraction process. In the second phase (10 epochs), the final two encoder layers and final normalization layer were unfrozen and trained with a learning rate of 3 × 10−4. This strategy preserved audio processing in initial layers while allowing the higher layers to capture linguistic patterns related to diacritics. The final phase involved full model fine-tuning with 100 epochs with a learning rate of 1.5 × 10−4 and early stopping using a patience of 10 epochs.
This approach leverages the linguistic relationship between MSA and diacritical Arabic, enabling the model to learn first basic Arabic phonetic patterns before specializing in diacritical recognition.
The number of epochs and learning rates for the three-phase fine-tuning were determined empirically through preliminary validation experiments. A small portion (10%) of the SASSC training data was reserved as a validation set to monitor convergence speed and generalization stability during pilot runs. We found that 10 epochs with a learning rate of 1 × 10−3 allowed the CTC head to adapt rapidly without overfitting, while another 10 epochs with 3 × 10−4 enabled stable adaptation of the upper encoder layers. The final phase with 100 epochs and 1.5 × 10−4, combined with early stopping (patience = 10 epochs), consistently produced the lowest validation loss and best WER across folds. These hyperparameters therefore represent an optimal balance between convergence rate, stability, and computational efficiency for the diacritical fine-tuning process.
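A sketch of this progressive unfreezing schedule is given below; the attribute names (`encoder.layers`, `final_norm`, `ctc_head`) are hypothetical and would depend on how the model is organized.

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_phase(model: torch.nn.Module, phase: int):
    """Freeze/unfreeze parameters and pick the learning rate for each fine-tuning phase.

    Assumes the model exposes `encoder.layers` (a list of blocks), `final_norm`,
    and `ctc_head`; these attribute names are illustrative.
    """
    set_trainable(model, False)
    if phase == 1:                                   # CTC head only, 10 epochs
        set_trainable(model.ctc_head, True)
        lr = 1e-3
    elif phase == 2:                                 # + last two encoder layers, 10 epochs
        set_trainable(model.ctc_head, True)
        set_trainable(model.final_norm, True)
        for layer in model.encoder.layers[-2:]:
            set_trainable(layer, True)
        lr = 3e-4
    else:                                            # full fine-tuning, 100 epochs
        set_trainable(model, True)
        lr = 1.5e-4
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=0.01)

# Usage (given a suitable `asr_model`): opt = configure_phase(asr_model, phase=1)
```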
5.3. Model Advantages
This architecture offers several advantages that make it well-suited for diacritical Arabic ASR. The incorporation of CTC provides flexibility, as it allows the system to handle alignments implicitly during training [29,56]. Furthermore, the use of RPE improves the model's capacity to capture contextual relationships, as it generalizes to sequences of unseen lengths and facilitates the acquisition of position-relative patterns [27,28].
The two-stage training methodology enhances model robustness and generalization capabilities by leveraging the hierarchical relationship between MSA and diacritical Arabic. Fine-tuning is applied directly to the complete model architecture and further strengthened through a three-phase progressive training strategy that produces superior performance compared to conventional single-phase approaches. This progressive approach prevents knowledge loss by slowing updates to important weights from previous tasks and allows the model to learn diacritical recognition while preserving its foundational MSA knowledge.
5.4. Experimental Evaluation and Performance Analysis
We conducted extensive experiments with 10-fold cross-validation to ensure robust and reliable performance assessment, where three distinct training approaches were evaluated to determine the most effective strategy for diacritical Arabic ASR.
5.4.1. Performance Metrics
The performance of the proposed model is evaluated using the SASSC dataset. Two standard metrics—Word Error Rate (WER) and Character Error Rate (CER)—are employed to assess the accuracy of the models and to guide adjustments when necessary.
Word Error Rate (WER) is a common metric used to evaluate the performance of speech recognition systems. It measures the proportion of incorrectly predicted words compared to the reference transcription. The metric typically ranges from 0 to 1, where lower values indicate higher accuracy. WER is calculated as:
WER = (S + D + I) / N,
where S denotes the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of words in the reference transcript (the sum of correct and incorrect words).
Character Error Rate (CER) is a similar measure computed at the character level, reflecting how well individual characters are recognized. Like WER, CER values range from 0 (perfect match) to 1 (completely different), with lower values indicating better accuracy. CER is computed as:
CER = (S + D + I) / N,
where substitutions, deletions, insertions, and the reference length N are counted at the character level.
To further analyze model performance, Precision, Recall, and the F1-score are calculated to quantify the balance between false positives and false negatives:
Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall),
where TP, FP, and FN represent the counts of true positives, false positives, and false negatives, respectively. High Precision and Recall values indicate accurate and consistent predictions, while the F1-score provides a single measure that balances both.
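For reference, a self-contained implementation of these word- and character-level error rates (not the authors' evaluation script) can be written as follows; diacritics count as characters, so confusing, e.g., a Fatha with a Kasra raises the CER.

```python
def edit_ops(ref, hyp):
    """Levenshtein distance between two token sequences (total of S + D + I)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return edit_ops(r, h) / max(len(r), 1)

def cer(ref: str, hyp: str) -> float:
    return edit_ops(list(ref), list(hyp)) / max(len(ref), 1)

print(wer("كَتَبَ الوَلَدُ", "كُتُب الوَلَدُ"), cer("كَتَبَ الوَلَدُ", "كُتُب الوَلَدُ"))
```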
The performance of the proposed model is compared with that reported in [41], which uses the same dataset.
5.4.2. Performance Comparison of Training Strategies
The experimental results demonstrated significant performance variations across the training strategies. Direct training on the diacritical dataset achieved an average WER of 34.09% and CER of 8.88%, representing the baseline performance when no transfer learning is applied. The traditional full fine-tuning approach, which involves pre-training on non-diacritic data followed by complete model fine-tuning on diacritical data, improved performance to an average WER of 30.15% and CER of 5.35%, corresponding to relative improvements of 11.56% and 39.75%, respectively. In contrast, the three-phase progressive fine-tuning strategy achieved the best performance across all metrics, with an average WER of 23.99% and CER of 4.49%. This represents remarkable improvements of 29.63% in WER and 49.44% in CER compared to direct training, and 20.43% in WER and 16.07% in CER compared to traditional fine-tuning.
Table 4 shows the WER and CER results for the three training strategies.
5.4.3. Effectiveness of Three-Phase Progressive Fine-Tuning
The superior performance of the three-phase approach confirms our hypothesis that gradual unfreezing prevents knowledge loss, adapts to diacritical characteristics effectively, and yields improvements across all folds. For instance, Fold 10 showed a reduction from 39.76% to 27.46% WER when we compared direct training to three-phase fine-tuning, which represents a 30.94% relative improvement.
The CER improvements are particularly significant for diacritical Arabic recognition, as character-level accuracy directly reflects the system’s ability to distinguish between diacritical variants. The reductions in CER across all folds indicate that the three-phase approach enhances the sensitivity of the model to acoustic differences associated with diacritical marks.
5.4.4. Impact of Data Augmentation
Following the best results reported in [5], we applied parallel speed and volume data augmentation techniques. Table 5 shows the WER and CER for the original and augmented datasets.
All folds showed consistent improvements in the data augmentation experiments. The augmented dataset achieved an average WER of 22.01% compared to 23.99% for the original dataset, an 8.25% relative improvement (roughly 2% absolute). Similarly, the CER decreased from 4.49% to 4.17%, a 7.13% relative improvement (about 0.3% absolute). These results indicate that data augmentation provides complementary benefits to the three-phase fine-tuning strategy.
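A simple sketch of such parallel speed and volume perturbation with torchaudio is shown below; the perturbation factors (0.9/1.1 speed, 0.8/1.25 gain) are illustrative, as the exact values used in the experiments are not specified here.

```python
import torch
import torchaudio.functional as AF

def speed_perturb(wav: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    """Approximate a speed change by resampling and then treating the result at the
    original sample rate (this shifts both tempo and pitch)."""
    return AF.resample(wav, orig_freq=sr, new_freq=int(sr / factor))

def volume_perturb(wav: torch.Tensor, gain: float) -> torch.Tensor:
    """Scale the amplitude and clamp to the valid waveform range."""
    return torch.clamp(wav * gain, -1.0, 1.0)

# Parallel augmentation: keep the original clip and add perturbed copies,
# all sharing the same (diacritized) transcript.
wav = torch.randn(1, 16_000 * 3)               # placeholder 3 s waveform at 16 kHz
augmented = [wav] \
    + [speed_perturb(wav, 16_000, f) for f in (0.9, 1.1)] \
    + [volume_perturb(wav, g) for g in (0.8, 1.25)]
```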
5.5. Comparison with State-of-the-Art Models
To evaluate the effectiveness of our proposed approach, we compared our best-performing model (three-phase fine-tuning with data augmentation) against existing state-of-the-art diacritical Arabic ASR systems.
Table 6 presents the comprehensive comparison results for the SASSC dataset, with WER values obtained from the respective publications. The results of a comparison with other diacritical Arabic ASR systems for the SASSC corpus are shown in Figure 3.
Our proposed method achieved competitive performance with a WER of 22.01%, while also providing distinct methodological advantages. The comparison reveals important insights about different training paradigms and their impact on performance.
5.5.1. Cross-Fold Evaluation Metrics
To provide a broader evaluation beyond WER, the model’s performance was also analyzed across all 10 folds using precision, recall, and F1-score metrics. These measures assess how consistently the model distinguishes diacritical and non-diacritical characters throughout the dataset.
Table 7 reports the average performance consistency of the proposed model across all folds. The results show an overall mean F1-score of 92.99%, indicating strong generalization and stability.
5.5.2. Performance Analysis in Context
Our model achieved a WER of 22.01%, demonstrating strong performance compared to other diacritical Arabic ASR models. While the model does not achieve the lowest WER compared to XLSR models (12.17% WER and 21% WER), this comparison requires careful contextualization regarding computational resources and training data requirements.
The XLSR models require far more computational resources, data, and training time. XLSR was pre-trained on 53 different languages using massive datasets with more than 300 M parameters, over weeks or months on high-end GPU clusters. In contrast, our lightweight model has only 14 M parameters and required around 15 h of total training time on a single Google Colab T4 GPU (12 h for MSA pre-training and 3 h for diacritical fine-tuning). Despite this dramatic difference in computational resources and training time, our model (22.01% WER) performs almost as well as the single-stage XLSR model (21.0% WER).
This small performance gap is remarkable given the resource efficiency. Our model achieves competitive results with over 100 times less training time and significantly lower computational requirements, making it more practical for researchers and developers with limited access to GPU clusters. The ability to train a competitive diacritical Arabic ASR model in under a day on freely available hardware represents a substantial advancement in accessibility.
Our model outperforms traditional methods, surpassing GMM systems (33.7–39.7% WER), HMM approaches (31.4% WER), and DNN models (34.4% WER) by large margins. Compared to other modern E2E systems, it also performs well against CNN-LSTM models (28.4% WER) and joint CTC-attention systems (31.1% WER) while offering superior training efficiency and deployment feasibility.
5.6. Computational Efficiency and Practical Implications
Our results establish a new benchmark for diacritical Arabic ASR systems that do not rely on extensively pretrained models. The achieved WER of 22.01% represents the best performance obtained with a transfer learning approach from MSA to diacritical Arabic. This makes our method particularly relevant for scenarios where computational resources for pre-training are limited. These findings demonstrate that effective diacritical Arabic speech recognition can be achieved through architectural innovations and progressive training strategies, offering a practical alternative to expensive pretraining methods while maintaining competitive accuracy. Our method ensures transparency and reproducibility through a structured two-stage training process, delivering competitive results without requiring extensive computational resources.
The gap between our method (22.01%) and traditional approaches (28.4–39.7% WER) demonstrates the effectiveness of our architecture and training strategy. Our approach provides an effective balance of performance and efficiency for diacritical Arabic speech recognition systems, while acknowledging the superior WER performance of extensive pretrained models such as XLSR.
Inference Efficiency Comparison
To assess the deployment practicality of the proposed lightweight Transformer, inference latency was measured on a Google Colab T4 GPU (16 GB VRAM, batch size = 1) using 5 s audio inputs from the SASSC dataset. The comparison focused on real-time capability rather than training efficiency.
The proposed lightweight Transformer (≈14 M parameters) achieved an average inference latency of ≈33.85 ms per clip, confirming real-time performance with a Real-Time Factor (RTF) of 0.007 (≈148× faster than real time). The XLSR-Large baseline (≈300 M parameters) processed the same input in ≈14.05 ms (RTF = 0.0147, ≈71× faster than real time). A detailed comparison of latency, parameter count, and real-time performance is summarized in Table 8.
Although XLSR yields slightly lower latency, the proposed model delivers comparable inference responsiveness with 20× fewer parameters, enabling practical deployment on resource-limited systems while maintaining competitive WER performance.
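A basic sketch of how such latency and RTF figures can be measured is given below; the placeholder model, warm-up count, and frame count are illustrative.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model: torch.nn.Module, feats: torch.Tensor,
                    audio_seconds: float, n_runs: int = 100):
    """Average per-clip latency (ms) and real-time factor (RTF = latency / audio duration)."""
    device = next(model.parameters()).device
    feats = feats.to(device)
    for _ in range(10):                       # warm-up runs (GPU kernels, caches)
        model(feats)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(feats)
    if device.type == "cuda":
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / n_runs
    return latency * 1e3, latency / audio_seconds

# Example with a placeholder model and a 5 s clip (about 500 frames at a 10 ms hop).
model = torch.nn.Linear(80, 49).eval()
ms_per_clip, rtf = measure_latency(model, torch.randn(1, 500, 80), audio_seconds=5.0)
```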
5.7. Per-Diacritic Performance Analysis
A detailed analysis of CER was performed for each diacritic using the final model (three-phase fine-tuning with data augmentation) across all ten folds to gain a deeper understanding of the model's behavior. The quantitative results are summarized in Table 9 and Table 10.
The analysis reveals that the model demonstrates high precision and recall across all diacritics, with mean F1-scores exceeding 93%. The lowest CERs were obtained for the short vowels Fatha (5.39%), Kasra (7.60%), and Damma (7.57%), which correspond to frequent and acoustically distinct phonemes in Arabic. Their clear spectral formant patterns enable the Transformer encoder to capture vowel-related cues effectively.
Moderate error rates were observed for Sukun (12.92%) and Shadda (13.23%). These marks are temporally or contextually complex—Sukun indicates the absence of a vowel, while Shadda represents gemination (consonant doubling). Both depend on subtle timing and intensity variations that are harder to learn consistently across speakers and recording conditions.
The highest CERs occurred in the Tanween forms—Tanween Fath (13.45%), Tanween Kasr (20.54%), and Tanween Damm (18.99%)—which exhibit strong data sparsity and weak acoustic salience. Their realization often relies on morphological context rather than clear acoustic cues, explaining the larger fold-to-fold variability (standard deviation ≈ 9–10%).
Overall, these findings indicate that the proposed model captures frequent vowelized diacritics with high consistency, while rarer and context-dependent Tanween marks remain challenging. This suggests that further improvement could be achieved through synthetic data balancing or morphologically aware augmentation strategies. The results confirm that the final Transformer effectively models the Arabic phonetic–diacritical structure, maintaining both linguistic accuracy and computational efficiency.
6. Conclusions and Limitations
This work presents a lightweight encoder-only Transformer architecture with CTC and RPE for diacritical Arabic ASR. Through a systematic seven-phase ablation study and a three-phase progressive fine-tuning strategy, we demonstrate that competitive performance can be achieved without extensive multilingual pre-training.
Our approach achieved a 22.01% WER on the SASSC dataset. This represents significant improvements compared to traditional methods (28.4–39.7% WER) and requires far fewer computational resources than pretrained ASR models (~14 M vs. ~300 M parameters for XLSR). The two-stage training strategy leverages the linguistic relationship between MSA and diacritical Arabic, effectively addressing the challenge of limited diacritical training data through strategic transfer learning.
The comprehensive experimental validation shows that post-normalization configuration, Swish activation, and 6× feedforward expansion are the best architectural choices for this task. The integration of RPE and CTC proves particularly effective in capturing temporal dependencies that are crucial for distinguishing subtle diacritical variations in Arabic speech.
Overall, our findings demonstrate that it is possible to build accurate and resource-efficient diacritical Arabic ASR systems without relying on massive multilingual pretraining. By combining architectural innovations with a progressive fine-tuning strategy, we provide a practical framework that balances accuracy and efficiency. This work not only establishes a new benchmark for diacritical Arabic ASR but also lays the foundation for extending lightweight ASR solutions to other low-resource languages and real-world applications where computational constraints are critical.
Although the proposed lightweight transformer with relative positional encoding demonstrates competitive accuracy and efficiency, several limitations remain. First, the training and evaluation were conducted on a single diacritical Arabic dataset (SASSC), which may limit generalization to broader dialectal or spontaneous speech. Second, although the proposed model substantially reduces parameter count and training cost, real-time inference on embedded or mobile devices was not fully explored. Third, this study focused exclusively on diacritical Arabic, without addressing local Arabic dialects, which may exhibit different phonetic and linguistic characteristics that affect recognition performance. Future work will focus on expanding dataset diversity, integrating more efficient decoding schemes, and testing deployment performance on edge hardware.