Article

Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model †

by
Mohammad Alshboul
1,*,
Abdul Rahman Al Muaitah
2,
Suhad Al-Issa
1 and
Mahmoud Al-Ayyoub
3,4
1
Department of Computer Engineering, Jordan University of Science and Technology, Irbid 22110, Jordan
2
Department of Computer Science and Engineering, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates
3
Department of Information Technology, Ajman University, Ajman, United Arab Emirates
4
Department of Computer Science, Jordan University of Science and Technology, Irbid 22110, Jordan
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in MCNA’23 Conference, Valencia, Spain, 19–22 June 2023.
Appl. Sci. 2025, 15(17), 9521; https://doi.org/10.3390/app15179521
Submission received: 28 June 2025 / Revised: 3 August 2025 / Accepted: 5 August 2025 / Published: 29 August 2025

Abstract

In this work, we build on our recent efforts toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to people of any age, gender, or expertise level. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, was introduced in our prior works. Using various subsets of the QRFAM dataset for training, validation, and testing, we built several basic NSR systems based on Mozilla’s DeepSpeech model, and we have also presented our efforts to optimize and enhance these baseline models. In this study, we expand this line of work by utilizing one of the well-known speech recognition models, Whisper, and we describe the effect of this choice on the model’s accuracy, expressed as the word error rate (WER), in comparison to that of DeepSpeech.

1. Introduction

The Quran is the main source of Arabic vocabulary, with 77,794 unique words in total, including the Basmalah, a special phrase introduced at the start of each Surah. There are 28 distinct letters in the Arabic alphabet (29 when the hamza is taken into consideration).
Millions of Muslims worldwide memorize the Quran and practice reciting it daily. One of the main challenges they face is that they need someone to listen to their recitation to ensure that their memorization is correct, which is known to be a very tedious and time-consuming process. Because of that, it would be very beneficial to have an automatic recognition system that helps students check their memorization without the need for human assistance. However, given the special nature of Quranic recitation and phonetics, traditional speech recognition models are usually not very effective in Quranic speech recognition.
A computational procedure called automatic speech recognition (ASR) transforms recorded voice signals into a written representation. Preprocessing, feature extraction, and classification are the three main steps that constitute an ASR system, as described in the literature [1,2,3,4,5]. ASR systems have changed dramatically in recent years, moving from traditional methods such as CMU Sphinx [6,7,8,9,10] to sophisticated methods that make use of deep learning (DL) and deep neural networks (DNNs) [11,12].
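The three-stage pipeline just described can be sketched as follows; every stage here is an illustrative stub (per-frame energy standing in for real features such as MFCCs, and a simple threshold standing in for a real classifier), not an actual recognizer.

```python
# Minimal ASR pipeline skeleton following the three stages described above.
# All stage implementations are illustrative stubs, not a real recognizer.
def preprocess(signal):
    # e.g., amplitude normalization; real systems also denoise and remove silence
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]

def extract_features(samples, frame_size=4):
    # real systems compute MFCCs; per-frame energy is a stand-in here
    return [sum(x * x for x in samples[i:i + frame_size])
            for i in range(0, len(samples), frame_size)]

def classify(features):
    # real systems map features to phonemes/words; a threshold is a stand-in
    return ["speech" if f > 0.5 else "silence" for f in features]

transcript = classify(extract_features(preprocess(
    [0.1, 0.9, -0.8, 0.2, 0.0, 0.0, 0.01, 0.0])))
```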
To train ASR systems and ensure their optimal performance, large volumes of data are needed. Nevertheless, there are some issues with the audio recordings of Quranic recitations that are currently available. These recordings are extensive, but most of the performers are male reciters with highly developed recitation techniques who use expensive, professional equipment. Because the reciters belong to a specific demographic group (adult males), and because of the training data’s level of expertise and quality, the resulting model may be biased to some extent, making it less able to process input from other groups such as non- or semiprofessionals, women, or children. This study aims to help mitigate this shortcoming [13,14].
This work focuses mainly on addressing this problem for female recitations. Targeting children’s recitations can be very promising for improving demographic inclusiveness. However, we decided to leave it as future work and not investigate it thoroughly in this paper because it poses several challenges, such as data availability and securing parental consent. Nevertheless, we are hopeful that each step toward enriching the inclusiveness of the QSR (e.g., including more female recitations) will improve its overall accuracy even for other groups that the model might not have directly targeted.
The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, was first described in our earlier articles [13,14]. We also developed multiple basic NSR systems using this dataset, which were based on Baidu’s DeepSpeech model as implemented by Mozilla [11,15]. We reported our efforts to tune and enhance these baseline models in our most recent work [16]. Here, we expand on this work by presenting an NSR system that makes use of a well-known speech recognition model, Whisper [17]. We also describe the effect of this choice on the model’s performance, expressed as the word error rate (WER), in comparison to DeepSpeech.
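Since the WER is the metric used throughout this comparison, the following minimal sketch shows how it is computed: the word-level Levenshtein (edit) distance between the hypothesis and the reference, normalized by the number of reference words. The function name and sample strings are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a four-word reference with one substituted and one deleted word yields a WER of 2/4 = 0.5.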
This work focused on the language mistakes in recitations because the main goal is to help people memorize the Quran. Therefore, incorporating Tajweed rules was beyond the scope of this paper. However, it is worth mentioning that all the recitations used to train and test the model are correct and have been thoroughly revised by a separate entity to guarantee their accuracy. This will help build a foundation for the system to be sensitive to correct recitations and pave the road for more advanced tasks.
The remainder of the paper is structured as follows. Section 2 offers an overview of the relevant literature. The process we used to optimize the NSR systems we had previously constructed with the DeepSpeech model is presented in Section 3. Additional details regarding the OpenAI Whisper model are given in Section 4. Section 5 presents and discusses the experimental outcomes. Finally, Section 6 concludes the paper.

2. Related Work

Presenting pertinent studies on speech recognition—both historical and contemporary—is crucial to the advancement of our work. Therefore, in this section, we investigate some of the main related works in this field.
When we approach speech as a signal with particular features and attributes, we can identify it as vocal sounds made by humans for communication. Theoretically, speech is thought of as an analog wave that varies with time [18].
Signal processing, acoustic models, pronunciation models, language models, and decoding are the various parts of a speech recognition (SR) system [19]. There are two kinds of SR systems: traditional and nontraditional (neural). For phonemic modeling, traditional systems typically use the hidden Markov model (HMM), a statistical model based on the Markov process [20]. CMU Sphinx [21] is an example of a continuous SR system based on an HMM with a Gaussian mixture model (GMM) for output probabilities.
Prior research on the application of speech recognition to Arabic has involved the identification of the Quran using the Sphinx tool, with particular attention given to Surat Al-Ikhlas. These works achieved a word recognition rate of 90% with male reciters and 85% with female reciters [22]. A different system used the Sphinx tool to teach and recognize the Quran and achieved a 46% word error rate (WER) [23].
A speech recognition system based on the Quran was built [24], utilizing the Sphinx framework and HMM toolkit in the context of Arabic Quran recognition. An automated Tajweed verification engine specifically designed for Surat Al-Fatihah was presented for Quran recognition via Tajweed principles, and it achieved an accuracy of 91.95% at the Ayah level [25]. Moreover, a system built using CMU Sphinx 4 employs a dataset comprising Surat Al-Fatihah, Al-Ikhlas, Al-Falaq, and Al-Nass recited by 22 distinct renowned reciters. This system produced a WER of 1.5% with a small set of training audio files and 50.0% WER when 90% of the audio files were used for training [26].
Deep learning (DL) is an important component of neural (nontraditional) techniques. The following paragraphs discuss a few noteworthy works.
Researchers from the Microsoft Research Institute proposed notable advancements in voice recognition by presenting the CLDNN-HMM model. To estimate the output probabilities for HMM hidden states, this model combines a deep neural network (DNN) and an HMM, using DNN outputs in place of the GMM in the HMM-GMM framework [27].
A different study investigated ASR models other than HMM-DNN, including connectionist temporal classification (CTC), recurrent neural network (RNN)-transducer, and attention-based sequence-to-sequence models. According to the experiments, the DL-based models outperformed the HMM-based models in terms of recognition accuracy [1].
DeepSpeech, a DL-based CTC model, uses deep neural networks to translate speech to text. After training on massive datasets such as the Fisher and Switchboard datasets, it obtained WERs of 12.6% and 19.3% on the CallHome data and the Switchboard Hub5’00 data, respectively. For the English language, this model fared better than earlier SR models [11].
The RNN and CTC techniques are used in DeepSpeech2, an upgraded version of DeepSpeech. SortaGrad is a component of the training methodology that progressively lengthens the speech in the training set. The foundation of DeepSpeech2 is the PaddlePaddle implementation [28]. Rather than using CTC, DeepSpeech3, an enhanced version of DeepSpeech1, uses the RNN-transducer loss function. This approach produced increased performance in end-to-end voice recognition by doing away with the requirement for a language model during inference [29].
Google’s Listen, Attend and Spell (LAS) ASR model consists of three parts: a listener (an encoder that uses bidirectional LSTM layers to map sounds into higher-level representations), an attender, and a speller (a decoder that emits characters). The model outperforms Google’s prior model by a margin of 2.3% in WER [30] after being trained on Google Voice Search data.
After discussing speech recognition in general, the rest of this section discusses research conducted in the area of DL-based Quran recognition.
In accordance with the Quranic recitation guidelines, a DL model that makes use of support vector machines (SVMs) and DBN convolutions was created to help in accurate Quran reading. The Weka tool was used to test this model on unseen records, and the accuracy was 97.7% [31].
Another study employed K-nearest neighbor (KNN) classifiers and artificial neural networks (ANNs) to recognize Quranic recitations. For Surahs 18 and 36, the ANN classifier yielded an accuracy of 97.6% and 96.7%, respectively, whereas the KNN classifier produced accuracies of 97.03% and 96.08%, respectively, for identical Surahs [32].
Based on the DeepSpeech foundation, the DeepSpeech-Quran model [33] aims to assist people in reading the Quran through an ASR system. It makes use of datasets named “Imam-Recitations” and “Imam-Tusers-Recitations,” which contain recitations by professional and semi-professional reciters.
Using a CNN-bidirectional GRU encoder and the CTC objective function, a novel end-to-end ASR-based DL model for Quran recitation recognition was created. When trained on the Ar-DAD dataset, which consists of 37 Surahs recited by 30 distinct reciters with variable pronunciation norms and speeds, the model achieved a WER of 8.34% and a CER of 2.42% [34].
A model that addresses the shortcomings of earlier Quran recognition studies was proposed. It uses LSTM neural networks with mel-frequency cepstral coefficient (MFCC) features, taking into account the time-series nature of voice signals. With accuracies of 96% for separate stretching, 95% for tight tightening, and 96% for hiding, the model, which focuses on verbal error detection in reading and is based on Tajweed rules, outperforms conventional machine learning models [35].
The Quranic Database for Recitation Correction, or the QDRC dataset, was unveiled. It contained corrections for 54 errors committed by 17 volunteers during the recitation of Surat Al-Fatihah, along with their original recordings. Tajweed criteria for Quranic recitation recognition are used for this dataset, which also incorporates voice feature analysis [36].
To handle customary Tajweed-required Quran recitation, an artificial neural network-based model was created. The model trains the neural network via several techniques and extracts features via the mel-frequency cepstral coefficient. The Levenberg–Marquardt algorithm produced the best accuracy of 77.7% [37].
Another related example is QRFAM [13,14], which is a Quran recitation dataset by females and males. It includes a substantial benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels. This dataset is the main dataset used for the experiments in this paper. Table 1 illustrates the main QRFAM dataset statistical information. The full details of the QRFAM configuration can be found in [13].

3. Mozilla’s DeepSpeech Implementation

We use Mozilla’s DeepSpeech NSR model, previously published in [13], as the basis of this work. We then perform several experiments to optimize the neural models by varying the DeepSpeech hyperparameters for training, assessment, and testing [14]. The default DeepSpeech hyperparameters are shown in Table 2 and Table 3.
The language model interprets the phoneme sequence as words, whereas the DeepSpeech model uses an RNN model to generate outputs in the form of phonemes. Our goal was to enhance the decoding objective function to reduce the word error rate (WER). Specifically, we wanted to find the optimal phoneme sequence to substitute for the vocabulary generated by the RNN.
The objective function was described by Hannun et al. [11] as a weighted combination of the RNN score P(c|x) and the language model score P_lm(c), where x denotes the input utterance and c is the candidate output character sequence, which in our work represents the phonemes. Equation (1) gives this function, where α is the language model weight and β is the word insertion weight.
Q(c) = log(P(c|x)) + α log(P_lm(c)) + β word_count(c)    (1)
Using the English-language test dataset in DeepSpeech release 0.9.3 [15], the optimal coefficient values for α and β in Mozilla’s DeepSpeech were (α, β) = (0.75, 1.85), with a beam size of 1024. To determine the coefficient values that yield the lowest WER on our data, we carried out empirical experiments. Overall, we found that simultaneously modifying α and β produced the lowest WER.
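As a concrete illustration of how this rescoring with α and β selects among beam candidates, the following sketch scores two hypothetical transcripts with Equation (1); the candidate strings and log-probabilities are made up for illustration.

```python
def decoding_score(log_p_acoustic, log_p_lm, word_count,
                   alpha=0.75, beta=1.85):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c)."""
    return log_p_acoustic + alpha * log_p_lm + beta * word_count

# Hypothetical beam candidates: (transcript, log P(c|x), log P_lm(c))
candidates = [
    ("bismillah",          -2.0, -3.0),
    ("bismillah alrahman", -2.5, -2.0),
]
# Pick the candidate with the highest combined score Q(c)
best = max(candidates,
           key=lambda c: decoding_score(c[1], c[2], len(c[0].split())))
```

With the default weights, the word insertion bonus β and the higher language model score make the longer candidate win despite its lower acoustic score.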
We also looked into changing the number of hidden neurons in the model to improve it and obtain a lower WER [38]. The neural network’s size is determined by the number of hidden neurons; in situations where there is a shortage of data, such as with our female dataset, it is best to select a smaller network size. For real-time applications, such as mobile phone apps, fewer hidden neurons also save memory and CPU resources, increasing their effectiveness. We adjusted the number of hidden neurons in Mozilla’s DeepSpeech, whose default value is 2048 [39].
Additionally, we tried modifying the dropout rate in an effort to enhance the model further. In neural networks, dropout is a regularization technique that efficiently reduces overfitting and improves generalization. By tuning the dropout rate coefficient in Mozilla’s DeepSpeech, which was initially set to 0.15 [39], we were able to obtain a lower WER in our experiments.
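The tuning described in this section amounts to a search over combinations of α, β, and the dropout rate. The sketch below illustrates such a procedure with a grid search; evaluate_wer is a placeholder stub (a toy function standing in for a full DeepSpeech train-and-evaluate run), so its numbers carry no experimental meaning.

```python
import itertools

def evaluate_wer(alpha, beta, dropout):
    """Placeholder for a real DeepSpeech train/evaluate run returning a WER.
    A toy convex function stands in here so the sketch is runnable."""
    return (alpha - 0.9) ** 2 + (beta - 1.5) ** 2 + (dropout - 0.2) ** 2 + 0.05

# Candidate values for each hyperparameter (illustrative choices only)
grid = itertools.product([0.5, 0.75, 0.9],   # alpha
                         [1.5, 1.85],        # beta
                         [0.15, 0.2, 0.3])   # dropout
best = min(grid, key=lambda p: evaluate_wer(*p))
# best holds the (alpha, beta, dropout) triple with the lowest (stubbed) WER
```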
In conclusion, we optimized our neural models through a series of experiments involving the modification of coefficient values.

4. OpenAI’s Whisper Implementation

An advanced ASR model called Whisper [17] was developed as a consequence of training on approximately 680k hours of supervised web data. These data include multilingual and multitasking samples. The authors have shown how resilience against accents, background noise, and technical terminology is increased when a dataset of this size and variety is used. Additionally, it allows transcription in those languages as well as translations from several other languages into English. Whisper’s authors laid the foundation for future research on robust speech processing and the development of practical applications by providing the models and inference code [40,41,42,43]. This can be quite helpful when trying to adapt the model to other use cases, such as what we are doing in this project.
The Whisper model’s ability to handle diverse audio inputs makes it particularly suitable for applications in various domains, including healthcare, customer service, and education. For instance, in healthcare, Whisper can be used to transcribe doctor–patient conversations accurately, ensuring that medical records are precise and comprehensive. In customer service, it can enhance the efficiency of call centers by providing real-time transcriptions of customer interactions, allowing for better analysis and response to customer needs. In education, Whisper can assist in creating accessible learning materials by transcribing lectures and translating them into multiple languages, thus catering to a global audience.
Moreover, Whisper’s open-source nature encourages collaboration and innovation within the research community. By providing access to the models and inference code, OpenAI enables researchers and developers to build upon Whisper’s capabilities, tailoring it to specific needs and exploring new frontiers in speech recognition technology. This openness not only accelerates advancements in the field but also democratizes access to cutting-edge ASR technology, allowing smaller organizations and individual developers to leverage Whisper’s powerful features without the need for extensive resources.
In this work, we experiment with the effectiveness of Whisper in performing some of the same Quranic recognition tasks we recently conducted via DeepSpeech [16]. We also expand this evaluation by presenting a thorough analysis when multiple versions of the Whisper model are employed.

5. Experiments and Results

We developed numerous iterations of each neural model, looking for configurations in which it may perform better. These models had been improved before in [14], except for Experiment #3 (FFF).
This section investigates these optimizations that were conducted along with the corresponding WER enhancement. The WER of the baseline (non-finetuned) DeepSpeech model is shown in Table 4. The details of all conducted experiments are shown in Table 5 and Table 6. In all these experiments, speaker independence was ensured by guaranteeing that each reciter would only be included in one category (e.g., training or testing).

5.1. Experiments on Small Datasets of Males and Females

5.1.1. Experiment #1 (MMM)

In this experiment, we were able to lower the WER from 0.406 to 0.32. We began with a previous attempt (Attempt #24), which achieved the lowest WER value in Experiment #3; since Experiment #1 is identical to Experiment #3 except for gender, the goal was to reduce the number of attempts needed. Table 7 lists all attempts. Each attempt builds on an earlier one, as indicated by the column titled BA # (MP), which stands for “Base Attempt # (Modified Parameter)”. For example, Attempt #10 is based on Attempt #9 with a modified β, and Attempt #1 was based on the original Experiment #1. The best WER achieved is shown in Table 7.

5.1.2. Experiment #2 (MMF)

The WER obtained in this experiment was 0.968, which we were able to lower to 0.938. As with Experiment #1, we began this experiment with a previous attempt (Attempt #24), which produced the lowest WER value in Experiment #3. Table 8 lists all attempts.

5.1.3. Experiment #4 (FFM)

We were able to lower the WER obtained in this experiment from (0.966) to (0.951) after multiple tries were performed. Every effort is detailed in Table 9.

5.1.4. Experiment #5 (MMF) (All Female Records That Are Available During Testing)

We were able to reduce the WER obtained in this experiment from (0.842) to (0.828) after multiple tries were performed. Table 10 contains a list of all the attempts.

5.1.5. Experiment #6 (FFM) (All Female Records That Are Available in the Train and Development)

We were able to reduce the WER obtained in this experiment from (0.857) to (0.789) after multiple tries were performed. Table 11 contains a list of all attempts.

5.2. Experiments on the Male Dataset

5.2.1. Experiment #7 (MMM) (≤10 s, C)

Despite our best efforts, we were unable to lower the WER in this experiment; the WER continued to rise, so we settled for the initial value of 0.046, which was the lowest across all attempts. Table 12 contains a list of all the attempts.

5.2.2. Experiment #8 (MMM) (≤10 s, D)

The WER obtained in this experiment was 0.049, which we were able to lower to 0.044 after multiple tries were made. Table 13 contains a list of all the attempts.

5.2.3. Experiment #9 (MMM) (<1 s, ≤30 s, D)

We were able to lower the WER obtained in this experiment from (0.128) to (0.093) after multiple tries were made. Table 14 shows a list of every attempt.

5.2.4. Experiment #10 (MMM) (≤30 s, D)

The WER obtained in this experiment was (0.295), which we were able to lower to (0.289) after multiple tries were made. Table 15 shows a list of every attempt.

5.2.5. Experiment #11 (MMM) (≤45 s, D)

We were able to reduce the WER obtained in this experiment from 0.16 to 0.064 after multiple tries were made. Table 16 contains a list of all the attempts.
An overview of the improvements made in those experiments can be seen in Table 17. Every experiment (apart from Exp #7) demonstrated an improvement in the WER, as the table illustrates. In certain experiments, this improvement exceeded 20% (Exp #1 and Exp #9).

5.3. Experiments on OpenAI’s Whisper

OpenAI’s Whisper was tested on the same test datasets used in Exp #1 and Exp #4, as well as Exp #2 and Exp #3. All five model sizes were tested. Table 18 presents the WERs achieved by each of the different model sizes. Table 19 provides a side-by-side WER performance comparison between the optimized DeepSpeech models and Whisper’s “Large” model, where “WER difference” represents the difference between the WERs of Whisper and DeepSpeech.
The results from these experiments highlight the strengths and weaknesses of Whisper across different model sizes. In Exp #1 and Exp #4, Whisper’s performance varied significantly depending on the model size, with the “Large” model consistently achieving the lowest WER. This finding indicates that the larger models, with their increased capacity, are better at handling the complexities of the test datasets. The smaller models, while still performing admirably, presented higher WERs, suggesting that they might struggle with more challenging audio inputs or less common linguistic patterns.
In Exp #2 and Exp #3, similar trends were observed. The “Large” model of Whisper outperforms the smaller models, demonstrating its robustness and accuracy in diverse scenarios. The comparison in Table 19 between the optimized DeepSpeech models and Whisper’s “Large” model further underscores Whisper’s superiority. The “WER difference” metric reveals that Whisper consistently achieves lower WERs than DeepSpeech does, highlighting its advanced capabilities in speech recognition. This performance gap suggests that Whisper’s architecture and training data give it a significant edge in accurately transcribing speech, even in challenging conditions. The fact that the WER is much better on average supports the paper’s overall conclusion, even though the gain is not equally visible in every experiment, since some experiments showed only small accuracy improvements.
Overall, these experiments demonstrate Whisper’s potential as a leading ASR system, capable of delivering high accuracy across various tasks and datasets. The detailed analysis of WER across different model sizes and comparisons with DeepSpeech provide valuable insights into the strengths of Whisper, making it a promising tool for future applications in speech recognition.
Table 19 also reports the “WER enhancement (%),” which represents the percentage improvement in the WER of Whisper relative to the WER of DeepSpeech.
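For reference, the “WER enhancement (%)” figure can be computed as the relative reduction in WER; the sample values below are illustrative only and are not taken from Table 19.

```python
def wer_enhancement_pct(wer_deepspeech: float, wer_whisper: float) -> float:
    """Relative WER improvement of Whisper over DeepSpeech, in percent."""
    return (wer_deepspeech - wer_whisper) / wer_deepspeech * 100.0

# Illustrative WER pair only (not from Table 19):
improvement = wer_enhancement_pct(0.32, 0.20)  # 37.5% relative improvement
```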
It is worth noting that improvements in word error rates (WERs) using different models (or versions of a model) often come with a variation in the model complexity and associated costs. To explore that, Table 20 provides a comparison of the execution time (in seconds) when using different versions of the Whisper model. This should help clarify the trade-offs involved with the different models we used and show the associated costs that can come with the accuracy improvements reported earlier in Table 18.

6. Conclusions

This research extends our prior work toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to users of any age, gender, or level of experience. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters spanning a range of age groups and competency levels, was introduced in our previous publications. Furthermore, for training, validation, and testing, we built several basic NSR systems based on Mozilla’s DeepSpeech model, utilizing various subsets of the QRFAM dataset. Additionally, we demonstrated some improvements that may be applied to the DeepSpeech model to enhance the WER. We expand on these efforts in this study by introducing a well-known speech recognition model and reporting the effect of this choice on the WER in comparison to DeepSpeech.
Our study has advanced significantly with the addition of OpenAI’s Whisper model to our experiments. Our goal was to improve the accuracy and dependability of our NSR systems by utilizing Whisper’s strong architecture and large training dataset. The comparison between DeepSpeech and Whisper offered insight into the advantages and disadvantages of each model. Whisper has been shown to be an effective tool for Quranic recitation recognition, as reflected by its exceptional WER performance across multiple test datasets. With recordings from reciters varying in age, gender, and skill level, the QRFAM dataset is both challenging and diverse, making this improvement all the more remarkable.
Moreover, Whisper’s open-source design promotes ongoing development and adoption. Building on our findings, scholars and developers might investigate new applications and improvements for Quranic recitation recognition. This cooperative approach guarantees that the technology remains useful and accessible to a broad spectrum of consumers while also expediting progress in the industry. Our research shows how crucial it is to include cutting-edge ASR models, such as Whisper, into specific applications to pave the way for future speech recognition technologies that will be more accurate and inclusive.

Author Contributions

Conceptualization, M.A. and M.A.-A.; Data curation, A.R.A.M. and S.A.-I.; Investigation, M.A.-A.; Methodology, A.R.A.M. and S.A.-I.; Project administration, M.A. and M.A.-A.; Software, A.R.A.M. and S.A.-I.; Writing—original draft, M.A.; Writing—review and editing, M.A., A.R.A.M., S.A.-I. and M.A.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jordan University of Science and Technology (Grant Research ID: 20230113). The authors would also like to thank Ajman University for partially supporting this work.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gaikwad, S.K.; Gawali, B.W.; Yannawar, P. A review on speech recognition technique. Int. J. Comput. Appl. 2010, 10, 16–24. [Google Scholar] [CrossRef]
  2. Ghai, W.; Singh, N. Literature review on automatic speech recognition. Int. J. Comput. Appl. 2012, 41, 42–50. [Google Scholar] [CrossRef]
  3. Shanthi, T.S.; Lingam, C. Review of feature extraction techniques in automatic speech recognition. Int. J. Sci. Eng. Technol. 2013, 2, 479–484. [Google Scholar]
  4. Dixit, R.; Kaur, N. Speech recognition using stochastic approach: A review. Int. J. Innov. Res. Sci. Eng. Technol. 2013, 2, 356–361. [Google Scholar]
  5. Gamit, M.R.; Dhameliya, K.; Bhatt, N.S. Classification techniques for speech recognition: A review. Int. J. Emerg. Technol. Adv. Eng. 2015, 5, 58–63. [Google Scholar]
  6. Lee, K.-F.; Hon, H.-W.; Reddy, R. An overview of the SPHINX speech recognition system. IEEE Trans. Acoust. Speech Signal Process. 1990, 38, 35–45. [Google Scholar] [CrossRef]
7. Djemili, R.; Bedda, M.; Bourouba, H. Recognition of spoken Arabic digits using neural predictive hidden Markov models. Int. Arab J. Inf. Technol. 2004, 1, 226–233.
8. Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1994; Volume 247.
9. Burget, L.; Schwarz, P.; Agarwal, M.; Akyazi, P.; Feng, K.; Ghoshal, A.; Glembek, O.; Goel, N.; Karafiát, M.; Povey, D.; et al. Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4334–4337.
10. Su, D.; Wu, X.; Xu, L. GMM-HMM acoustic model training by a two-level procedure with Gaussian components determined by automatic model selection. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4890–4893.
11. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567.
12. Campomanes-Álvarez, B.R.; Quirós, P.; Fernández, B. Semi-Supervised Learning for Spanish Speech Recognition Using Deep Neural Networks. In Applications of Intelligent Systems; IOS Press: Amsterdam, The Netherlands, 2018; pp. 19–29.
13. Al-Issa, S.; Al-Ayyoub, M.; Al-Khaleel, O.; Elmitwally, N. Towards Building a Speech Recognition System for Quranic Recitations: A Pilot Study Involving Female Reciters. Jordan J. Electr. Eng. 2022, 8, 307–321.
14. Al-Issa, S.; Al-Ayyoub, M.; Al-Khaleel, O.; Elmitwally, N. Building a neural speech recognizer for quranic recitations. Int. J. Speech Technol. 2023, 26, 1131–1151.
15. Mozilla. mozilla/DeepSpeech · GitHub. Available online: https://github.com/mozilla/DeepSpeech (accessed on 3 August 2025).
16. Al-Issa, S.; Alshboul, M.; Al-Ayyoub, M. Enhanced Neural Speech Recognizer for Quranic Recitations. In Proceedings of the 2023 International Conference on Multimedia Computing, Networking and Applications (MCNA), Valencia, Spain, 19–22 June 2023; pp. 62–66.
17. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356.
18. Rabiner, L.; Juang, B.-H. Fundamentals of Speech Recognition; Prentice-Hall, Inc.: Hoboken, NJ, USA, 1993.
19. Huang, X.; Deng, L. An overview of modern speech recognition. In Handbook of Natural Language Processing; Chapman and Hall/CRC: Boca Raton, FL, USA, 2010; Volume 2, pp. 339–366.
20. Juang, B.-H.; Rabiner, L.R. Hidden Markov models for speech recognition. Technometrics 1991, 33, 251–272.
21. Lamere, P.; Kwok, P.; Gouvea, E.; Raj, B.; Singh, R.; Walker, W.; Warmuth, M.; Wolf, P. The CMU SPHINX-4 speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong, China, 6–10 April 2003; Volume 1, pp. 2–5.
22. Tabbal, H.; El Falou, W.; Monla, B. Analysis and implementation of a ‘Quranic’ verses delimitation system in audio files using speech recognition techniques. In Proceedings of the 2006 2nd International Conference on Information & Communication Technologies, Damascus, Syria, 24–28 April 2006; Volume 2, pp. 2979–2984.
23. Hyassat, H.; Abu Zitar, R. Arabic speech recognition using SPHINX engine. Int. J. Speech Technol. 2006, 9, 133–150.
24. Abushariah, M.A.M. TAMEEM V1.0: Speakers and text-independent Arabic automatic continuous speech recognizer. Int. J. Speech Technol. 2017, 20, 261–280.
25. Raja-Yusof, R.J.; Fadila Grine, D.; Jamaliah Ibrahim, N.; Yamani Idna Idris, M.; Razak, Z.; Naemah Abdul Rahman, N. Automated tajweed checking rules engine for Quranic learning. Multicult. Educ. Technol. J. 2013, 7, 275–287.
26. El Amrani, M.Y.; Rahman, M.H.; Wahiddin, M.R.; Shah, A. Building CMU Sphinx language model for the Holy Quran using simplified Arabic phonemes. Egypt. Inform. J. 2016, 17, 305–314.
27. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 2011, 20, 30–42.
28. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 173–182.
29. Battenberg, E.; Chen, J.; Child, R.; Coates, A.; Li, Y.G.Y.; Liu, H.; Satheesh, S.; Sriram, A.; Zhu, Z. Exploring neural transducers for end-to-end speech recognition. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 206–213.
30. Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964.
31. Al-Ayyoub, M.; Damer, N.A.; Hmeidi, I. Using deep learning for automatically determining correct application of basic Quranic recitation rules. Int. Arab J. Inf. Technol. 2018, 15, 620–625.
32. Alkhateeb, J.H. A machine learning approach for recognizing the Holy Quran reciter. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 268–271.
33. Eldeeb, T. tarekeldeeb/DeepSpeech-Quran · GitHub. Available online: https://github.com/tarekeldeeb/DeepSpeech-Quran (accessed on 3 August 2025).
34. Harere, A.A.; Jallad, K.A. Quran Recitation Recognition using End-to-End Deep Learning. arXiv 2023, arXiv:2305.07034.
35. Harere, A.A.; Jallad, K.A. Mispronunciation Detection of Basic Quranic Recitation Rules using Deep Learning. arXiv 2023, arXiv:2305.06429.
36. Alobaylani, A.A.; Parvez, M.T.; Alsuhibany, S.A. A Database for Speech Processing Based Qur’anic Recitation Correction. In Proceedings of the 2020 International Conference on Computing and Information Technology (ICCIT-1441), Tabuk, Saudi Arabia, 9–10 September 2020; pp. 1–5.
37. Ahmad, F.; Yahya, S.Z.; Saad, Z.; Ahmad, A.R. Tajweed classification using artificial neural network. In Proceedings of the 2018 International Conference on Smart Communications and Networking (SmartNets), Yasmine Hammamet, Tunisia, 16–17 November 2018; pp. 1–4.
38. Lou, H.-L. Implementing the Viterbi algorithm. IEEE Signal Process. Mag. 1995, 12, 42–52.
39. Mozilla. Releases · mozilla/DeepSpeech · GitHub. Available online: https://github.com/mozilla/DeepSpeech/releases (accessed on 3 August 2025).
40. Galić, J.; Marković, B.; Grozdić, Đ.; Popović, B.; Šajić, S. Whispered Speech Recognition Based on Audio Data Augmentation and Inverse Filtering. Appl. Sci. 2024, 14, 8223.
41. Yang, J.; Zhou, R. Whisper40: A Multi-Person Chinese Whisper Speaker Recognition Dataset Containing Same-Text Neutral Speech. Information 2024, 15, 184.
42. Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227.
43. Kong, S.; Li, C.; Fang, C.; Yang, P. Building a Speech Dataset and Recognition Model for the Minority Tu Language. Appl. Sci. 2024, 14, 6795.
Table 1. The QRFAM dataset statistical information.

| QRFAM Info | Males | Females |
|---|---|---|
| Number of audio records | 5660 | 5660 |
| Number of reciters | 21 | 21 |
| Number of hours | 24.12 | 13.04 |
| Minimum recording time [s] | 1.10 | 1.1 |
| Maximum recording time [s] | 45.90 | 45.67 |
| Average recording time [s] | 15.34 | 8.29 |
Table 2. DeepSpeech hyperparameters used for training.

| DeepSpeech Hyperparameter | Value |
|---|---|
| Number of MFCC features | 26 |
| Length of the audio window [ms] | 32 |
| Step of the audio window [ms] | 20 |
| Step of the audio window [samples] | 320 |
| Length of the audio window [samples] | 512 |
| Sample rate [Hz] | 16,000 |
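The millisecond and sample-count entries in Table 2 are two views of the same windows: at the stated 16 kHz sample rate, a 32 ms window spans 512 samples and a 20 ms step spans 320. A self-contained consistency check (illustrative only, not taken from the DeepSpeech codebase):

```python
# Check that Table 2's window settings are self-consistent at 16 kHz:
# samples = milliseconds / 1000 * sample_rate.
SAMPLE_RATE_HZ = 16_000

def ms_to_samples(ms: float, rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return round(ms / 1000 * rate_hz)

print(ms_to_samples(32))  # 512 samples per 32 ms window
print(ms_to_samples(20))  # 320 samples per 20 ms step
```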
Table 3. DeepSpeech hyperparameters used for evaluation and testing.

| DeepSpeech Hyperparameter | Value |
|---|---|
| Beam width | 1024 |
| Language model decoding weight (α) | 0.75 |
| Language model decoding weight (β) | 1.85 |
| Test batch size | 48 |
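In DeepSpeech-style CTC beam-search decoding, α weights the external language model's log-probability and β acts as a word-insertion bonus, so the decoder ranks candidate transcripts by a combined score along the following lines. This is an illustrative sketch of that scoring rule, not the project's actual decoder code; the function name and inputs are hypothetical:

```python
def combined_score(log_p_acoustic: float, log_p_lm: float, n_words: int,
                   alpha: float = 0.75, beta: float = 1.85) -> float:
    """Illustrative beam score: acoustic log-probability plus
    alpha * LM log-probability plus beta * word count
    (alpha/beta defaults taken from Table 3)."""
    return log_p_acoustic + alpha * log_p_lm + beta * n_words

# A hypothesis with a better LM score can outrank one with a slightly
# better acoustic score once alpha and beta are applied.
print(round(combined_score(-5.0, -2.0, 3), 2))  # -0.95
print(round(combined_score(-4.8, -6.0, 3), 2))  # -3.75
```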
Table 4. The WER of the baseline (non-fine-tuned) DeepSpeech model.

| Experiment number | 1 | 2 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|
| WER (baseline DeepSpeech) | 0.406 | 0.968 | 0.966 | 0.842 | 0.857 | 0.046 | 0.049 | 0.128 | 0.295 | 0.16 |
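Throughout these tables, WER follows the standard definition: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, normalized by the reference length. A minimal reference implementation for clarity (not the evaluation code used in the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[-1][-1] / len(ref)

print(word_error_rate("a b c d", "a x c"))  # 0.5: 1 substitution + 1 deletion over 4 words
```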
Table 5. Summary of the first six experiments.

| Experiment | Set | # of Records | Time Range [s] | Avg. Time [s] | # of Speakers | Gender | WER | CER |
|---|---|---|---|---|---|---|---|---|
| Experiment #1 | Train | 4541 (80.22%) | 1.38–45.89 | 16.88 | 13 | M | 0.406 | 0.23 |
| | Dev | 559 (9.87%) | 1.09–45.79 | 9.58 | 3 | M | | |
| | Test | 560 (9.89%) | 1.14–45.68 | 8.55 | 5 | M | | |
| Experiment #2 | Train | 4541 (80.22%) | 1.38–45.89 | 16.88 | 13 | F | 0.968 | 0.758 |
| | Dev | 559 (9.87%) | 1.09–45.79 | 9.58 | 3 | F | | |
| | Test | 560 (9.89%) | 1.1–45.67 | 6.78 | 5 | F | | |
| Experiment #3 | Train | 4541 (80.22%) | 1.4–45.60 | 8.55 | 13 | F | 0.608 | 0.396 |
| | Dev | 559 (9.87%) | 1.74–45.18 | 7.72 | 3 | F | | |
| | Test | 560 (9.89%) | 1.1–45.67 | 6.78 | 5 | F | | |
| Experiment #4 | Train | 4541 (80.22%) | 1.4–45.60 | 8.55 | 13 | M | 0.966 | 0.664 |
| | Dev | 559 (9.87%) | 1.74–45.18 | 7.72 | 3 | M | | |
| | Test | 560 (9.89%) | 1.14–45.68 | 8.55 | 5 | M | | |
| Experiment #5 | Train | 45,900 (80.21%) | 1.38–45.89 | 13.72 | 8 | M | 0.842 | 0.595 |
| | Dev | 5660 (9.89%) | 1.20–45.74 | 15.67 | 8 | M | | |
| | Test | 5660 (9.89%) | 1.1–45.67 | 8.29 | 21 | F | | |
| Experiment #6 | Train | 4992 (78.88%) | 1.4–45.60 | 8.51 | 7 | F | 0.857 | 0.601 |
| | Dev | 668 (10.55%) | 1.1–45.67 | 6.69 | 4 | F | | |
| | Test | 668 (10.55%) | 4.3–45.71 | 19.16 | 4 | M | | |
Table 6. Summary of the next five experiments.

| Experiment | Set | # of Records | Time Range [s] | # of Speakers | Gender | Clean/Diacriticized | WER | CER |
|---|---|---|---|---|---|---|---|---|
| Experiment #7 | Train | 90,257 (79.99%) | ≤10 | 34 | M | Clean | 0.046 | 0.025 |
| | Dev | 11,289 (10.005%) | ≤10 | 4 | M | | | |
| | Test | 11,285 (10.001%) | ≤10 | 4 | M | | | |
| Experiment #8 | Train | 90,257 (79.99%) | ≤10 | 34 | M | Diacriticized | 0.049 | 0.025 |
| | Dev | 11,289 (10.005%) | ≤10 | 4 | M | | | |
| | Test | 11,285 (10.001%) | ≤10 | 4 | M | | | |
| Experiment #9 | Train | 92,964 (80.40%) | <11 | 34 | M | Diacriticized | 0.128 | 0.086 |
| | Dev | 11,642 (10.06%) | <11 | 4 | M | | | |
| | Test | 11,014 (9.52%) | ≤30 | 4 | M | | | |
| Experiment #10 | Train | 183,690 (80.29%) | ≤30 | 34 | M | Diacriticized | 0.295 | 0.251 |
| | Dev | 23,022 (10.06%) | ≤30 | 4 | M | | | |
| | Test | 22,057 (9.64%) | ≤30 | 4 | M | | | |
| Experiment #11 | Train | 200,213 (80.54%) | ≤45 | 34 | M | Diacriticized | 0.16 | 0.107 |
| | Dev | 24,199 (9.73%) | ≤45 | 4 | M | | | |
| | Test | 24,165 (9.72%) | ≤45 | 4 | M | | | |
Table 7. The improvement outcome of experiment #1 (baseline WER is 0.406).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | Default Exp | 75 | 0.32 |
Table 8. The improvement outcome of experiment #2 (baseline WER is 0.968).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | Default Exp | 75 | 0.938 |
Table 9. The improvement outcome of experiment #4 (baseline WER is 0.966).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.966 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.964 |
| 3 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.983 |
| 4 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.996 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.951 |
Table 10. The improvement outcome of experiment #5 (baseline WER is 0.842).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.842 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.834 |
| 3 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.883 |
| 4 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.866 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.828 |
Table 11. The improvement outcome of experiment #6 (baseline WER is 0.857).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.857 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.848 |
| 3 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.864 |
| 4 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.806 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.789 |
Table 12. The improvement outcome of experiment #7 (baseline WER is 0.046).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.046 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.990 |
| 3 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.954 |
| 4 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.966 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.999 |
| 6 | 0.931289 | 1.1834138 | 2048 | 0.0001 | 0.15 | 4 (n-hidden) | 75 | 0.990 |
| 7 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 5 (epoch) | 15 | 0.980 |
| 8 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 5 (epoch) | 11 | 0.993 |
| 8 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 5 (epoch) | 30 | 0.990 |
| 9 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 5 (epoch) | 100 | 0.991 |
| 10 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.999 |
| 11 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.5 | 10 (β) | 75 | 0.998 |
Table 13. The improvement outcome of experiment #8 (baseline WER is 0.049).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.049 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.048 |
| 3 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.059 |
| 4 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.044 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.055 |
Table 14. The improvement outcome of experiment #9 (baseline WER is 0.128).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.128 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.118 |
| 3 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.136 |
| 4 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.093 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.139 |
Table 15. The improvement outcome of experiment #10 (baseline WER is 0.295).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.295 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.293 |
| 3 | 0.75 | 1.85 | 1024 | 0.0001 | 0.5 | 2 (dropout) | 75 | 0.296 |
| 4 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.290 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.289 |
| 6 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.302 |
Table 16. The improvement outcome of experiment #11 (baseline WER is 0.16).

| Attempt | α | β | n-Hidden | lr | Dropout | BA # (MP) | Ep | WER |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.75 | 1.85 | 2048 | 0.0001 | 0.15 | Default Exp | 75 | 0.16 |
| 2 | 0.75 | 1.85 | 1024 | 0.0001 | 0.15 | 1 (n-hidden) | 75 | 0.156 |
| 3 | 0.75 | 1.85 | 1024 | 0.0001 | 0.5 | 2 (dropout) | 75 | 0.163 |
| 4 | 0.931289 | 1.85 | 1024 | 0.0001 | 0.15 | 2 (α) | 75 | 0.158 |
| 5 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.15 | 3 (β) | 75 | 0.168 |
| 6 | 0.931289 | 1.1834138 | 1024 | 0.0001 | 0.5 | 4 (dropout) | 75 | 0.064 |
Table 17. An overview of WER improvement over all tests.

| Exp # | Best WER (Old) | Best WER (New) | WER Enhancement (%) |
|---|---|---|---|
| 1 | 0.406 | 0.32 | 21.18 |
| 2 | 0.968 | 0.938 | 3.10 |
| 4 | 0.966 | 0.951 | 1.55 |
| 5 | 0.842 | 0.828 | 1.66 |
| 6 | 0.857 | 0.789 | 7.93 |
| 7 | 0.046 | 0.046 | 0 |
| 8 | 0.049 | 0.044 | 10.20 |
| 9 | 0.128 | 0.093 | 27.34 |
| 10 | 0.295 | 0.289 | 2.03 |
| 11 | 0.16 | 0.064 | 60 |
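The enhancement column is the relative WER reduction with respect to the baseline: experiment #1, for instance, drops from 0.406 to 0.32, a 21.18% reduction. A quick sketch of the arithmetic:

```python
def wer_enhancement_pct(old_wer: float, new_wer: float) -> float:
    """Relative WER reduction: (old - new) / old * 100."""
    return (old_wer - new_wer) / old_wer * 100.0

# Reproduces Table 17's entries for experiments #1 and #11.
print(round(wer_enhancement_pct(0.406, 0.320), 2))  # 21.18
print(round(wer_enhancement_pct(0.160, 0.064), 2))  # 60.0
```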
Table 18. Whisper results.

| Model Size | # of Parameters | WER (Exp #1 and Exp #4) | WER (Exp #2 and Exp #3) |
|---|---|---|---|
| Tiny | 39 M | 0.9298 | 0.8149 |
| Base | 74 M | 0.7492 | 0.6445 |
| Small | 244 M | 0.3772 | 0.3828 |
| Medium | 769 M | 0.1953 | 0.2539 |
| Large | 1550 M | 0.0999 | 0.1579 |
Table 19. WER performance comparison between DeepSpeech and Whisper.

| Experiment | Whisper WER | DeepSpeech WER | WER Enhancement (%) |
|---|---|---|---|
| Experiment #1 | 0.099 | 0.320 | 69.06 |
| Experiment #2 | 0.157 | 0.938 | 83.26 |
| Experiment #4 | 0.157 | 0.951 | 83.49 |
Table 20. Comparing execution time (in seconds) when using five different versions of the Whisper model (i.e., tiny, base, small, medium, large).

| Execution Time [s] | Base | Tiny | Small | Medium | Large |
|---|---|---|---|---|---|
| Experiment #1 | 0.8 | 0.98 | 1.04 | 1.4 | 2.44 |
| Experiment #2 | 0.53 | 0.61 | 0.55 | 0.99 | 1.17 |
| Experiment #4 | 0.8 | 0.98 | 1.04 | 1.4 | 2.44 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Alshboul, M.; Al Muaitah, A.R.; Al-Issa, S.; Al-Ayyoub, M. Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model. Appl. Sci. 2025, 15, 9521. https://doi.org/10.3390/app15179521
