Applied Sciences · Article · Open Access

29 August 2025

Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model †

1 Department of Computer Engineering, Jordan University of Science and Technology, Irbid 22110, Jordan
2 Department of Computer Science and Engineering, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates
3 Department of Information Technology, Ajman University, Ajman, United Arab Emirates
4 Department of Computer Science, Jordan University of Science and Technology, Irbid 22110, Jordan

Abstract

In this work, we build on our recent efforts toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to people of any age, gender, or expertise level. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, was introduced in our prior works. Using various subsets of the QRFAM dataset for training, validation, and testing, we built several baseline NSR systems based on Mozilla’s DeepSpeech model, and we presented our efforts to optimize and enhance these baseline models. In this study, we expand this work by utilizing one of the well-known speech recognition models, Whisper, and we describe the effect of this choice on the model’s accuracy, expressed as the word error rate (WER), in comparison to that of DeepSpeech.

1. Introduction

The Quran is the main source of Arabic vocabulary, with 77,794 unique words in total, including the Basmalah, a special phrase recited at the start of each Surah. The Arabic alphabet contains 28 distinct letters (29 when the hamza is counted).
Millions of Muslims worldwide memorize the Quran and practice reciting it daily. One of the main challenges they face is that they need someone to listen to their recitation to verify that their memorization is correct, which is a tedious and time-consuming process. An automatic recognition system that helps students check their memorization without human assistance would therefore be very beneficial. However, given the special nature of Quranic recitation and phonetics, traditional speech recognition models are usually not very effective at recognizing Quranic speech.
A computational procedure called automatic speech recognition (ASR) transforms recorded voice signals into a written representation. Preprocessing, feature extraction, and classification are the three main steps that constitute an ASR system, as described in the literature [1,2,3,4,5]. ASR systems have changed dramatically in recent years, moving from traditional methods such as CMU Sphinx [6,7,8,9,10] to sophisticated methods that make use of deep learning (DL) and deep neural networks (DNNs) [11,12].
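To make this three-stage pipeline concrete, the following minimal Python sketch illustrates the preprocessing and feature-extraction steps that typically precede classification. It is only an illustration, assuming the librosa library and a hypothetical recording named recitation.wav, not the exact front end of the systems studied here.

```python
# Minimal illustration of the first two ASR stages (preprocessing and
# feature extraction); the file name "recitation.wav" is hypothetical.
import librosa

# Preprocessing: load the recording, resample to 16 kHz mono,
# and trim leading/trailing silence.
audio, sr = librosa.load("recitation.wav", sr=16000, mono=True)
audio, _ = librosa.effects.trim(audio, top_db=25)

# Feature extraction: 13 MFCCs per 25 ms frame with a 10 ms hop,
# a common input representation for ASR classifiers.
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, number_of_frames)
```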
To train ASR systems and ensure their optimal performance, large volumes of data are needed. Nevertheless, the audio recordings of Quranic recitations that are currently available have some issues. These recordings are extensive, but most of the performers are male reciters with highly developed recitation techniques who use expensive, professional equipment. Because the reciters belong to a specific demographic group (adult males), and because of the level of expertise and quality reflected in the training data, among other factors, the resulting model may be biased to some extent, making it less able to process input from other groups such as non- or semiprofessionals, women, or children. This study aims to help mitigate this shortcoming [13,14].
This work focuses mainly on addressing this problem for female recitations. Targeting children’s recitations can be very promising in improving demographic inclusiveness. However, we decided to leave it for future work and not investigate it thoroughly in this paper because of several challenges, such as data availability and securing parents’ consent. Nevertheless, we are hopeful that each step in enriching the inclusiveness of the QSR (e.g., including more female recitations) will improve its overall accuracy even for groups that the model did not directly target.
The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, was first described in our earlier articles [13,14]. We also developed multiple basic NSR systems using this dataset, based on Baidu’s DeepSpeech model as implemented by Mozilla [11,15]. We reported our efforts to tune and enhance these baseline DeepSpeech models in our most recent work [16]. Here, we expand on this work by presenting an NSR system that makes use of a well-known speech recognition model, Whisper [17]. We also describe the effect of this choice on the model’s performance, expressed as the word error rate (WER), in comparison to DeepSpeech.
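Throughout this paper, the WER follows its standard definition, WER = (S + D + I)/N, where S, D, and I are the numbers of word substitutions, deletions, and insertions needed to transform the recognized text into the reference transcript, and N is the number of words in the reference.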
This work focuses on the language mistakes in recitations because the main goal is to help people memorize the Quran; incorporating Tajweed rules was therefore beyond the scope of this paper. However, it is worth mentioning that all the recitations used to train and test the model are correct and have been thoroughly revised by a separate entity to guarantee their accuracy. This helps build a foundation for a system that is sensitive to correct recitations and paves the way for more advanced tasks.
The remainder of the paper is structured as follows. Section 2 offers an overview of the relevant literature. Section 3 presents the process we used to optimize the NSR systems we had previously constructed with the DeepSpeech model. Section 4 gives additional details regarding the OpenAI Whisper model. Section 5 presents and discusses the experimental outcomes. Finally, Section 6 concludes the paper.

3. Mozilla’s DeepSpeech Implementation

We use Mozilla’s DeepSpeech NSR model, previously published in [13], as the basis of this work. We then perform several experiments to optimize the neural models by varying DeepSpeech’s training, evaluation, and testing hyperparameters [14]. The default DeepSpeech hyperparameters are shown in Table 2 and Table 3.
Table 2. DeepSpeech hyperparameters used for training.
Table 3. DeepSpeech hyperparameters used for evaluation and testing.
The DeepSpeech model uses an RNN to generate outputs in the form of phonemes, and the language model then interprets the phoneme sequence as words. Our goal was to enhance the decoding objective function to reduce the word error rate (WER). Specifically, we wanted to find the optimal phoneme sequence to substitute for the vocabulary generated by the RNN.
The objective function was described by Hannun et al. [11] as a weighted combination of the RNN score P(c|x) and the language model score Plm(c), where x denotes the input utterance and c is the potential output character sequence, which in our work represents the phonemes. Equation (1) gives the resulting function, where α is the language model weight and β is the word insertion weight.
Q(c) = log(P(c|x)) + α log(Plm(c)) + β word_count(c)    (1)
Using the English-language test dataset of DeepSpeech release 0.9.3 [21], the optimal coefficient values for Mozilla’s DeepSpeech were (α, β) = (0.75, 1.85) with a beam size of 1024. To determine the coefficient values that would result in the lowest WER on our data, we carried out real-world trials. Overall, we found that modifying α and β simultaneously produced the lowest WER.
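As an illustration of how candidate (α, β) values are plugged into the decoding objective of Equation (1), the following sketch uses the deepspeech Python inference package (version 0.9.x). The model, scorer, and audio file names are placeholders, and the grid of candidate values is hypothetical, not the exact setup used in our experiments.

```python
# Hedged sketch: rescoring the DeepSpeech beam-search decoder with different
# language-model weights (alpha) and word-insertion weights (beta).
# File names and the candidate grid are placeholders.
import wave
import numpy as np
from deepspeech import Model

ds = Model("quran_model.pbmm")             # acoustic model (placeholder name)
ds.enableExternalScorer("quran.scorer")    # external language-model scorer (placeholder name)
ds.setBeamWidth(1024)                      # beam size used in the paper

with wave.open("recitation.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

for alpha in (0.5, 0.75, 1.0):
    for beta in (1.0, 1.85, 2.5):
        ds.setScorerAlphaBeta(alpha, beta)  # plug (alpha, beta) into Q(c)
        print(alpha, beta, ds.stt(audio))   # transcript to be scored against the reference
```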
We also investigated changing the number of hidden neurons in the model to obtain a lower WER [38]. The number of hidden neurons determines the size of the neural network; in situations where data are scarce, such as in our female dataset, it is best to select a smaller network size. For real-time applications, such as mobile phone apps, fewer hidden neurons also save memory and CPU resources, increasing their effectiveness. We therefore adjusted the number of hidden neurons in Mozilla’s DeepSpeech, whose default value is 2048 [39].
Additionally, we tried modifying the dropout rate in an effort to enhance the model even more. In neural networks, dropout is a regularization technique that efficiently lowers overfitting and reduces generalization errors. By adjusting the dropout rate coefficient in Mozilla’s DeepSpeech, which is set to 0.15 by default [39], we were able to obtain a lower WER in our experiments.
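To summarize the training-side knobs discussed above, the following sketch shows how a hypothetical sweep over hidden-layer size and dropout rate could be driven from Python by invoking DeepSpeech’s training script. Flag names follow Mozilla DeepSpeech 0.9.x; the file paths and candidate values are placeholders, not our actual configuration.

```python
# Hedged sketch of a hyperparameter sweep over hidden-layer size and dropout rate.
# Flag names follow Mozilla DeepSpeech 0.9.x; paths and values are placeholders.
import subprocess

for n_hidden in (1024, 2048):
    for dropout in (0.05, 0.15, 0.25):
        subprocess.run([
            "python", "DeepSpeech.py",
            "--train_files", "qrfam_train.csv",
            "--dev_files", "qrfam_dev.csv",
            "--test_files", "qrfam_test.csv",
            "--n_hidden", str(n_hidden),        # network size (default 2048)
            "--dropout_rate", str(dropout),     # dropout rate (default 0.15)
            "--checkpoint_dir", f"ckpt_h{n_hidden}_d{dropout}",
            "--export_dir", f"model_h{n_hidden}_d{dropout}",
        ], check=True)
```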
In conclusion, we optimized our neural models through a series of experiments involving the modification of coefficient values.

4. OpenAI’s Whisper Implementation

Whisper [17] is an advanced ASR model that was developed by training on approximately 680,000 h of supervised web data covering multilingual and multitask samples. The authors showed that using a dataset of this size and variety increases resilience against accents, background noise, and technical terminology. The model also supports transcription in multiple languages as well as translation from several of those languages into English. Whisper’s authors laid the foundation for future research on robust speech processing and the development of practical applications by providing the models and inference code [40,41,42,43]. This is quite helpful when adapting the model to other use cases, such as what we are doing in this project.
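As a minimal illustration of the released inference code, the sketch below uses the open-source openai-whisper package to transcribe an Arabic recording with the “large” checkpoint. The audio file name is a placeholder, and this is only a sketch of the public API, not the exact pipeline used in our experiments.

```python
# Hedged sketch of transcribing a recitation with the openai-whisper package.
# The audio file name is a placeholder; released model sizes include
# tiny, base, small, medium, and large.
import whisper

model = whisper.load_model("large")
result = model.transcribe("recitation.wav", language="ar")  # force Arabic decoding
print(result["text"])
```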
The Whisper model’s ability to handle diverse audio inputs makes it particularly suitable for applications in various domains, including healthcare, customer service, and education. For instance, in healthcare, Whisper can be used to transcribe doctor–patient conversations accurately, ensuring that medical records are precise and comprehensive. In customer service, it can enhance the efficiency of call centers by providing real-time transcriptions of customer interactions, allowing for better analysis and response to customer needs. In education, Whisper can assist in creating accessible learning materials by transcribing lectures and translating them into multiple languages, thus catering to a global audience.
Moreover, Whisper’s open-source nature encourages collaboration and innovation within the research community. By providing access to the models and inference code, OpenAI enables researchers and developers to build upon Whisper’s capabilities, tailoring it to specific needs and exploring new frontiers in speech recognition technology. This openness not only accelerates advancements in the field but also democratizes access to cutting-edge ASR technology, allowing smaller organizations and individual developers to leverage Whisper’s powerful features without the need for extensive resources.
In this work, we evaluate the effectiveness of Whisper in performing some of the same Quranic recognition tasks we recently conducted with DeepSpeech [16]. We also expand this evaluation by presenting a thorough analysis of multiple versions of the Whisper model.

5. Experiments and Results

We developed numerous iterations of each neural model, looking for configurations in which it might perform better. These models had been improved before in [14], except for the one in Experiment #3 (FFF).
This section presents these optimizations along with the corresponding WER improvements. The WER of the baseline (non-fine-tuned) DeepSpeech model is shown in Table 4, and the details of all conducted experiments are shown in Table 5 and Table 6. In all these experiments, speaker independence was ensured by guaranteeing that each reciter would only be included in one split (e.g., training or testing).
Table 4. The WER of the baseline (non-fine-tuned) DeepSpeech model.
Table 5. Summary of the first six experiments.
Table 6. Summary of the next five experiments.
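The speaker-independent splits mentioned above can be illustrated with the following sketch, which groups utterances by reciter ID so that no reciter appears in both the training and test sets. The metadata file and its column names are hypothetical, and scikit-learn’s GroupShuffleSplit is used here only for illustration, not as the exact procedure we followed.

```python
# Hedged sketch of a speaker-independent split: every reciter ends up in
# exactly one partition. The CSV path and column names are placeholders.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("qrfam_metadata.csv")  # columns: wav_path, transcript, reciter_id
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["reciter_id"]))

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# No reciter may appear in both partitions.
assert set(train_df["reciter_id"]).isdisjoint(test_df["reciter_id"])
```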

5.1. Experiments on Small Datasets of Males and Females

5.1.1. Experiment #1 (MMM)

In this experiment, we lowered the WER from 0.406 to 0.32. We started from a previous attempt (attempt #24), which achieved the lowest WER value in Experiment #3; since Experiment #1 is identical to Experiment #3 except for gender, this choice reduced the time spent between attempts. Table 7 lists these attempts. Each attempt builds on an earlier one, as indicated by the column titled “BA # (MP)”, which stands for “Base Attempt # (Modified Parameter)”. For example, Attempt #10 is based on Attempt #9 with a modified β, while Attempt #1 is based on the original Experiment #1. The best WER achieved is shown in Table 7.
Table 7. The improvement outcome of experiment #1 (baseline WER is 0.406).

5.1.2. Experiment #2 (MMF)

The WER obtained in this experiment was 0.968, which we lowered to 0.938. As in Experiment #1, we started from the previous attempt (attempt #24), which produced the lowest WER value in Experiment #3. Table 8 lists these attempts.
Table 8. The improvement outcome of experiment #2 (baseline WER is 0.968).

5.1.3. Experiment #4 (FFM)

We were able to lower the WER obtained in this experiment from 0.966 to 0.951 after multiple attempts. All attempts are detailed in Table 9.
Table 9. The improvement outcome of experiment #4 (baseline WER is 0.966).

5.1.4. Experiment #5 (MMF) (All Available Female Records Used in Testing)

We were able to reduce the WER obtained in this experiment from 0.842 to 0.828 after multiple attempts. Table 10 lists all the attempts.
Table 10. The improvement outcome of experiment #5 (baseline WER is 0.842).

5.1.5. Experiment #6 (FFM) (All Available Female Records Used in Training and Development)

We were able to reduce the WER obtained in this experiment from 0.857 to 0.789 after multiple attempts. Table 11 lists all the attempts.
Table 11. The improvement outcome of experiment #6 (baseline WER is 0.857).

5.2. Experiments on the Male Dataset

5.2.1. Experiment #7 (MMM) (≤10 s, C)

Despite our best efforts, we were unable to lower the WER in this experiment; the WER kept rising, so we settled for the initial value of 0.046, which was the lowest among all attempts. Table 12 lists all the attempts.
Table 12. The improvement outcome of experiment #7 (baseline WER is 0.046).

5.2.2. Experiment #8 (MMM) (≤10 s, D)

The WER obtained in this experiment was 0.049, which we were able to lower to 0.044 after multiple attempts. Table 13 lists all the attempts.
Table 13. The improvement outcome of experiment #8 (baseline WER is 0.049).

5.2.3. Experiment #9 (MMM) (<1 s, ≤30 s, D)

We were able to lower the WER obtained in this experiment from 0.128 to 0.093 after multiple attempts. Table 14 lists every attempt.
Table 14. The improvement outcome of experiment #9 (baseline WER is 0.128).

5.2.4. Experiment #10 (MMM) (≤30 s, D)

The WER obtained in this experiment was 0.295, which we were able to lower to 0.289 after multiple attempts. Table 15 lists every attempt.
Table 15. The improvement outcome of experiment #10 (baseline WER is 0.295).

5.2.5. Experiment #11 (MMM) (≤45 s, D)

We were able to reduce the WER obtained in this experiment from 0.16 to 0.064 after multiple attempts. Table 16 lists all the attempts.
Table 16. The improvement outcome of experiment #11 (baseline WER is 0.16).
An overview of the improvements made in these experiments can be seen in Table 17. As the table illustrates, every experiment (apart from Exp #7) demonstrated an improvement in the WER. In some experiments, this improvement exceeded 20% (Exp #1 and Exp #9).
Table 17. An overview of the WER improvements across all experiments.
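As a worked example of these figures (assuming the improvement is measured relative to the baseline WER), Exp #1 went from 0.406 to 0.32, a relative improvement of (0.406 − 0.32)/0.406 ≈ 21.2%, and Exp #9 went from 0.128 to 0.093, a relative improvement of (0.128 − 0.093)/0.128 ≈ 27.3%; both exceed 20%.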

5.3. Experiments on OpenAI’s Whisper

OpenAI’s Whisper was tested on the same test datasets used in Exp #1 and Exp #4, as well as Exp #2 and Exp #3. All five model sizes were tested. Table 18 presents the WERs achieved by each of the model sizes. Table 19 provides a side-by-side WER performance comparison between the optimized DeepSpeech models and Whisper’s “Large” model, where “WER difference” represents the difference between the WERs of Whisper and DeepSpeech.
Table 18. Whisper results.
Table 19. WER performance comparison between DeepSpeech and Whisper.
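A minimal sketch of how the Whisper side of this comparison can be produced is shown below, looping over the five checkpoints and scoring each transcript against the reference text with the jiwer package. The metadata file, its columns, and the lack of text normalization are placeholders for the actual QRFAM test sets and evaluation pipeline.

```python
# Hedged sketch: computing a WER per Whisper model size on a small test set.
# The CSV path, column names, and text normalization are placeholders.
import pandas as pd
import whisper
from jiwer import wer

test = pd.read_csv("qrfam_test.csv")  # columns: wav_path, transcript
for size in ("tiny", "base", "small", "medium", "large"):
    model = whisper.load_model(size)
    hyps = [model.transcribe(p, language="ar")["text"] for p in test["wav_path"]]
    print(size, wer(list(test["transcript"]), hyps))
```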
The results from these experiments highlight the strengths and weaknesses of Whisper across different model sizes. In Exp #1 and Exp #4, Whisper’s performance varied significantly depending on the model size, with the “Large” model consistently achieving the lowest WER. This finding indicates that the larger models, with their increased capacity, are better at handling the complexities of the test datasets. The smaller models, while still performing admirably, presented higher WERs, suggesting that they might struggle with more challenging audio inputs or less common linguistic patterns.
In Exp #2 and Exp #3, similar trends were observed. The “Large” Whisper model outperformed the smaller models, demonstrating its robustness and accuracy in diverse scenarios. The comparison in Table 19 between the optimized DeepSpeech models and Whisper’s “Large” model further underscores Whisper’s superiority. The “WER difference” metric reveals that Whisper consistently achieves lower WERs than DeepSpeech, highlighting its advanced capabilities in speech recognition. This performance gap suggests that Whisper’s architecture and training data give it a significant edge in accurately transcribing speech, even in challenging conditions. The fact that the WER is much better on average supports the paper’s overall conclusion, even though this may not be evident in every single experiment, since some experiments showed only small accuracy improvements.
Overall, these experiments demonstrate Whisper’s potential as a leading ASR system, capable of delivering high accuracy across various tasks and datasets. The detailed analysis of WER across different model sizes and comparisons with DeepSpeech provide valuable insights into the strengths of Whisper, making it a promising tool for future applications in speech recognition.
Table 19 also reports the “WER enhancement (%)”, which represents the percentage improvement in the WER of Whisper relative to the WER of DeepSpeech.
It is worth noting that improvements in word error rates (WERs) obtained with different models (or versions of a model) often come with variations in model complexity and the associated costs. To explore this, Table 20 compares the execution time (in seconds) of the different versions of the Whisper model. This clarifies the trade-offs involved with the different models we used and shows the costs that can accompany the accuracy improvements reported earlier in Table 18.
Table 20. Comparing execution time (in seconds) when using five different versions of the Whisper model (i.e., tiny, base, small, medium, large).
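The execution times in Table 20 can be reproduced in spirit with a simple timing loop such as the sketch below. The audio file name is a placeholder, and actual times depend heavily on the hardware used, so this is an illustration rather than our measurement script.

```python
# Hedged sketch: timing transcription with each Whisper model size.
# The audio file name is a placeholder; results depend on the hardware.
import time
import whisper

for size in ("tiny", "base", "small", "medium", "large"):
    model = whisper.load_model(size)
    start = time.perf_counter()
    model.transcribe("recitation.wav", language="ar")
    print(f"{size}: {time.perf_counter() - start:.1f} s")
```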

6. Conclusions

This research extends our prior work toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to users of any age, gender, or level of experience. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters spanning a range of age groups and competency levels, was published in our previous publications. Furthermore, we built several basic NSR systems based on Mozilla’s DeepSpeech model, utilizing various subsets of the QRFAM dataset for training, validation, and testing. We also demonstrated some improvements that may be applied to the DeepSpeech model to reduce the WER. We expand on these efforts in this study by introducing a well-known speech recognition model, Whisper, and reporting the effect of this choice on the WER in comparison to DeepSpeech.
Our study has advanced significantly with the addition of OpenAI’s Whisper model to our experiments. Our goal was to improve the accuracy and dependability of our NSR systems by utilizing Whisper’s strong architecture and large training dataset. The comparison between DeepSpeech and Whisper offered valuable insights into the advantages and disadvantages of each model. Whisper has been shown to be an effective tool for Quranic recitation recognition, as reflected by its strong WER performance across multiple test datasets. With recordings from reciters varying in age, gender, and skill level, the QRFAM dataset is both challenging and diverse, making this improvement all the more notable.
Moreover, Whisper’s open-source design promotes ongoing development and adoption. Building on our findings, researchers and developers can investigate new applications and improvements for Quranic recitation recognition. This collaborative approach keeps the technology useful and accessible to a broad spectrum of users while also expediting progress in the field. Our research shows how important it is to integrate cutting-edge ASR models, such as Whisper, into specific applications to pave the way for future speech recognition technologies that are more accurate and inclusive.

Author Contributions

Conceptualization, M.A. and M.A.-A.; Data curation, A.R.A.M. and S.A.-I.; Investigation, M.A.-A.; Methodology, A.R.A.M. and S.A.-I.; Project administration, M.A. and M.A.-A.; Software, A.R.A.M. and S.A.-I.; Writing—original draft, M.A.; Writing—review and editing, M.A., A.R.A.M., S.A.-I. and M.A.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jordan University of Science and Technology (Grant Research ID: 20230113). The authors would also like to thank Ajman University for partially supporting this work.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gaikwad, S.K.; Gawali, B.W.; Yannawar, P. A review on speech recognition technique. Int. J. Comput. Appl. 2010, 10, 16–24. [Google Scholar] [CrossRef]
  2. Ghai, W.; Singh, N. Literature review on automatic speech recognition. Int. J. Comput. Appl. 2012, 41, 42–50. [Google Scholar] [CrossRef]
  3. Shanthi, T.S.; Lingam, C. Review of feature extraction techniques in automatic speech recognition. Int. J. Sci. Eng. Technol. 2013, 2, 479–484. [Google Scholar]
  4. Dixit, R.; Kaur, N. Speech recognition using stochastic approach: A review. Int. J. Innov. Res. Sci. Eng. Technol. 2013, 2, 356–361. [Google Scholar]
  5. Gamit, M.R.; Dhameliya, K.; Bhatt, N.S. Classification techniques for speech recognition: A review. Int. J. Emerg. Technol. Adv. Eng. 2015, 5, 58–63. [Google Scholar]
  6. Lee, K.-F.; Hon, H.-W.; Reddy, R. An overview of the SPHINX speech recognition system. IEEE Trans. Acoust. Speech Signal Process. 1990, 38, 35–45. [Google Scholar] [CrossRef]
  7. Djemili, R.; Bedda, M.; Bourouba, H. Recognition of spoken Arabic digits using neural predictive hidden Markov models. Int. Arab J. Inf. Technol. 2004, 1, 226–233. [Google Scholar]
  8. Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1994; Volume 247. [Google Scholar]
  9. Burget, L.; Schwarz, P.; Agarwal, M.; Akyazi, P.; Feng, K.; Ghoshal, A.; Glembek, O.; Goel, N.; Karafiát, M.; Povey, D.; et al. Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4334–4337. [Google Scholar]
  10. Su, D.; Wu, X.; Xu, L. GMM-HMM acoustic model training by a two-level procedure with Gaussian components determined by automatic model selection. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4890–4893. [Google Scholar]
  11. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567. [Google Scholar]
  12. Campomanes-Álvarez, B.R.; Quirós, P.; Fernández, B. Semi-Supervised Learning for Spanish Speech Recognition Using Deep Neural Networks. In Applications of Intelligent Systems; IOS Press: Amsterdam, The Netherlands, 2018; pp. 19–29. [Google Scholar]
  13. Al-Issa, S.; Al-Ayyoub, M.; Al-Khaleel, O.; Elmitwally, N. Towards Building a Speech Recognition System for Quranic Recitations: A Pilot Study Involving Female Reciters. Jordan J. Electr. Eng. 2022, 8, 307–321. [Google Scholar] [CrossRef]
  14. Al-Issa, S.; Al-Ayyoub, M.; Al-Khaleel, O.; Elmitwally, N. Building a neural speech recognizer for quranic recitations. Int. J. Speech Technol. 2023, 26, 1131–1151. [Google Scholar] [CrossRef]
  15. Mozilla/DeepSpeech · GitHub. Available online: https://github.com/mozilla/DeepSpeech (accessed on 3 August 2025).
  16. Al-Issa, S.; Alshboul, M.; Al-Ayyoub, M. Enhanced Neural Speech Recognizer for Quranic Recitations. In Proceedings of the 2023 International Conference on Multimedia Computing, Networking and Applications (MCNA), Valencia, Spain, 19–22 June 2023; pp. 62–66. [Google Scholar] [CrossRef]
  17. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
  18. Rabiner, L.; Juang, B.-H. Fundamentals of Speech Recognition; Prentice-Hall, Inc.: Hoboken, NJ, USA, 1993. [Google Scholar]
  19. Huang, X.; Deng, L. An overview of modern speech recognition. In Handbook of Natural Language Processing; Chapman and Hall/CRC: Boca Raton, FL, USA, 2010; Volume 2, pp. 339–366. [Google Scholar]
  20. Juang, B.-H.; Rabiner, L.R. Hidden Markov models for speech recognition. Technometrics 1991, 33, 251–272. [Google Scholar] [CrossRef]
  21. Lamere, P.; Kwok, P.; Gouvea, E.; Raj, B.; Singh, R.; Walker, W.; Warmuth, M.; Wolf, P. The CMU SPHINX-4 speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong, China, 6–10 April 2003; Volume 1, pp. 2–5. [Google Scholar]
  22. Tabbal, H.; El Falou, W.; Monla, B. Analysis and implementation of a ‘Quranic’ verses delimitation system in audio files using speech recognition techniques. In Proceedings of the 2006 2nd International Conference on Information & Communication Technologies, Damascus, Syria, 24–28 April 2006; Volume 2, pp. 2979–2984. [Google Scholar]
  23. Hyassat, H.; Abu Zitar, R. Arabic speech recognition using SPHINX engine. Int. J. Speech Technol. 2006, 9, 133–150. [Google Scholar] [CrossRef]
  24. Abushariah, M.A.M. TAMEEM V1.0: Speakers and text-independent Arabic automatic continuous speech recognizer. Int. J. Speech Technol. 2017, 20, 261–280. [Google Scholar] [CrossRef]
  25. Raja-Yusof, R.J.; Fadila Grine, D.; Jamaliah Ibrahim, N.; Yamani Idna Idris, M.; Razak, Z.; Naemah Abdul Rahman, N. Automated tajweed checking rules engine for Quranic learning. Multicult. Educ. Technol. J. 2013, 7, 275–287. [Google Scholar] [CrossRef]
  26. El Amrani, M.Y.; Rahman, M.H.; Wahiddin, M.R.; Shah, A. Building CMU Sphinx language model for the Holy Quran using simplified Arabic phonemes. Egypt. Inform. J. 2016, 17, 305–314. [Google Scholar] [CrossRef]
  27. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 2011, 20, 30–42. [Google Scholar] [CrossRef]
  28. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
  29. Battenberg, E.; Chen, J.; Child, R.; Coates, A.; Li, Y.G.Y.; Liu, H.; Satheesh, S.; Sriram, A.; Zhu, Z. Exploring neural transducers for end-to-end speech recognition. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 206–213. [Google Scholar]
  30. Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964. [Google Scholar]
  31. Al-Ayyoub, M.; Damer, N.A.; Hmeidi, I. Using deep learning for automatically determining correct application of basic Quranic recitation rules. Int. Arab J. Inf. Technol. 2018, 15, 620–625. [Google Scholar]
  32. Alkhateeb, J.H. A machine learning approach for recognizing the Holy Quran reciter. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 268–271. [Google Scholar] [CrossRef]
  33. Tariq Eldeeb· Mozilla/DeepSpeech · GitHub. Available online: https://github.com/tarekeldeeb/DeepSpeech-Quran (accessed on 3 August 2025).
  34. Harere, A.A.; Jallad, K.A. Quran Recitation Recognition using End-to-End Deep Learning. arXiv 2023, arXiv:2305.07034. [Google Scholar]
  35. Harere, A.A.; Jallad, K.A. Mispronunciation Detection of Basic Quranic Recitation Rules using Deep Learning. arXiv 2023, arXiv:2305.06429. [Google Scholar]
  36. Alobaylani, A.A.; Parvez, M.T.; Alsuhibany, S.A. A Database for Speech Processing Based Qur’anic Recitation Correction. In Proceedings of the 2020 International Conference on Computing and Information Technology (ICCIT-1441), Tabuk, Saudi Arabia, 9–10 September 2020; pp. 1–5. [Google Scholar]
  37. Ahmad, F.; Yahya, S.Z.; Saad, Z.; Ahmad, A.R. Tajweed classification using artificial neural network. In Proceedings of the 2018 International Conference on Smart Communications and Networking (SmartNets), Yasmine Hammamet, Tunisia, 16–17 November 2018; pp. 1–4. [Google Scholar]
  38. Lou, H.-L. Implementing the Viterbi algorithm. IEEE Signal Process. Mag. 1995, 12, 42–52. [Google Scholar] [CrossRef]
  39. Mozilla. Releases · Mozilla/DeepSpeech · GitHub. Available online: https://github.com/mozilla/DeepSpeech/releases (accessed on 3 August 2025).
  40. Galić, J.; Marković, B.; Grozdić, Đ.; Popović, B.; Šajić, S. Whispered Speech Recognition Based on Audio Data Augmentation and Inverse Filtering. Appl. Sci. 2024, 14, 8223. [Google Scholar] [CrossRef]
  41. Yang, J.; Zhou, R. Whisper40: A Multi-Person Chinese Whisper Speaker Recognition Dataset Containing Same-Text Neutral Speech. Information 2024, 15, 184. [Google Scholar] [CrossRef]
  42. Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227. [Google Scholar] [CrossRef]
  43. Kong, S.; Li, C.; Fang, C.; Yang, P. Building a Speech Dataset and Recognition Model for the Minority Tu Language. Appl. Sci. 2024, 14, 6795. [Google Scholar] [CrossRef]
