1. Introduction
The Quran is the main source of Arabic vocabulary, with 77,794 unique words in total, including the Basmalah, a special phrase introduced at the start of each Surah. There are 28 distinct letters in the Arabic alphabet (29 when the hamza is taken into consideration).
Millions of Muslims worldwide memorize the Quran and practice reciting it daily. One of the main challenges they face is that they need someone to listen to their recitation to verify that their memorization is correct, which is a tedious and time-consuming process. An automatic recognition system would therefore be very beneficial, allowing students to check their memorization without human help. However, given the special nature of Quranic recitation and phonetics, traditional speech recognition models are usually not very effective at Quranic speech recognition.
A computational procedure called automatic speech recognition (ASR) transforms recorded voice signals into a written representation. Preprocessing, feature extraction, and classification are the three main steps that constitute an ASR system, as described in the literature [1,2,3,4,5]. ASR systems have changed dramatically in recent years, moving from traditional methods such as CMU Sphinx [6,7,8,9,10] to sophisticated methods that make use of deep learning (DL) and deep neural networks (DNNs) [11,12].
To train ASR systems and ensure their optimal performance, large volumes of data are needed. Nevertheless, there are some issues with the audio recordings of Quranic recitations that are currently available. These recordings are extensive, but most of the performers are male reciters with highly developed recitation techniques who use expensive, professional equipment. Because the reciters belong to a specific demographic group (adult males) and the training data are of professional quality and expertise, the resulting model may be biased to some extent, making it less able to process input from other groups such as non- or semiprofessionals, women, or children. This study aims to help mitigate this shortcoming [13,14].
This work focuses mainly on addressing this problem for female recitations. Targeting children’s recitations is also very promising for improving demographic inclusiveness. However, we decided to leave it as future work and not investigate it thoroughly in this paper because it poses several challenges, such as data availability and securing parents’ consent. Nevertheless, we are hopeful that each step in enriching the inclusiveness of Quranic speech recognition (QSR), such as including more female recitations, will improve its overall accuracy even for groups that the model does not directly target.
The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, was first described in our earlier articles [13,14]. We also used this dataset to develop multiple basic neural speech recognition (NSR) systems based on Baidu’s DeepSpeech model as implemented by Mozilla [11,15]. In our most recent work [16], we reported our efforts to tune and enhance these baseline DeepSpeech models. Here, we expand on that work by presenting an NSR system built on a well-known speech recognition model, Whisper [17]. We also describe the effect of this choice on the model’s performance, expressed as word error rate (WER), in comparison to DeepSpeech.
This work focuses on language mistakes in recitations because the main goal is to help people memorize the Quran; incorporating Tajweed rules was therefore beyond the scope of this paper. However, it is worth mentioning that all the recitations used to train and test the model are correct and have been thoroughly revised by a separate entity to guarantee their accuracy. This helps build a foundation for a system that is sensitive to correct recitations and paves the way for more advanced tasks.
The remainder of the paper is structured as follows. Section 2 offers an overview of the relevant literature. Section 3 presents the process we used to optimize the NSR systems we had previously constructed with the DeepSpeech model. Section 4 gives additional details regarding the OpenAI Whisper model. Section 5 presents and discusses the experimental outcomes. Finally, Section 6 concludes the paper.
2. Related Work
Presenting pertinent studies on speech recognition—both historical and contemporary—is crucial to the advancement of our work. Therefore, in this section, we investigate some of the main related works in this field.
When we approach speech as a signal with particular features and attributes, we can define it as vocal sounds made by humans for communication. Theoretically, speech is modeled as an analog wave that varies with time [18].
Signal processing, acoustic models, pronunciation models, language models, and decoding are the main components of a speech recognition (SR) system [19]. There are two kinds of SR systems: classic and nontraditional (neural). For phonemic modeling, traditional systems typically use the hidden Markov model (HMM), a statistical model based on the Markov process [20]. CMU Sphinx [21] is an example of a continuous SR system based on an HMM with a Gaussian mixture model (GMM) for output probabilities.
Prior research on the application of speech recognition to Arabic has involved the identification of the Quran using the Sphinx tool, with particular attention given to Surat Al-Ikhlas. These works achieved a word recognition rate of 90% with male reciters and 85% with female reciters [22]. A different system used the Sphinx tool to teach and recognize the Quran and achieved a 46% word error rate (WER) [23].
A Quran-based speech recognition system was built [24], utilizing the Sphinx framework and the HMM toolkit in the context of Arabic Quran recognition. An automated Tajweed verification engine specifically designed for Surat Al-Fatihah was presented for Quran recognition via Tajweed principles, achieving an accuracy of 91.95% at the Ayah level [25]. Moreover, a system built with CMU Sphinx 4 employed a dataset comprising Surat Al-Fatihah, Al-Ikhlas, Al-Falaq, and Al-Nass recited by 22 distinct renowned reciters. This system produced a WER of 1.5% with a small set of training audio files and a WER of 50.0% when 90% of the audio files were used for training [26].
Deep learning (DL) is an important component of neural (nontraditional) techniques. The following paragraphs discuss a few noteworthy works.
By presenting the CLDNN-HMM model, researchers from the Microsoft Research Institute suggested notable advancements in speech recognition. To estimate the output probabilities for hidden HMM states, this model combines a deep neural network (DNN) and an HMM, using DNN outputs in place of the GMM in the HMM-GMM framework [27].
A different study investigated ASR models other than HMM-DNN, including connectionist temporal classification (CTC), recurrent neural network (RNN)-transducer, and attention-based sequence-to-sequence models. According to the experiments, the DL-based models outperformed the HMM-based models in terms of recognition accuracy [1].
DeepSpeech, a DL-based CTC model, uses deep neural networks to translate speech to text. After training on massive datasets such as the Fisher and Switchboard corpora, it obtained WERs of 12.6% and 19.3% for the CallHome data and the Switchboard Hub5’00 data, respectively. For the English language, this model outperformed earlier SR models [11].
DeepSpeech2, an upgraded version of DeepSpeech, uses RNN and CTC techniques. Its training methodology includes SortaGrad, a curriculum strategy that progressively lengthens the utterances in the training set. DeepSpeech2 is built on the PaddlePaddle implementation [28]. Rather than CTC, DeepSpeech3, an enhanced version of DeepSpeech, uses the RNN-transducer loss function. This approach improved end-to-end speech recognition performance by doing away with the requirement for a language model during inference [29].
Google’s Listen, Attend and Spell (LAS) ASR model consists of three parts: a listener (an encoder that uses bidirectional LSTM layers to translate acoustic inputs into higher-level representations), an attender, and a speller (the decoder). After being trained on Google Voice Search data, the model outperformed Google’s prior model by a margin of 2.3% in WER [30].
After discussing speech recognition in general, the rest of this section discusses research conducted in the area of DL-based Quran recognition.
In accordance with the Quranic recitation guidelines, a DL model that makes use of support vector machines (SVMs) and DBN convolutions was created to help with accurate Quran reading. The Weka tool was used to test this model on unseen records, and the accuracy was 97.7% [31].
Another study employed K-nearest neighbor (KNN) classifiers and artificial neural networks (ANNs) to recognize Quranic recitations. For Surahs 18 and 36, the ANN classifier yielded accuracies of 97.6% and 96.7%, respectively, whereas the KNN classifier produced accuracies of 97.03% and 96.08% for the same Surahs [32].
Built on the DeepSpeech foundation, the DeepSpeech-Quran model [33] aims to assist people in reading the Quran through an ASR system. It makes use of datasets named “Imam-Recitations” and “Imam-Tusers-Recitations,” which contain recitations by professional and semiprofessional reciters.
Using a CNN-bidirectional GRU encoder and the CTC objective function, a novel end-to-end ASR-based DL model for Quran recitation recognition was created. When trained on the Ar-DAD dataset, which consists of 37 Surahs recited by 30 distinct reciters with varying pronunciation norms and speeds, the model achieved a WER of 8.34% and a CER of 2.42% [34].
A model that addresses the shortcomings of earlier Quran recognition studies was also proposed. It uses LSTM neural networks with mel-frequency cepstral coefficient (MFCC) features, taking into account the time-series nature of voice signals. Focused on verbal error detection in reading based on Tajweed rules, the model outperforms conventional machine learning models, with accuracies of 96% for separate stretching, 95% for tight tightening, and 96% for hiding [35].
The Quranic Database for Recitation Correction (QDRC) dataset was also unveiled. It contains corrections for 54 errors committed by 17 volunteers during the recitation of Surat Al-Fatihah, along with their original recordings. The dataset applies Tajweed criteria for Quranic recitation recognition and incorporates voice feature analysis [36].
To handle customary Tajweed-governed Quran recitation, an artificial neural network-based model was created. The model extracts features via mel-frequency cepstral coefficients and trains the neural network via several techniques; the Levenberg–Marquardt algorithm produced the best accuracy of 77.7% [37].
Another related example is QRFAM [13,14], a Quran recitation dataset by females and males. It is a substantial benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, and it is the main dataset used for the experiments in this paper.
Table 1 illustrates the main statistical information of the QRFAM dataset. The full details of the QRFAM configuration can be found in [13].
3. Mozilla’s DeepSpeech Implementation
We use Mozilla’s DeepSpeech NSR model, previously published in [13], as the basis of this work. We then perform several experiments to optimize the neural models by varying the DeepSpeech hyperparameters for training, assessment, and testing [14]. The default DeepSpeech hyperparameters are shown in Table 2 and Table 3.
The language model interprets the phoneme sequence as words, whereas the DeepSpeech model uses an RNN model to generate outputs in the form of phonemes. Our goal was to enhance the decoding objective function to reduce the word error rate (WER). Specifically, we wanted to find the optimal phoneme sequence to substitute for the vocabulary generated by the RNN.
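Since WER is the metric we optimize throughout this section, the following minimal sketch (our own illustration, not code from DeepSpeech) shows how it is computed: the word-level edit distance between a reference and a hypothesis transcript, normalized by the number of reference words.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis transcripts, divided by the number of
    reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word among five reference words gives a WER of 0.2
print(wer("bismillah ar rahman ar rahim", "bismillah ar rahman rahim"))
```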
The main function was described by Hannun et al. [11] as a weighted combination of the RNN score P(c|x) and the language model score Plm(c), where x denotes the input utterance and c is the potential output character sequence, which in our work represents the phonemes. Equation (1) therefore represents the function, where α matches the language model weight and β denotes the word insertion weight:

Q(c) = log(P(c|x)) + α log(Plm(c)) + β word_count(c)     (1)

Using the English-language test dataset in DeepSpeech release 0.9.3 [21], the optimal coefficient values for α and β in Mozilla’s DeepSpeech were (α, β) = (0.75, 1.85), with a beam size of 1024. To ascertain the ideal coefficient values that would result in the lowest WER, we carried out real-world trials. Overall, we found that simultaneously modifying α and β produced the lowest WER.
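As a concrete illustration of Equation (1), the snippet below rescores candidate phoneme sequences using hypothetical RNN and language model log-probabilities; the candidates and their scores are invented for illustration and do not come from our experiments:

```python
def q_score(log_p_rnn, log_p_lm, word_count, alpha=0.75, beta=1.85):
    """Q(c) = log P(c|x) + alpha * log Plm(c) + beta * word_count(c)."""
    return log_p_rnn + alpha * log_p_lm + beta * word_count

# Hypothetical candidates: (sequence, RNN log-prob, LM log-prob)
candidates = [
    ("qul huwa allahu ahad", -4.0, -2.0),
    ("qul huwa allah ahad", -3.8, -3.5),
]
# The decoder keeps the candidate that maximizes Q(c); here the language
# model outweighs the slightly better acoustic score of the second option
best = max(candidates, key=lambda c: q_score(c[1], c[2], len(c[0].split())))
print(best[0])
```

In the full decoder, this score is applied to each hypothesis kept in a beam (of size 1024 in our setting) rather than to a small enumerated list.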
We also looked into changing the number of hidden neurons in the model to improve it and obtain a lower WER [38]. The neural network’s size is determined by the number of hidden neurons; in situations where there is a shortage of data, such as with our female dataset, it is best to select a smaller network size. For real-time applications, such as mobile phone apps, fewer hidden neurons also save memory and CPU resources, increasing their effectiveness. In our experiments, we varied the number of hidden neurons around Mozilla DeepSpeech’s default value of 2048 [39].
Additionally, we tried modifying the dropout rate in an effort to enhance the model even more. In neural networks, dropout is a regularization technique that efficiently lowers overfitting and reduces generalization error. By adjusting the dropout rate coefficient in Mozilla’s DeepSpeech, which is initially set to 0.15 [39], we were able to obtain a lower WER in our experiments.
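To illustrate the mechanism (a minimal sketch of our own, not code from DeepSpeech), inverted dropout zeroes each activation with probability equal to the dropout rate during training and rescales the survivors so that the expected activation is unchanged:

```python
import random

def inverted_dropout(activations, rate, rng=random):
    """Zero each activation with probability `rate`; scale survivors by
    1 / (1 - rate) so the expected value is preserved during training."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.5, 1.0, -0.3, 2.0]
# With the DeepSpeech default rate of 0.15, each entry comes out as
# either 0.0 or the original value divided by 0.85
print(inverted_dropout(acts, rate=0.15, rng=random.Random(0)))
```

At inference time, dropout is disabled; the inverted scaling during training is what makes that possible without further adjustment.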
In conclusion, we optimized our neural models through a series of experiments involving the modification of coefficient values.
4. OpenAI’s Whisper Implementation
Whisper [17] is an advanced ASR model trained on approximately 680,000 hours of supervised data collected from the web, including multilingual and multitask samples. The authors showed that a dataset of this size and variety increases robustness against accents, background noise, and technical terminology. Additionally, the model can transcribe in multiple languages as well as translate from several languages into English. Whisper’s authors laid the foundation for future research on robust speech processing and the development of practical applications by providing the models and inference code [40,41,42,43]. This is quite helpful when adapting the model to other use cases, as we do in this project.
The Whisper model’s ability to handle diverse audio inputs makes it particularly suitable for applications in various domains, including healthcare, customer service, and education. For instance, in healthcare, Whisper can be used to transcribe doctor–patient conversations accurately, ensuring that medical records are precise and comprehensive. In customer service, it can enhance the efficiency of call centers by providing real-time transcriptions of customer interactions, allowing for better analysis and response to customer needs. In education, Whisper can assist in creating accessible learning materials by transcribing lectures and translating them into multiple languages, thus catering to a global audience.
Moreover, Whisper’s open-source nature encourages collaboration and innovation within the research community. By providing access to the models and inference code, OpenAI enables researchers and developers to build upon Whisper’s capabilities, tailoring it to specific needs and exploring new frontiers in speech recognition technology. This openness not only accelerates advancements in the field but also democratizes access to cutting-edge ASR technology, allowing smaller organizations and individual developers to leverage Whisper’s powerful features without the need for extensive resources.
In this work, we evaluate the effectiveness of Whisper in performing some of the same Quranic recognition tasks we recently conducted via DeepSpeech [16]. We also expand this evaluation by presenting a thorough analysis when multiple versions of the Whisper model are employed.
6. Conclusions
This research extends our prior work toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to users of any age, gender, or level of experience. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters spanning a range of age groups and competency levels, was published in our previous publications. Furthermore, for training, validation, and testing, we built several basic NSR systems based on Mozilla’s DeepSpeech model, utilizing various subsets of the QRFAM dataset. Additionally, we demonstrated some improvements that may be applied to the DeepSpeech model to enhance the WER. We expand on these efforts in this study by introducing a well-known speech recognition model and reporting the effect of this choice on the WER in comparison to DeepSpeech.
Our study has advanced significantly with the addition of OpenAI’s Whisper model to our experiments. Our goal was to improve the accuracy and dependability of our NSR systems by utilizing Whisper’s strong architecture and large training dataset. The comparison between DeepSpeech and Whisper offered valuable insights into the advantages and disadvantages of each model. Whisper has been shown to be an effective tool for Quranic recitation recognition, as reflected by its exceptional WER performance across multiple test datasets. With recordings from reciters varying in age, gender, and skill level, the QRFAM dataset is both challenging and diverse, making this improvement all the more remarkable.
Moreover, Whisper’s open-source design promotes ongoing development and adoption. Building on our findings, scholars and developers might investigate new applications and improvements for Quranic recitation recognition. This cooperative approach guarantees that the technology remains useful and accessible to a broad spectrum of consumers while also expediting progress in the industry. Our research shows how crucial it is to include cutting-edge ASR models, such as Whisper, into specific applications to pave the way for future speech recognition technologies that will be more accurate and inclusive.