Arabic Automatic Speech Recognition: A Systematic Literature Review

Abstract: Automatic Speech Recognition (ASR), also known as Speech-To-Text (STT) or computer speech recognition, has recently been an active field of research. This study aims to chart this field by performing a Systematic Literature Review (SLR) to give insight into the ASR studies proposed, especially for the Arabic language. The purpose is to highlight the trends of research on Arabic ASR and to guide researchers with the most significant studies published over the ten years from 2011 to 2021. This SLR tackles seven specific research questions related to the toolkits used for developing and evaluating Arabic ASR, the supported variants of the Arabic language, the feature extraction and classification techniques used, the type of speech recognition, the performance of Arabic ASR, the existing gaps facing researchers, and directions for future research. Across five databases, 38 studies met our defined inclusion criteria. Our results showed different open-source toolkits that support Arabic speech recognition; the most prominent were the KALDI, HTK, and CMU Sphinx toolkits. A total of 89.47% of the retained studies cover Modern Standard Arabic, whereas 26.32% of them were dedicated to different dialects of Arabic. MFCC and HMM were the most used feature extraction and classification techniques, respectively: 63% of the papers were based on MFCC and 21% on HMM. The review also shows that the performance of Arabic ASR systems depends mainly on criteria related to the availability of resources, the techniques used for acoustic modeling, and the datasets used.


Introduction
Automatic Speech Recognition (ASR) represents a particular case of digital signal processing, which draws on statistics, phonetics, linguistics, and machine learning. It can be defined as a technology by which spoken words are converted into a textual representation, using software to recognize human voice and speech [1,2]. Recently, automatic speech recognition systems have become the subject of increasing interest for diverse speech/language researchers and academics. This interest is reflected in their emergence in various areas, such as health, education, dictation, and robotics [3]. With the rapid progress of technology, ASR systems are adopted by various applications due to their functionality and ease of use. For example, they are applied in dictation software, which can be a constructive PC tool for accessibility benefits. Another application is using voice control commands and search on mobile devices. They can also be associated with speech translation from a source to a target language. ASR systems have long been considered a valuable and helpful input technique for users with a range of disabilities, since speech replaces traditional manual input via keyboard and mouse [4]. However, automatic speech recognition is considered a challenging task in the signal processing field, as it requires several layers of processing to reach a high level of accuracy and a low Word Error Rate (WER).
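The WER mentioned above is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR hypothesis, divided by the reference length. The following is a minimal illustrative sketch, not tied to any specific study in this review:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the hat sat"))  # one substitution out of three words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why accuracy and WER are usually reported together.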
The Arabic language is considered one of the official languages in twenty-two countries situated in the Middle East, Africa, and the Gulf. It is ranked as the fifth most extensively spoken language. Three main variants of Arabic can be distinguished:
• Classical Arabic represents the most formal and standard form of Arabic, as it is mainly used in the Holy Quran and the religious instructions of Islam [1];
• Modern Standard Arabic (MSA) represents the current formal linguistic standard of the Arabic language. It is generally used in written communication and media, and is taught in educational institutions [1];
• Dialectal Arabic (DA), also called colloquial Arabic, is a variation of the same language specific to countries or social groups and used in everyday life. Various dialects of Arabic exist, and sometimes more than one DA can be used within a country [1].
Based on the study of Elnagar et al. [6], DA can be categorized into several regional varieties. These include, for instance, the Nile Basin variety, covering the Egyptian and Sudanese dialects, and the Levantine variety, often used in Syria, Lebanon, Palestine, and western Jordan [9].
Compared to the existing research on ASR for the English language, ASR for the Arabic language has received little attention, as Arabic is considered a limited-resource language [10]. The main challenges of the Arabic language remain the existence of numerous dialects with varied pronunciations, the morphological complexity, and the difficulty of acquiring diacritized transcriptions of speech corpora, which are rarely open-source [11]. Toward building a robust Arabic ASR system, it is highly recommended to use extensive speech collections. According to [12], an extensive vocabulary means that the dataset contains approximately 20k to 60k words.
Several overviews and survey studies have been published to review various aspects of Arabic speech recognition. In 2018, the authors of [13] published a literature survey paper that discusses the Arabic ASR. The survey shows that few freely available continuous speech corpora exist. It also shows a need to compile large corpora. In another study, Algihab et al. [14] review the available studies on Arabic speech recognition along with the available services and toolkits for the development of Arabic speech recognition systems. The focus was on Arabic ASR using deep learning. Seventeen papers were reviewed and presented according to the recognized entity and learning techniques. A more recent study by Abdelhamid et al. [8] presents Arabic speech recognition systems from the end-to-end methodology perspective. The study focuses on two types of the Arabic language, namely MSA and dialectal Arabic. It presents the end-to-end Arabic speech recognition systems proposed between 2017 and 2019. It also presents the available API Services and toolkits essential for building end-to-end models. Another work that reports on the reviews of ASR systems for isolated Arabic words was proposed in 2021 by Shareef and Irhayim [15]. The authors focused on ASR systems based on artificial intelligence techniques and summarized 16 studies according to four criteria. These include speech recognition types, classification techniques, feature extraction techniques, and accuracy rates.
This paper follows up on the studies conducted on automatic speech recognition for the Arabic language, where the need for broader research on this topic was recognized. The goal is to conduct a Systematic Literature Review (SLR) of Arabic automatic speech recognition to guide researchers by providing them with the most significant studies published recently. To the best of our knowledge, this is the first systematic review that presents the landscape of Arabic ASR studies. Our goal is to highlight the progress made in the Arabic ASR field over ten years, from 2011 to 2021. This systematic literature review will also guide speech and language researchers and academics to define the significant research gaps in the field and to open perspectives for future research.
The remaining paper is organized into five sections. Section 2 presents a brief background of Arabic ASR. Section 3 describes the adopted research method in this systematic literature review. The formulated primary and secondary research questions are answered in Section 4. The conclusions are presented in the last section.

Background
Automatic speech recognition concerns the automated conversion of speech or audio waves into texts exploitable by a machine through analyzing and processing speech signals using different techniques such as Convolutional Neural Network (CNN) [16] or deep learning [17]. The design of an ASR architecture system depends on various components and tasks like preprocessing, noise detection, speech classification, and feature extraction. Figure 1 presents a generic architecture used in the development of ASR systems. Three main modules can be identified in a traditional speech recognition system [4]. The first one corresponds to speech pre-processing, which aims to remove undesirable noises from the speech signal and identify speech activity [18]. The second module concerns feature extraction, in which essential data are extracted from a speech. The third module refers to the classification, which aims to find the parameter set from memory.
The critical challenge in developing highly accurate Arabic ASR systems is selecting the feature extraction and classification techniques [19]. Mel Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Predictive (PLP) are the most common techniques used for feature extraction. In the systematic literature review presented by Nassif et al. [20], for instance, 69.5% of the retained papers used the MFCC technique to extract features from speech. A wide range of techniques can also be used for classification. Examples of these techniques are Artificial Neural Network (ANN), Hidden Markov Model (HMM), and Dynamic Time Warping (DTW).
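Among the classification techniques just listed, DTW is the simplest to illustrate: it aligns two feature sequences of different lengths by warping the time axis and accumulating a local cost. The sketch below works on 1-D sequences for clarity; a real recognizer would apply the same recurrence to multidimensional feature vectors such as MFCC frames:

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = math.inf
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],       # step in a only
                                 d[i][j - 1],       # step in b only
                                 d[i - 1][j - 1])   # step in both
    return d[n][m]

# A template utterance and a time-stretched version of it align perfectly.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
print(dtw_distance(template, stretched))  # 0.0 after warping
```

This tolerance to differences in speaking rate is precisely why DTW was popular for isolated word recognition before statistical models such as HMMs became dominant.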
The development of Arabic speech recognition systems has increased in the past decades thanks to the availability of different open-source toolkits for building and assessing ASR. These toolkits include the Hidden Markov Model Toolkit (HTK), the Carnegie Mellon University (CMU) Sphinx engine, and the KALDI Speech Recognition Toolkit. Speech recognition can be subdivided into four types [15], namely:
• Isolated word recognition, in which speakers pause momentarily between every spoken word;
• Continuous speech recognition, which allows speakers to speak almost naturally, with little or no break between words. Systems of this type are more complex than isolated word recognition and need large volumes of data to achieve excellent recognition rates;
• Connected words, which allow a minimal pause between the isolated utterances used together;
• Spontaneous speech, which remains normal-sounding and not conversational speech.
Each category of speech recognition can be further categorized according to two sub-categories, namely:
1. Speaker-dependent, in which the system relies only on the speech of a specific speaker for which it is trained [21];
2. Speaker-independent, in which the system can recognize the speech of any speaker [21].
As presented earlier, this paper aims to review and analyze the existing studies on automatic speech recognition for the Arabic language. Seven fundamental research questions are tackled to provide insight, for instance, into the used toolkits for Arabic ASR and the applied feature extraction and classification techniques. The following section presents these questions and details the adopted search method in this SLR.

Method
The systematic literature review was based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol [22]. First, the research questions are formulated, followed by the search strategy. Next, the inclusion and exclusion criteria are presented. Finally, the quality assessment and data extraction process are stated. Figure 2 illustrates the PRISMA flow chart reporting the outcomes obtained in each phase of the current systematic literature review of Arabic Speech to Text.

Research Questions (RQ)
The first step of this SLR consists of defining the Research Questions (RQ) and Secondary Research Questions (SRQ). As presented earlier, the purpose is to review the Arabic ASR studies conducted between 2011 and 2021. A total of seven RQs and three SRQs were defined to carry out a detailed review of the field. The RQs and SRQs related to these purposes include the following: RQ 1. What is the bibliographic information of the existing studies?

We started our SLR by defining the main keywords used in the related research studies and the research questions. To ensure a more comprehensive search, alternate synonyms, acronyms, and spelling variations were included for the different keywords. We divided the keywords into four categories. The Boolean operator OR was used to combine the keywords within each category. Then, the Boolean operator AND was used to combine the keywords across the categories. Table 1 presents the defined categories along with the keywords. The search string induced from the keywords in each category is as follows:
• C1: "Arabic" OR "Arabic Language" OR "Multilingual";
• C2: "Automat*" OR "Computer";
• C3: "Speech recogni*" OR "Speech trans*" OR "Speech to text" OR "Voice to text" OR "Voice recogni*" OR "SRT" OR "ASR" OR "STT";
• C4: "System" OR "Tool" OR "Technology".
The resulting string can be formulated as (C1) AND (C2) AND (C3) AND (C4). The search results were then imported into the Mendeley Reference Management Software.
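The combination rule described above (OR within a category, AND across categories) can be sketched programmatically. The keyword lists below are copied from categories C1-C4; the function itself is only an illustration of how the final query string is assembled:

```python
# Keyword categories C1-C4 as defined for the SLR search strategy.
categories = {
    "C1": ["Arabic", "Arabic Language", "Multilingual"],
    "C2": ["Automat*", "Computer"],
    "C3": ["Speech recogni*", "Speech trans*", "Speech to text",
           "Voice to text", "Voice recogni*", "SRT", "ASR", "STT"],
    "C4": ["System", "Tool", "Technology"],
}

def build_query(cats):
    """OR the keywords inside each category, then AND the categories together."""
    groups = [" OR ".join(f'"{kw}"' for kw in kws) for kws in cats.values()]
    return " AND ".join(f"({g})" for g in groups)

query = build_query(categories)
print(query)
```

In practice each database imposes its own query syntax and length limits, so the generated string still had to be adapted per database (see Table 3).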

Electronic Databases
Five electronic databases were used to collect data. These include the ACM Digital Library, ScienceDirect, IEEE Xplore, SpringerLink, and Google Scholar. Table 2 presents the electronic databases used along with their links. In this study, the search was performed over published journal papers, conference proceedings, and workshop proceedings. Table 3 illustrates the procedure for conducting the queries in each electronic database, along with some notes. Table 3. Online electronic databases.

Database: Google Scholar
Query string: allintitle: ("Arabic" OR "Multilingual") AND ("Speech recognition" OR "Speech recognizer" OR "Speech transformation" OR "Speech to text" OR "Voice to text" OR "Voice recognition") AND ("system" OR "technology")
Notes: Custom range: 2011-2021. The number of characters per query is limited, so a shorter query was applied.

Study Selection
A total of 1559 articles were retrieved using the search strings presented in Table 3. We added 9 papers from other overviews and surveys. Out of 1568 papers, 88 were duplicates and were consequently removed, leaving a total of 1480 papers. Then, the abstract, keywords, and title of each paper were checked by one author, referred to as a reviewer. A total of 236 articles were retained after this step. Next, a set of inclusion and exclusion criteria was applied to decide which research papers to review. Table 4 illustrates the adopted inclusion and exclusion criteria. By applying these criteria, the number of research papers was further reduced to 127. Table 4. Inclusion and exclusion criteria.

Inclusion criteria:
• Papers undertaking Arabic speech recognition;
• Papers focused only on spoken words in Arabic;
• Papers undertaking spoken Arabic digit recognition;
• Papers directly answering one or more of the RQs;
• Papers published between 2011 and 2021;
• Papers written in English.

Exclusion criteria:
• Paper venue is not a journal, conference, or workshop;
• Papers not available;
• Duplicate studies;
• Theoretical papers.
Next, two reviewers were involved in anonymously checking whether the papers addressed one or more of the research questions presented previously in Section 3.1. In this way, papers that did not cover the research questions related to this SLR would not be retained. In this step, the candidate papers were imported, using the RIS format, into a Web application called Rayyan [23]. This Web application supports the collaboration of the authors of systematic literature reviews by voting on papers based on the RQ/SRQ criteria.
Three voting options can be used in the Rayyan application, namely "exclude", "include", and "maybe":
• Papers with two "include" votes, or one "maybe" and one "include" vote, were retained.
• Papers with two "exclude" votes, or one "maybe" and one "exclude" vote, were eliminated from the dataset.
• Papers with two "maybe" votes, or one "include" and one "exclude" vote, were resolved through discussion. In those cases, a deciding vote on including or excluding the paper was made by the third reviewer.
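The voting rules above can be summarized as a small decision function. This is an illustrative sketch of the rules the reviewers applied, not part of Rayyan itself:

```python
def resolve(vote1: str, vote2: str) -> str:
    """Map two reviewer votes ('include', 'exclude', 'maybe') to a decision."""
    votes = {vote1, vote2}
    if votes <= {"include", "maybe"} and "include" in votes:
        return "retain"
    if votes <= {"exclude", "maybe"} and "exclude" in votes:
        return "eliminate"
    # Two "maybe" votes, or an include/exclude split: the third reviewer decides.
    return "third reviewer"

print(resolve("include", "maybe"))    # retain
print(resolve("exclude", "exclude"))  # eliminate
print(resolve("include", "exclude"))  # third reviewer
print(resolve("maybe", "maybe"))      # third reviewer
```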
Finally, the Quality Assessment (QA) of 41 candidate papers for inclusion was performed. The quality assessment of papers is described in the next section.

Quality Assessment
During this step, the quality assessment of each candidate paper for inclusion was carried out. The goal was to assess the quality and relevance of the papers' contents. In this SLR, the quality assessment procedure was based on the study of [24], and the quality assessment questions were proposed accordingly. A three-point scale was used to answer each QA question: a paper receives 1 point ("yes") if it answers the QA question, 0.5 points ("partially") if it partially addresses it, and 0 points ("no") if it does not address it. The quality of each research paper was evaluated against the QA questions. The answers to the QA questions of all the papers are presented in Table 5. The total score was calculated for each research study, and a threshold was defined: if the total score was equal to or greater than three, the study was included; otherwise, it was excluded. In this study, three papers were excluded; these papers are highlighted in blue in Table 5. At the end of this process, the final number of retained studies was 38.
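The scoring scheme can be sketched as follows. The example answers are hypothetical, and the number of QA questions is illustrative; only the point values and the threshold of three come from the procedure described above:

```python
# "yes" = 1 point, "partially" = 0.5 points, "no" = 0 points.
POINTS = {"yes": 1.0, "partially": 0.5, "no": 0.0}
THRESHOLD = 3.0  # studies scoring below this were excluded

def qa_score(answers):
    """Total QA score for one paper, given its per-question answers."""
    return sum(POINTS[a] for a in answers)

def retained(answers):
    """A paper is retained only if its total score reaches the threshold."""
    return qa_score(answers) >= THRESHOLD

# Hypothetical paper with four QA answers.
paper = ["yes", "yes", "partially", "yes"]
print(qa_score(paper), retained(paper))  # 3.5 True
```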

Results and Discussion
The 38 papers related to Arabic speech recognition constitute the dataset to be analyzed. Based on the research questions in Section 3.1, each study is analyzed using the following criteria: feature techniques, type of Arabic language, toolkit, and speech recognition type. Table 6 presents the retained papers classified according to these criteria. Table 7 presents the number of papers related to MSA and DA contributed by each country of the first author. The analysis illustrates that the first authors of seven papers are from Tunisia, with four studies dedicated to MSA and three to the Tunisian dialect. Algeria is second with five research papers, while Qatar is third with four papers. Two countries, namely Morocco and Kuwait, had three studies each; the research focus in these two countries was specifically on MSA. France, Jordan, and Malaysia had two research studies each. Lastly, eleven countries (i.e., Libya, Palestine, Saudi Arabia, Finland, Spain, Indonesia, USA, UK, Yemen, Egypt, Iraq) had one study each. Based on our analysis, most of the studies related to DA focus on the Egyptian dialect (five out of the ten studies explicitly dedicated to dialectal Arabic). It can be noted that the Egyptian dialect is ranked first among the Arabic dialects in terms of the number of speakers in the Arab world. The total number of retained studies is 38. As we can see from Figure 3, 19 studies were published in journals, 15 were presented at conferences, and 4 were presented in workshops.

RQ2. What Is the Considered Variant of Arabic in Speech Recognition Studies?
In this section, we describe the types or variants of Arabic adopted by the reviewed studies. As we can see from Figure 3, the recognition of MSA has piqued the interest of several researchers. Among the 38 reviewed studies, 34 cover MSA (89.47% of the retained studies), whereas 10 were dedicated to different dialects of Arabic (26.32% of the reviewed studies). Note that some studies can be dedicated to more than one variant of Arabic. The studies dedicated to MSA include, for example, [4,25-28,30,32,33,35,37-40,43]. Some of these studies cover MSA along with a dialectal Arabic, such as the Algerian dialect [36] (1 study out of 38), the Qatari dialect [61] (1 study out of 38), and the Egyptian dialect (4 studies out of 38) [30]. It can also be observed that some Arabic dialects, such as the Iraqi and Yemeni dialects, were not supported by any retained study. Figure 3. Distribution of Arabic speech recognition studies per publication years, types, and variants of Arabic language and dialects. Some studies support more than one variant of Arabic and can focus on multiple dialects of Arabic. Accordingly, the same study has been counted more than once, which increases the total number of studies per type of Arabic language.

RQ 3. What Are the Toolkits Most Often Used in the Arabic Speech Recognition Field?
Across the retained studies, researchers have used different toolkits to support implementing and evaluating Arabic speech recognition systems. Two studies did not mention the toolkit used [32,58]. Most of the retained studies used the KALDI, CMU Sphinx, or HTK toolkits. The study of [1] used the CMU Sphinx toolkit to evaluate the speech corpus. The author claimed that there are many technical differences between the CMU Sphinx and HTK tools: (1) CMU Sphinx has more advanced features compared to HTK; (2) the HTK toolkit is more user-friendly than CMU Sphinx; (3) CMU Sphinx often performs better than HTK, mainly in terms of accuracy rate.
According to Abushariah [1], the CMU Sphinx toolkit is particularly suitable for speaker-independent, large-vocabulary, continuous speech recognition systems. The CMU Sphinx and HTK toolkits are both suitable for training acoustic models due to their ability to implement speaker-independent, large-vocabulary, continuous speech recognition systems in any language [10,71,72]. However, as can be noticed from Table 6, the CMU Sphinx and HTK tools were also used to support isolated words in a few studies, such as [26] and [40], respectively.
Other toolkits, such as the SRILM language modeling toolkit, were used in [37]. That study was based on the KALDI speech recognition toolkit to build and evaluate acoustic models and on CNTK to train acoustic models. The KALDI toolkit was used as well in many other research studies, such as [25,34-36,61]. Most of these studies were proposed to support dialectal Arabic. Some other research efforts, such as [19,33,41,49,60], used MATLAB, a closed-source software. All these speech recognizers were dedicated mainly to MSA, particularly to the MSA isolated words category. A few of the retained studies used other toolkits, such as MADA [25] and KenLM [53]. Figure 4 illustrates the frequency of use of each toolkit according to the variant of Arabic. Some studies were based on more than one toolkit; in that case, the same study was counted more than once, which increases the total number of studies.

RQ 4. Which Datasets Were Most Often Used, and What Types of Arabic Speech Recognition Were Identified in These Datasets?
As we can see from Table 6, most of the datasets were used only once. Examples of these datasets are TunSpeech and TARIC. Only a limited number of studies adopted the same dataset. For instance, two studies, namely [27,50], were based on the Algerian Arabic speech database [63]. The Arabic multi-genre broadcast datasets were also adopted by three studies, namely [30,35,37].
Among the retained studies, two main modes of speech recognition have been supported, namely isolated words and continuous words. The following subsections describe the existing studies according to each of these modes.


Isolated Words
Within the retained studies, 19 were dedicated to the isolated words category. For this category, studies including [19,26,32,33,40,41,45,47-49,58,60] were directed towards recognizing MSA. Some other studies in the isolated words category contributed to MSA along with dialectal Arabic, such as [48], which focused on two variants of dialectal Arabic (i.e., Egyptian and Gulf).

Continuous Words
Based on the data extracted from the 38 retained studies, all the research contributions for continuous words focused on recognizing speech from broadcast news, broadcast reports, or conversations. These studies include, for example, [11,25,35,37,51,54,57,59,61], as shown in Table 6. Furthermore, there are more research contributions toward continuous Arabic words for speaker-independent recognition than for speaker-dependent recognition, which makes sense, as most continuous speech systems are dedicated to recognizing broadcast news/conversations. Among these studies, we can cite [25,59].

RQ 5. What Are the Used Feature Extraction and Classification Techniques for Arabic Speech Recognition Studies?
Results show that a wide range of feature techniques was used in Arabic speech recognition systems (see Table 6). These techniques can be classified into two categories: (1) feature extraction and (2) classification techniques. The following sub-sections present the retained studies according to these two categories.

Feature Extraction Techniques
The analysis of the retained studies shows that the MFCC acoustic feature extraction technique was the most used. A total of 24 studies out of 38 were based on the MFCC technique, which constitutes 63% of the reviewed papers. Other studies were based on alternate speech feature extraction methods, for instance, the Linear Prediction Coefficient (LPC) (8% of the reviewed studies). Table 8 illustrates the commonly used feature extraction techniques, the percentage of their use, and the studies that were based on them. As shown in this table, 10% of the studies were based on the PLP technique, and 18% were based on the LDA technique. Some researchers, such as [19,32], adopted a combination of MFCC and other feature extraction methods and achieved better accuracy results than others.

Table 9 shows the most common feature classification techniques used in the retained papers. Different studies adopt the HMM as a feature classification technique, which constituted 21% of the reviewed studies. Other research efforts were based on a hybrid of the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM), especially for continuous speech recognition. In this case, the HMM was used to model the temporal variability of speech, while the Gaussian Mixture Model was applied to define how well each HMM state fits a frame of acoustic input. As shown in Table 9, 13% of the retained studies were based on the GMM-HMM combination.

Despite the successful applications of Arabic speech recognition in different areas, there are still many gaps and limitations. One of the main limitations addressed in the studies focusing on DA is the minimal number of large datasets for dialectal speech. It is known in the speech recognition community that preparing large training datasets for DA acoustic modeling is far trickier than for MSA. It has been noticed that many of the retained studies were based on a small corpus.
Using common datasets can be a cost-effective and practical way to gradually advance the research area, since results can then be compared and improved upon. Few speech datasets are publicly available for Arabic dialects, such as the Spoken Tunisian Arabic Corpus (STAC), which consists of five transcribed hours obtained from different Tunisian radio stations and TV channels [73], and MGB-3, which emphasizes dialectal Arabic using a multi-genre collection of Egyptian YouTube videos.
The reliance on manually diacritized corpora and the lack of diacritized text remain significant limitations in many studies [38,51]. One adopted solution for the latter limitation consists of using grapheme units instead of phonemes, the natural units of speech. Additionally, some software, such as Apptek, has been used for automatic diacritization of Arabic corpora [10].
Aside from the limitations mentioned above, it should be noted that most of the selected studies focus on non-diacritized Arabic scripts instead of the diacritized version. Only 6 studies out of 38 focused on diacritized Arabic speech [38,48,51,52,54,55]. A possible explanation is that the diacritized version of Arabic scripts may decrease accuracy compared with the non-diacritized version, which prompts researchers to focus mainly on non-diacritized scripts.
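The distinction between diacritized and non-diacritized scripts is purely orthographic: Arabic short-vowel and tanwin marks occupy a small, fixed Unicode range, so a non-diacritized text can be derived mechanically from a diacritized one. A minimal sketch (the character range is the standard Unicode harakat block; its use here is illustrative, not taken from any retained study):

```python
# Arabic short-vowel and tanwin diacritics (harakat) occupy
# U+064B..U+0652; U+0670 is the superscript (dagger) alef.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}

def strip_diacritics(text):
    """Return the non-diacritized form of an Arabic string."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

# Example: fully vowelled "kataba" reduces to the bare consonants "ktb".
print(strip_diacritics("\u0643\u064E\u062A\u064E\u0628\u064E"))
```

Going in the other direction, from non-diacritized to diacritized text, is the hard problem that automatic diacritization tools attempt to solve.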
In terms of the techniques used for implementing speech recognition models, it was observed that deep learning techniques (e.g., neural networks, recurrent neural networks, deep neural networks, etc.) are effective in increasing the accuracy of ASR. However, only a limited number of studies were based on these techniques compared to those using alternative feature techniques. It is essential to highlight that deep learning techniques require large corpora to train a model. As already presented, very few large common corpora exist, which explains and is in line with the restricted number of studies using deep learning techniques. More practice-led research is needed to build larger common datasets for both MSA and dialectal Arabic.
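To make the deep learning option concrete, the sketch below shows a toy CNN + bidirectional LSTM acoustic model trained with CTC loss, in the spirit of the CNN-LSTM end-to-end architectures mentioned in the reviewed studies. It is written in PyTorch; all layer sizes, the 40-character output alphabet, and the model itself are illustrative assumptions, not a reconstruction of any retained system.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy CNN + BiLSTM acoustic model emitting per-frame character
    log-probabilities, trainable with CTC loss (sizes illustrative)."""
    def __init__(self, n_feats=13, n_chars=40, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                    # x: (batch, frames, n_feats)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)
        return self.out(h).log_softmax(-1)   # (batch, frames, n_chars + 1)

model = TinyAcousticModel()
feats = torch.randn(2, 98, 13)               # e.g. two MFCC sequences
log_probs = model(feats)                     # (2, 98, 41)
targets = torch.randint(1, 41, (2, 10))      # two 10-character transcripts
# CTCLoss expects (frames, batch, classes); blank index 0.
loss = nn.CTCLoss(blank=0)(log_probs.transpose(0, 1), targets,
                           torch.tensor([98, 98]), torch.tensor([10, 10]))
```

The appeal of this family of models is that CTC removes the need for frame-level alignments, but the many parameters involved are exactly why the large corpora discussed above become a prerequisite.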
Furthermore, many studies have reported that the pronunciation variation phenomenon represents an additional challenge for ASR systems. A possible solution consists of creating rules from data extracted from pronunciation dictionaries, which enables the generation of pronunciation variants. Another possible solution is to estimate variants from speech data. A prominent approach for modeling pronunciation is the decision tree. The study of [74], for instance, was based on this approach to generate pronunciation variants while avoiding over-generation. According to Loots and Niesler [75], a decision tree is a practical approach to producing pronunciation variants by generalizing observed pronunciations.
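A decision tree for pronunciation modeling is typically trained on phone-in-context features to predict the surface realization of a canonical phone. The sketch below, using scikit-learn, is a deliberately tiny illustration with hypothetical data (the phone inventory, the context features, and the rule that a word-final /t/ surfaces as /d/ are all invented for the example, not drawn from [74] or [75]):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each row encodes (previous phone,
# canonical phone, next phone) as integer ids; the label is the
# surface realization observed in transcribed speech.
PHONES = {"sil": 0, "t": 1, "d": 2, "a": 3, "b": 4, "n": 5}
X = [
    [PHONES["sil"], PHONES["t"], PHONES["a"]],
    [PHONES["a"],   PHONES["t"], PHONES["b"]],
    [PHONES["n"],   PHONES["t"], PHONES["sil"]],  # final /t/ realized as /d/
    [PHONES["a"],   PHONES["t"], PHONES["sil"]],  # final /t/ realized as /d/
]
y = ["t", "t", "d", "d"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Predict the realization of /t/ in an unseen word-final context.
pred = tree.predict([[PHONES["b"], PHONES["t"], PHONES["sil"]]])[0]
```

Because the tree generalizes from contexts rather than memorizing word-level variants, it can propose realizations for words never seen in the training dictionary, which is precisely the over-generation risk [74] sought to control.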

RQ 7. What Is the Performance of Arabic Speech Recognition Systems?
The performance of an ASR system is typically defined in terms of accuracy and speed: accuracy is usually rated with the word error rate, whereas speed is estimated with the real-time factor. In this section, the analysis of the studies is presented based on RQ4 and RQ5. Table 10 summarizes the performance of each of the retained Arabic ASR systems. It can be noted that excellent performance has been achieved using deep learning techniques, for instance, in the ASR systems proposed by [41,60].
Table 10. Performance of the retained Arabic speech recognition systems.

References Performance
[25] The obtained WER scored 15.81% on BR and 32.21% on broadcast conversation.
[26] The best WER for clean data was 96.2%, obtained with 256 mixtures per state. For the noisy data test, the best WER was 49.2% on average across SNR levels under babble noise, obtained with 256 mixtures.
[27] The WER was 9%.
[28] The overall system accuracy was 98%; it was enhanced by around 1%, reaching 99%, by implementing HSMM instead of standard HMM.
[30] The obtained WER was 13.2% on the MGB2 test and 37.5% on MGB3.
[31] The obtained average accuracy rate was 91.56% using MFCC, 95.34% using PLP, and 86.15% using LPC.
[32] The WER reached 0.26% when using a combination of RASTA-PLP, PCA, and FFBPNN techniques.
[34] The obtained WER was 22.6% on the test set.
The recognition rate reached 94% for system 2 and 97% for system 3.
[36] The WER was 14.02% for MSA and 89% for the Algerian dialect.
[37] The overall WER was 18.3%.
[38] The WER scored 4.68%; adding diacritics increased the WER by 0.59.
[39] The lowest average WER was 11.42% for SVM/HMM, 11.92% for MLP/HMM, and 13.42% for standard HMM.
[40] The use of HMM led to a recognition rate of 74.18%; hybridizing MLP with HMM led to 77.74%; combining SVM with HMM led to 78.06%; and hybridizing SVM with DBN achieved the best performance, 87.6%.
[41] The achieved accuracy was 98%.
[1] The average WER was 2.22% for speaker-independent recognition with the text-dependent dataset and 7.82% for speaker-independent recognition with the text-independent dataset.
[48] The system achieved an overall WER of 14%.
[49] The best recognition rate given by the system was 79% for multi-speaker recognition and 65% for independent-speaker recognition.
[50] For the region of Algiers, the ACSRS setup gave a recognition rate of 97.74% for words and 94.67% for sentences.
[51] The WER scored 76.4% for the non-diacritized text system and 63.8% for the diacritized text-based system.
[52] The obtained accuracy was 90.18%.
[53] The best performance achieved was a WER of 24.4%.
[54] The experimental results show that the non-diacritized text system scored 81.2%, while the diacritized text-based system scored 69.1%.
[55] The achieved result was a WER of 31.10% using the standard CTC-attention method; the best result, a WER of 28.48%, was obtained from the CNN-LSTM with attention model.
[56] The maximum accuracy obtained when using GFCC with CNN was 99.77%.
[57] The best WER obtained on MGB-3 using a 4-gram re-scoring strategy was 42.25% for a BLSTM system, compared to 65.44% for a DNN system.
[58] For TV commands, accuracy was over 95% for all models.
[11] The WER was much higher on the dialectal Arabic dataset, ranging from 40% to 50%. The proposed ASR system, using all five references, achieved a WER of 25.3%.
[59] The final system was a combination of five systems; its result surpassed the best single LIUM ASR system with a 9% WER reduction and surpassed the baseline MGB system provided by the organizers with a 43% WER reduction.
[19] The system accuracy reached 94.56%.
[60] The recognition rate reached 97.8% on trained data and 81.1% on non-trained data.
[61] The proposed ASR achieved a 28.9% relative reduction in WER.
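The word error rate used throughout Table 10 is defined as the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis transcript into the reference, divided by the number of reference words. A minimal implementation via dynamic-programming edit distance:

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) /
    number of reference words, via Levenshtein distance on word lists."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution/match
    return d[len(r)][len(h)] / len(r)
```

Note that because insertions are counted, the WER can exceed 100% on very poor hypotheses.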
In terms of the used datasets, it was found that the larger the speech corpus used to train the recognizer, the better the accuracy and the lower the WER. In [38], for example, the authors report that the WER decreases continuously as the corpus size increases. As presented in Section 4.6, a speech recognition system using a non-diacritized dataset can achieve better performance than one using the diacritized version.
In [38], Abed et al. examined the effect of diacritization on Arabic ASR. The authors used diacritized and non-diacritized versions of a corpus and checked how diacritics impact the word error rate. In all their results (except for a few models trained on small corpora), diacritics increased the WER. Additionally, in [51], the experimental results show that the word error rate scored 76.4% for the non-diacritized text system, while it scored 63.8% for the diacritized text-based system. It can also be noticed that better accuracy is achieved with speaker-dependent systems compared to speaker-independent systems, since the former are adapted to an individual user. According to [38], speaker-independent systems might struggle when a new user uses the ASR system. In [1], for instance, Abushariah conducted two experiments, with and without adaptation to the speakers, using different sentences. In their results, the obtained average WER was 7.64% for the speaker-dependent setting, whereas the average WER for the speaker-independent setting was 7.82%.
To summarize, the overall performance of ASR systems depends significantly on different factors, mainly the used datasets, the techniques for acoustic modeling, and the type of speech recognition. Accordingly, building a precise acoustic model using large datasets can be considered the key to suitable recognition performance.

Conclusions
This paper compiles the existing scientific knowledge about Arabic ASR studies published between 2011 and 2021 through a systematic literature review. A total of 38 conference, workshop, and journal papers were reviewed from five academic databases: Google Scholar, IEEE Xplore, Science Direct, ACM Digital Library, and Springer Link. Our results and discussion revolve around seven fundamental research questions. The purpose was to provide insight into the toolkits used for implementing and evaluating Arabic ASR, the variants of the Arabic language supported, the feature extraction and classification techniques used, the performance of Arabic ASR systems, the type of speech recognition, the existing gaps, and future research in the Arabic ASR field.
Our findings illustrate that this is still an emerging research area in which the number of studies has increased over the years. Many studies focus on MSA, whereas a relatively limited number of papers concentrate on dialectal Arabic. Some of these papers were dedicated to more than one variant of Arabic, but the reviewed studies did not cover the full range of Arabic dialects. It would be interesting to focus on DA and to use common datasets for dialectal speech to gradually advance this research area. Many toolkits were used to build and assess Arabic ASR; the most prominent ones were the KALDI, HTK, and CMU Sphinx open-source toolkits. Concerning feature extraction techniques, MFCC was the most used, followed by LDA, PLP, and LPC. The results also show that HMM was the most adopted classification technique in the reviewed studies. Different limitations were also addressed in the reviewed studies. The pronunciation variation phenomenon and the low availability of large common diacritized texts for the Arabic language can be considered significant challenges that might limit research in this field. It would be interesting, then, to focus on non-diacritized Arabic scripts and to develop larger common datasets for both MSA and dialectal Arabic.

Conflicts of Interest:
The authors declare no conflict of interest.