Correct Pronunciation Detection of the Arabic Alphabet Using Deep Learning

Abstract: Automatic speech recognition for Arabic has its unique challenges, and progress in this domain has been relatively slow. Classical Arabic, specifically, has received even less research attention. The correct pronunciation of the Arabic alphabet has significant implications for the meaning of words. In this work, we have designed learning models that classify the Arabic alphabet based on the correct pronunciation of an alphabet. Correct pronunciation classification of the Arabic alphabet is a challenging task for the research community. We divide the problem into two steps: first, we train a model to recognize an alphabet, namely Arabic alphabet classification; second, we train a model to determine the quality of its pronunciation, namely Arabic alphabet pronunciation classification. Due to the limited availability of audio data of this kind, we collected audio data from experts and novices for our models' training. To train these models, we extract pronunciation features from audio data of the Arabic alphabet using the mel-spectrogram. We have employed a deep convolutional neural network (DCNN), AlexNet with transfer learning, and a bidirectional long short-term memory (BLSTM) network, a type of recurrent neural network (RNN), for the classification of the audio data. For alphabet classification, DCNN, AlexNet, and BLSTM achieve accuracies of 95.95%, 98.41%, and 88.32%, respectively. For Arabic alphabet pronunciation classification, DCNN, AlexNet, and BLSTM achieve accuracies of 97.88%, 99.14%, and 77.71%, respectively.


Introduction
The Arabic language is one of the oldest languages and is characterized by its uniqueness and flexibility. Among the Semitic languages, Arabic is the most widely spoken, with over 290 million native speakers and 132 million non-native speakers [1]. Arabic is one of the six official languages of the United Nations (UN) [2]. Classical Arabic (CA) and modern standard Arabic (MSA) are the two main varieties of Arabic. CA is the language of the Quran, while MSA is its modified version, which is currently used in everyday communication.
Rules of pronunciation are very well-defined for CA to preserve the accurate meaning of words, and they constitute the basic building blocks that help natives as well as non-natives learn the Arabic language. The requirements for correct pronunciation are the articulation points of the alphabets, the characteristics of the alphabets, and extensive practice of the vocals [3]. This research work focuses on developing an automated system that can recognize the correct pronunciation of the Arabic alphabet. This research is an important milestone toward developing a more sophisticated system that can automatically classify words and sentences to help in teaching Classical Arabic pronunciation.
In this research, we take users' audio data, process it, and train neural networks (NN) on this data. The network learns from the data, classifies the audio, and hence provides feedback to a learner on the alphabet pronunciation.
The HMM [17,18] determines a set of states and associates them with the probabilities of transitions between these states, called a Markov chain. The GMM [19] is a probabilistic model that represents normally distributed subclasses within a class. Mixture models do not require knowing which subclass a data point belongs to, which allows the model to learn the subclasses automatically. During the past decade, a few HMM- and NN-based speech recognition systems have been demonstrated to provide higher accuracy on Classical Arabic alphabet and verse recognition tasks. CMU Sphinx is one of the well-known open-source tools for CA based on HMM [22][23][24][25]. Researchers have worked on different tasks using HMM, such as the 'E-Hafiz system', which was proposed for CA learning using HMM and MFCC as a feature learning technique [26]. This system achieved an accuracy of 92% for men and 90% for children. In [27], the proposed system helps to improve the pronunciation of alphabets using the mean square error (MSE) for pattern matching and MFCC as a feature extraction technique. This system successfully recognized the correct pronunciation of various alphabets. In [28], the 'Tajweed checking system' demonstrates detection and correction of students' mistakes during recitations using MFCC and vector quantization (VQ), with an accuracy of 82.1-95%. In [29], a 'Qalqalah letter pronunciation' technique is proposed using the spectrogram; this technique illustrates the mechanism of Qalqalah sound pronunciation. In [30], a 'mispronunciation detection system for Qalqalah letters' is proposed using MFCC and a support vector machine (SVM) classifier, which provides an accuracy of 97.5%.
Deep learning (DL) algorithms learn a hierarchical representation from data with numerous layers [31]. The hidden layers are responsible for extracting important features from the raw data to achieve a better representation of the audio data. In [32], the author proposes an Arabic alphabet recognition model using an RNN with back-propagation through time, with an accuracy of 82.3% tested on 20 alphabets. In recent research [33], the authors demonstrated a mispronunciation detection system using different handcrafted techniques for feature extraction and SVM, KNN, and NN as classifiers. This experiment achieved an accuracy of 74.37% for KNN, 83.90% for SVM, and 90.1% for NN.
In this paper, we propose DL algorithms, i.e., DCNN, AlexNet, and BLSTM neural networks, for Arabic alphabet classification. Our research differs from previous work in terms of the dataset, feature extraction, network architecture, and performance. We employ the mel-spectrogram for extracting features from the audio dataset of alphabets. The mel-spectrogram is the conversion of audio frames into a frequency-domain representation, scaled on an equally spaced mel scale. The magnitude or power spectrum passes through the mel filterbank to obtain the mel-spectrogram. Previous approaches mostly use MFCC, which is related to the mel-spectrogram: MFCC coefficients are obtained by passing a mel spectrum through a logarithmic scale and then a discrete cosine transform (DCT). Due to the extensive use of DL in speech systems, the DCT is no longer a necessary step [34].
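The relationship between the mel-spectrogram and MFCCs described above can be sketched in a few lines of NumPy. This is a minimal illustration only; filterbank edge handling and normalization differ across libraries such as librosa, and the frame and filter sizes here are assumed, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(signal, sr, n_fft=512, hop=256, n_mels=40):
    """Power spectrogram of windowed frames passed through the mel filterbank."""
    frames = [signal[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return mel_filterbank(n_mels, n_fft, sr) @ power.T   # (n_mels, n_frames)

def mfcc(mel_spec, n_coeffs=13):
    """MFCCs: log of the mel spectrum followed by a DCT-II over the mel axis."""
    log_mel = np.log(mel_spec + 1e-10)
    n = log_mel.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n))
    return basis @ log_mel

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
sig = np.sin(2 * np.pi * 440 * t)        # stand-in for an alphabet recording
m = mel_spectrogram(sig, sr)             # input to the CNN models
c = mfcc(m)                              # the extra DCT step used by MFCC
```

The final `mfcc` step is exactly the part that, per [34], DL-based systems can drop: the networks consume `m` directly.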
We work on two classification problems using an audio dataset of the Arabic alphabet. The first problem is a multi-class classification task that detects and classifies the alphabets into their respective classes. The second problem is a binary classification task that detects and classifies correct and incorrect pronunciations into their respective classes. Among the models used in this research, the CNN models learn features from the mel-spectrogram, while the BLSTM learns from handcrafted spectral features. In this paper, our major contributions are: 1. Collection of an audio dataset for the Arabic alphabet with correct pronunciation and mispronunciation. 2. Arabic alphabet classification (recognizing each alphabet). 3. Arabic alphabet pronunciation classification (detecting the correct pronunciation of an alphabet). 4. Exploration of DCNN, AlexNet, and BLSTM to classify the audio set of the Arabic alphabet.
The Arabic language has 29 alphabets, and we consider each alphabet as a class. Arabic alphabet classification is a multi-class classification problem over the Arabic alphabet audio dataset, and the classification task recognizes each alphabet's class. The audio file with the alphabet sequence shown in Figure 1 is fed into the NN. The network learns and extracts the feature set of each alphabet based on its characteristics. The classifier then evaluates the alphabets and assigns them to their respective classes. On the other hand, Arabic alphabet pronunciation classification is a binary classification problem. In this task, our focus is to detect the correct pronunciation of an alphabet. The network learns the characteristics of the dataset and classifies samples into correct-pronunciation and mispronunciation classes. The organization of the rest of the paper is as follows. Section 2 explains the collection and preprocessing of the data. Section 3 presents the proposed methodology and DL classification models. In Section 4, we present experimental results and their comparative analysis. Section 5 concludes our work.

Data Collection and Preprocessing
In this section, we present the data collection and the preprocessing techniques applied to the dataset. These techniques can have a significant impact on the training of the learning model [35]. The collected audio samples of the Arabic alphabet may contain noise and background speech, which cause distortion and can affect the decision of the classifier [36]. The preprocessing reduces the noise and background speech in the collected samples.

Noise Reduction
Several algorithms and applications are available for speech enhancement. We performed noise suppression using spectral subtraction and voice activity detection [37] over the noisy audio samples. We obtained spectral estimates of the background noise from the input signal. Figure 2 demonstrates the performance of this method; Figure 2a shows the time waveform of an alphabet with background noise, whereas Figure 2b shows the time waveform of the clean alphabet.

Audio Segmentation and Silence Removal
Silence is an unvoiced part of a speech signal; it is useful in detecting pauses between speech, but most of the time it is useless because it makes extraction of the actual information difficult [36]. We adjusted each clip to have minimal silence, because silence makes it difficult for the network to classify an alphabet accurately due to the presence of useless information. Recorded data files consist of 29 letters, each separated by silence, which is useful in the segmentation of a large file. We implemented a speech detection algorithm based on [38] over the audio dataset. The algorithm detects the boundaries of the speech and discards the silence at the beginning and end of the speech. It transforms the audio signal into a time-frequency representation with a specified 'Window' and 'OverlapLength' (the number of samples overlapping between adjacent windows). For each frame, it calculates the short-term energy and spectral spread and then creates their histograms.
The spectral spread and short-term energy are smoothed over time by passing them through successive moving median filters to alleviate spikes left after noise removal, and they are compared with their respective thresholds to create masks. The masks are combined, and speech regions within 'MergeDistance' (the number of samples over which positive speech detections are merged) are joined to declare a frame as speech. Figure 3 shows the detected speech levels, discarding the silence between them. Later, we save these speech segments in separate audio files so that we can use them in the audio classification task.
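The steps above can be sketched as a simplified detector in NumPy. Note the hedges: the thresholds in [38] are derived from histograms of the two features, whereas this sketch substitutes simple heuristic thresholds, and `merge_dist` is in frames rather than samples.

```python
import numpy as np

def detect_speech(sig, sr, n_fft=512, hop=256, merge_dist=4, med_len=5):
    """Simplified speech detector: per-frame short-term energy and spectral
    spread, moving-median smoothing, thresholding, and merging of nearby
    speech regions. Thresholds are heuristic stand-ins for the
    histogram-based ones of [38]."""
    win = np.hanning(n_fft)
    starts = np.arange(0, len(sig) - n_fft, hop)
    frames = np.array([sig[s:s + n_fft] * win for s in starts])
    energy = np.sum(frames ** 2, axis=1)                     # short-term energy
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    centroid = (mag * freqs).sum(1) / np.maximum(mag.sum(1), 1e-12)
    spread = np.sqrt(((freqs - centroid[:, None]) ** 2 * mag).sum(1)
                     / np.maximum(mag.sum(1), 1e-12))        # spectral spread

    def med(x):                                              # moving median filter
        pad = med_len // 2
        xp = np.pad(x, pad, mode='edge')
        return np.array([np.median(xp[i:i + med_len]) for i in range(len(x))])

    e_mask = med(energy) > 0.1 * energy.max()                # energetic frames
    s_mask = med(spread) < 0.5 * freqs.max()                 # speech has low spread
    mask = e_mask & s_mask                                   # combined mask
    # merge positive frames closer than merge_dist into regions
    regions = []
    for f in np.flatnonzero(mask):
        if regions and f - regions[-1][1] <= merge_dist:
            regions[-1][1] = f
        else:
            regions.append([f, f])
    # frame indices -> sample boundaries, ready to save as separate files
    return [(int(starts[a]), int(starts[b]) + n_fft) for a, b in regions]

sr = 16000
sig = np.zeros(sr)
t = np.arange(4000) / sr
sig[6000:10000] = np.sin(2 * np.pi * 400 * t)    # one "letter" amid silence
regions = detect_speech(sig, sr)
```

Each returned `(start, end)` pair corresponds to one speech segment that would be written to its own audio file.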

Data Augmentation
Data augmentation is a familiar ML strategy, and we use it to increase the data quantity [39]. We enhanced the data by modifying the existing source data; we augmented about 20 samples of each alphabet from the existing dataset. For audio data augmentation, we used a pitch variation factor to retain the originality of the audio dataset with minimal effect on the pronunciation. We found this technique suitable for this work after cross-checking the audio files audibly. It was also the only technique that did not have any negative impact on the Arabic audio dataset. We obtained 6 modified samples from each alphabet by varying the pitch between −0.3 and 0.3. Figure 4a shows the time-frequency representation without augmentation, and Figure 4b shows the time-frequency representation with augmentation. This augmented dataset was cross-checked by an expert to ensure the pronunciation of the alphabet was not compromised during augmentation.
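The pitch-variation augmentation can be illustrated with a naive NumPy resampling sketch. Two caveats: unlike the phase-vocoder pitch shifting in common audio toolboxes, resampling also changes the duration by the same factor; and the six-step factor grid below is an assumed example, since the paper only states that the pitch varies between −0.3 and 0.3.

```python
import numpy as np

def pitch_shift_naive(sig, semitones):
    """Naive pitch shift by resampling with linear interpolation.
    Shifting up by `semitones` also shortens the clip by the same
    factor; shown only to illustrate small pitch-variation factors."""
    rate = 2.0 ** (semitones / 12.0)
    idx = np.arange(0, len(sig), rate)          # fractional read positions
    return np.interp(idx, np.arange(len(sig)), sig)

def augment(sig, factors=(-0.3, -0.2, -0.1, 0.1, 0.2, 0.3)):
    """Six augmented copies per sample, pitch varied between -0.3 and 0.3
    (the intermediate factor values here are an assumption)."""
    return [pitch_shift_naive(sig, f) for f in factors]

sr = 16000
t = np.arange(sr) / sr
sample = np.sin(2 * np.pi * 220 * t)            # stand-in alphabet recording
copies = augment(sample)
# dominant frequency of the +0.3-semitone copy, for inspection
peak_hz = np.fft.rfftfreq(len(copies[-1]), 1 / sr)[
    np.argmax(np.abs(np.fft.rfft(copies[-1])))]
```

With a ±0.3-semitone range, the frequency shift is under 2%, which is consistent with the authors' observation that the pronunciation is audibly preserved.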

Methodology
This section presents the methodology and different stages of this research work. This research work consists of five stages: data collection, preprocessing, feature extraction, network training, and classification of unseen data. These stages are described through the system architecture as shown in Figure 5.
The first and second stages of the proposed methodology involve the collection of the dataset and preprocessing; we have already discussed these two stages in the previous section. The third stage involves feature extraction: the features are extracted from the raw data and input to the fourth stage for training the network using deep learning models. In the end, we evaluate the trained models on unseen testing data. Then we estimate the accuracy and display the confusion charts for each class.

Feature Extraction
In ASR systems, we extract a feature set from the speech signals. The classification algorithm operates on the feature set instead of on the speech signals directly. Feature extraction provides a compact representation of the speech waveforms. Classification-oriented feature extraction reduces redundancy and removes irrelevant information in large datasets [40]. A large feature set requires huge memory and computational power and can lead to over-fitting.
A CNN extracts features autonomously from the mel-spectrogram computed from the raw audio data [41]. We have performed this conversion only for the CNNs (DCNN and AlexNet), as a CNN takes an input image, processes it, and then classifies it into different categories. The CNN extracts and filters an enormous number of features to obtain the features useful for the classification of the audio alphabet. In this work, we use the filtered features from the FC layer.
On the other hand, BLSTM needs assistance in extracting features. For the BLSTM network, we extract information from the given dataset as spectral features computed from the raw audio data. The extracted data are stored and later given as input to the BLSTM network for training, testing, and evaluation of the audio alphabet [42]. Initially, we used the mel-spectrum with BLSTM, but the results were not promising, so we opted for handcrafted features. We extract 12 spectral features from the raw data: spectral centroid, spectral spread, spectral skewness, spectral kurtosis, spectral entropy, spectral flatness, spectral crest, spectral flux, spectral slope, spectral decrease, spectral roll-off point, and pitch. These features are widely used in machine learning, deep learning applications, and perceptual analysis. We use these features to differentiate the notes, pitch, rhythm, and melody of speech.
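A hand-rolled NumPy sketch of a subset of these spectral features follows (centroid, spread, skewness, kurtosis, flatness, crest, and roll-off). The formulas follow the usual magnitude-spectrum definitions; library implementations may normalize differently, and the frame parameters are assumed.

```python
import numpy as np

def spectral_features(sig, sr, n_fft=512, hop=256):
    """Per-frame spectral features from the magnitude spectrum:
    centroid, spread, skewness, kurtosis, flatness, crest, roll-off."""
    win = np.hanning(n_fft)
    starts = range(0, len(sig) - n_fft, hop)
    mag = np.abs(np.fft.rfft([sig[s:s + n_fft] * win for s in starts], axis=1))
    f = np.fft.rfftfreq(n_fft, 1 / sr)
    p = mag / np.maximum(mag.sum(1, keepdims=True), 1e-12)  # per-frame weights
    centroid = (p * f).sum(1)                               # spectral "mean"
    dev = f - centroid[:, None]
    spread = np.sqrt((p * dev ** 2).sum(1))                 # spectral "std"
    skew = (p * dev ** 3).sum(1) / np.maximum(spread, 1e-12) ** 3
    kurt = (p * dev ** 4).sum(1) / np.maximum(spread, 1e-12) ** 4
    flatness = (np.exp(np.log(mag + 1e-12).mean(1))         # geometric mean /
                / np.maximum(mag.mean(1), 1e-12))           # arithmetic mean
    crest = mag.max(1) / np.maximum(mag.mean(1), 1e-12)
    # roll-off: frequency below which 85% of the spectral energy lies
    cum = np.cumsum(mag ** 2, axis=1)
    rolloff = f[np.argmax(cum >= 0.85 * cum[:, -1:], axis=1)]
    return np.stack([centroid, spread, skew, kurt, flatness, crest, rolloff], 1)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)        # stand-in audio
feats = spectral_features(tone, sr)        # (n_frames, n_features)
```

Stacked per frame like this, the feature matrix forms the time-step sequence fed to the BLSTM.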

Neural Network Model Training
The development of a learning model requires training data that provides observations paired with inputs. The network captures the meaning of these observations in its output. The neural network learns a mapping function to find an optimal set of model parameters. We tested different network parameters, and after an empirical analysis, the parameter values shown in Table 1 were used.

Deep Learning Models for Classification
DL comprises a vast range of models and several associated algorithms. The dataset and the type of task performed play a significant role in selecting a model. The audio alphabet dataset is trained and tested using deep learning models to achieve better accuracy and a minimal loss. Pre-trained models for the Quranic dataset are not available, so we trained the algorithms from scratch and by fine-tuning existing models. Two types of models are selected for the classification of audio alphabets:

Convolution Neural Network
A convolution neural network [44] consists of independent filters and is used for image data, classification prediction problems, and regression prediction problems; due to its deep structure, it is also called a DCNN. The number of features depends on the number of filters, and the network extracts features from the mel-spectrogram of the raw data (wav file). Each convolution layer learns features from the mel-spectrogram, and the remaining layers process the useful information from the learned features. This network consists of the 24-layer DCNN architecture given in Figure 6. We have used Algorithm 1 for the DCNN.

The transfer learning (TL) [45,46] technique is used in ML and its sub-field, DL. A method designed for one task can be reused as a starting point for a new, related task. Pre-trained networks are used as a starting point in new research, as these networks save the vast computation and time resources required to design a network. There are two ways to use transfer learning: first, by extracting features using a pre-trained network and then training the network model; second, by fine-tuning the pre-trained network, keeping the learned weights as initial parameters. Fine-tuning is used when we employ an NN that has been designed and trained by someone else. It allows us to take advantage of the network without having to develop it from scratch. Therefore, we rely on the second method.
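A scaled-down PyTorch stand-in for a DCNN of this kind is sketched below. The exact 24-layer list is given in Figure 6; the filter counts and input size here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Conv -> BatchNorm -> ReLU -> MaxPool blocks followed by an FC head,
# the typical structure of a small spectrogram classifier.
dcnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 29),                 # FC layer -> 29 alphabet classes
)

# A batch of 4 single-channel mel-spectrograms (mel bands x time frames)
x = torch.randn(4, 1, 64, 61)
logits = dcnn(x)                       # one score per alphabet class
```

Training from scratch then proceeds with a standard cross-entropy loss over the 29 classes (or 2 classes for the pronunciation task).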
AlexNet is trained on millions of images from the ImageNet database [44]. It is trained on 1000 categories and is enriched with a wide range of feature representations. The standard input size of this network is 227 × 227 × 3 (for a spectrogram image, 227 represents the number of time frames, 227 the number of frequency bands, and 3 the channels of the spectrum image), and the network consists of 25 layers, shown in Figure 7. We convert the raw data to the mel-spectrogram because AlexNet is trained on the ImageNet image dataset. The mel-spectrograms are resized according to AlexNet's standard input size and input to the model. The standard AlexNet is trained on 1000 categories, whereas this work consists of 29 classes for the alphabet classification problem and 2 classes for the alphabet pronunciation classification problem. Therefore, to use the pre-trained AlexNet, we have replaced the 3 final layers of AlexNet, namely the fully connected layer (Fc8), the SoftMax layer, and the classification (output) layer. We have fine-tuned them according to our classification problems. After this, the network extracts features from the mel-spectrogram autonomously and learns from the dataset. We need to specify the output size of the fully connected layer according to the number of classes in our data. Other parameters are set after empirical analysis and network training. The test dataset is compared to the training dataset to observe the performance of the network. We have used Algorithm 2 for the AlexNet pre-trained network model.

Recurrent Neural Network
The recurrent neural network is designed to work with sequence prediction problems. The long short-term memory (LSTM) network is a special kind of RNN that is skilled at learning long-term dependencies, helping the RNN remember long-term information that would otherwise be lost during training. RNNs and LSTMs have achieved a high success rate when working with sequences of words and paragraphs. This includes both sequences of text and spoken words represented as time series. RNNs are mostly used for text data, speech data, classification prediction problems, regression prediction problems, and generative models. The LSTM is a unidirectional network that learns only the forward sequence, as it can only see the past, whereas the BLSTM predicts both backward and forward sequences: one from past to future and the other from future to past [47]. We have used the BLSTM for the training of our dataset; the architecture is shown in Figure 8. Spectral feature extraction is the approach most commonly used to prepare audio sequences as input for RNN networks. We have used Algorithm 3.
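A minimal PyTorch sketch of such a BLSTM classifier is given below. The input at each time step is the 12-dimensional spectral feature vector described in the feature extraction section; the hidden size and the choice of classifying from the last time step are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """BLSTM audio classifier: a bidirectional LSTM over per-frame
    spectral features, followed by a fully connected output layer."""
    def __init__(self, n_features=12, hidden=100, n_classes=29):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, batch_first=True,
                             bidirectional=True)    # forward + backward pass
        self.fc = nn.Linear(2 * hidden, n_classes)  # both directions concatenated

    def forward(self, x):               # x: (batch, time, features)
        out, _ = self.blstm(x)          # out: (batch, time, 2 * hidden)
        return self.fc(out[:, -1, :])   # classify from the final time step

model = BLSTMClassifier()
x = torch.randn(4, 61, 12)              # 4 clips, 61 frames, 12 spectral features
logits = model(x)                       # (4, n_classes)
```

The `2 * hidden` input to the output layer reflects the concatenation of the past-to-future and future-to-past passes that distinguishes the BLSTM from a plain LSTM.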

Classification of Unseen Data
Partitioning the data is essential for the training and evaluation of machine learning models. The dataset is usually divided into two non-overlapping groups: training data and testing data. The training data are used for modeling and feature set development. The test data are used to measure the model's performance. We have divided our dataset into 80% training data and 20% testing data.
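The 80/20 split can be sketched as a shuffled index partition (the augmented-dataset size of 3480 files is taken from the results section; the seed is arbitrary):

```python
import numpy as np

def train_test_split(n_samples, test_frac=0.2, seed=0):
    """Shuffle sample indices and hold out the first test_frac for testing,
    guaranteeing the two groups are non-overlapping."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    return idx[n_test:], idx[:n_test]          # train indices, test indices

train_idx, test_idx = train_test_split(3480)   # 80% train, 20% test
```

Splitting by index (rather than copying arrays) keeps the same partition reusable for both the feature matrices and the label vectors.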

Results and Discussion
We collected 8 audio samples from the web (see the data availability section for the dataset) and recorded 20 audio samples from native and non-native speakers at a sampling frequency of 44.1 kHz. We used the 16-bit pulse-code modulated (PCM) raw format to collect the audio samples. We collected 2 audio samples (per alphabet) from 11 male experts and 9 children (boys).
The dataset consists of speech samples from male subjects, including 19 adults and 9 children. The source data comprise 48 speech samples, or 1392 files, while the augmented data comprise 3480 files. The description is given in Table 2. A binary classification task is performed on correct versus non-correct pronunciation of the speech samples. We collected 20 samples from non-expert adults (male). The source data consist of 580 correct and 580 non-correct data samples. After augmentation, we have 140 samples per alphabet: 20 source samples and 120 augmented samples per alphabet. This experiment needs an equal number of samples of correct versus non-correct pronunciation, so we selected 20 source samples and 120 augmented samples per alphabet from the adult speakers. The description of the dataset is given in Table 3.

Confusion Matrix
The confusion matrix is a performance measure for classification models [48]; it consists of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) measures. The accuracy (ACC) of the model in terms of the aforementioned measures is:

ACC = (TP + TN) / (TP + TN + FP + FN)

Here, TP is the number of alphabets that are positive (current class) and predicted correctly as positive, TN is the number of alphabets that are negative (classes other than the current class) and predicted correctly as negative, FP is the number of alphabets that are negative but predicted incorrectly as positive, and FN is the number of alphabets that are positive but predicted incorrectly as negative.
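These definitions translate directly into a few lines of NumPy; the toy labels below are invented purely to exercise the one-vs-rest accuracy computation.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: actual class; columns: predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm, c):
    """ACC = (TP + TN) / (TP + TN + FP + FN) for class c, one-vs-rest."""
    tp = cm[c, c]
    fn = cm[c].sum() - tp        # actually c, predicted as something else
    fp = cm[:, c].sum() - tp     # predicted c, actually something else
    tn = cm.sum() - tp - fn - fp # everything else
    return (tp + tn) / cm.sum()

# Toy example with 3 classes and 6 samples (illustrative labels)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
acc0 = per_class_accuracy(cm, 0)
```

The diagonal of `cm` holds the correctly classified counts and the off-diagonal cells the misclassifications, matching how the confusion charts in the results section are read.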

Validation Strategies
The validation step helps us find the optimal parameters for our model while preventing it from overfitting. Two of the best-known validation strategies are: 1. The hold-out strategy. 2. The K-fold strategy.
In hold-out validation [49], the dataset is divided into two non-overlapping sets: a training and a testing dataset. The test dataset is held out while training the network model. This prevents overlap and gives a more accurate and generalized estimate of an algorithm's performance. It also reduces the computational cost because it only needs to be run once. The drawback of this procedure is that it does not use all the available data, and the results are highly dependent on the choice of the training/testing split.
Cross-validation [49] is a very powerful re-sampling technique used to evaluate ML models on limited data. It has a single parameter, 'K', the number of groups into which the dataset is divided; hence, this method is also referred to as K-fold cross-validation. We have performed 5-fold cross-validation, so all the available data can be used for validation.
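The K-fold scheme can be sketched as a small index generator (a minimal sketch; library helpers such as scikit-learn's `KFold` add stratification and other options):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Split shuffled indices into k folds; each fold serves as the test
    set exactly once, so all data is eventually used for validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Five (train, test) index pairs for a toy dataset of 100 samples
splits = list(k_fold_indices(100, k=5))
```

Averaging the model's accuracy over the five test folds yields the 5-fold CV figures reported in the results tables.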

Arabic Alphabet Classification
In this section, we present the results on the dataset before and after data augmentation for Arabic alphabet classification and Arabic alphabet pronunciation classification using the deep learning network models. The section concludes with a brief comparative discussion of the results of the deep learning models DCNN, AlexNet, and BLSTM, respectively.
For alphabet classification without data augmentation, the accuracy of the DCNN model is 65.89% with random splitting, 64.56% with hold-out validation, and 64.03% with 5-fold CV. We trained the model 5 times by altering the neighboring sequence and then averaging the accuracy. Using the TL approach, we achieved an accuracy of 78.03% with random splitting, 78.73% with hold-out validation, and 79.15% with CV. BLSTM achieved 53.18% accuracy with random splitting, 52.62% with hold-out validation, and 53.17% after alternating sequences using 5-fold CV, as shown in Table 4.
For alphabet classification with data augmentation, the accuracy of the DCNN model is 95.95% with random splitting, 93.32% with hold-out validation, and 93.46% with 5-fold CV. The accuracy of the pre-trained AlexNet is 90.91% with an SVM classifier, which increased up to 98.41% using the Adam optimizer with random splitting, 96.72% with hold-out validation, and 96.36% with 5-fold CV. With BLSTM, we achieve an accuracy of 87.90% with random splitting, 88.38% with hold-out validation, and 89.95% with the CV experiment, as shown in Table 4. In Table 5, the values of the mean, standard deviation (SD), and standard mean error (SME) are shown for the DCNN, AlexNet, and BLSTM networks. In the following, we discuss in detail the DCNN, AlexNet, and BLSTM networks trained over the dataset without data augmentation.
DCNN: The DCNN without data augmentation has an accuracy of 65.89%. The alphabets 'jeem', 'saud', and 'wao' have 100% accuracy, 10 alphabets have an accuracy above 70%, another 10 alphabets have an accuracy above 50%, and the remaining alphabets have a very low accuracy rate. The alphabet 'sa' is confused with 'hha', 'saud', 'fa', and 'ya', due to which these samples are misclassified.
BLSTM: Using the BLSTM network, we achieved an accuracy of 53.18%, which is comparatively less than the other networks (DCNN and AlexNet). 'Alif' has 100% accuracy, only 2 alphabets have 80% accuracy, and another 10 alphabets have an accuracy between 60% and 70%. The remaining alphabets, i.e., 'sa' and 'fa', have the lowest accuracy of 16.7% because the network confuses 'sa' with 'fa', 'ta', and 'ha'. The network also confuses 'fa' with 'sa', 'za', and 'ha'.

With Data Augmentation
In this section, we discuss in detail the results of the DCNN, AlexNet, and BLSTM networks trained over the whole dataset, including the source data and the augmented data.
DCNN: The columns and rows represent the predicted class and the actual class, respectively. The diagonal cells show the number of correctly classified observations, and the off-diagonal cells show the number of incorrect observations. The alphabets 'alif', 'ba', 'zhal', 'za', 'seen', 'sheen', 'saud', 'aain', 'ghain', 'kaaf', 'meem', and 'hamzah' have 100% accuracy. Only the observations of the misclassified alphabets are shown in Table 6 (DCNN with data augmentation). The overall accuracy of the DCNN is 95.95%. In cross-validation, we applied a 5-fold validation experiment on the DCNN model, cross-checking and validating the non-overlapping data k times. The alphabets 'alif', 'zua', 'qauf', 'wao', and 'hamzah' have 100% accuracy. The accuracy is 93.46% with the train and test datasets split into 80% and 20%.
Using a 5-fold CV experiment with the BLSTM network, we achieved an accuracy of 87.95%. 'Kha' is the only alphabet classified with 100% accuracy.

Arabic Alphabet Pronunciation Classification
For Arabic alphabet pronunciation classification without data augmentation, the accuracy of the DCNN model is 96.41% with random splitting, 95.46% with hold-out validation, and 96.37% with the 5-fold CV experiment. The AlexNet transfer learning approach achieved accuracies of 97.12% with random splitting, 96.89% with hold-out validation, and 96.55% with CV. The BLSTM achieved 72.41% accuracy with random splitting, 73.54% with hold-out validation, and 74.35% with 5-fold CV, as shown in Table 8.
For Arabic alphabet pronunciation classification with data augmentation, using the DCNN model we achieved an accuracy of 97.88% with the random split method, 95.28% with hold-out validation, and 96.24% with 5-fold CV. The AlexNet transfer learning model achieved an accuracy of 99.14% with random splitting, 97.42% with hold-out validation, and 98.43% with CV. The BLSTM model achieved an accuracy of 77.71% with random splitting, 76.12% with hold-out validation, and 78.17% with 5-fold CV. In Table 9, the values of the mean, standard deviation (SD), and standard mean error (SME) are shown for DCNN, AlexNet, and BLSTM. The confusion matrix of the Arabic alphabet pronunciation classification shows the numbers of correctly classified and misclassified samples. The accuracy of these samples can be seen in Table 10, which shows the confusion matrices of the DCNN, AlexNet, and BLSTM models for the dataset without data augmentation. AlexNet has a lower error rate than DCNN and BLSTM.

With Data Augmentation
The dataset is split into 80% training data and 20% test data. This task has only two classes, so there is less interference between classes. As shown in Table 11, both classes in DCNN have an accuracy of more than 97%, with 11 samples of the correct-pronunciation class and 24 of the mispronunciation class confused. AlexNet has fewer misclassified samples compared to the DCNN and BLSTM networks and provides the highest accuracy in pronunciation classification.

Discussion
The classification models' performance with data augmentation outperformed their performance without data augmentation. Data augmentation increases the performance of the network by reducing overfitting and improving accuracy, as can be seen in Figure 9. In this figure, the left bar represents DCNN, the middle one AlexNet, and the right one the BLSTM network, tested with two validation techniques, 5-fold CV and hold-out validation, on the dataset with and without data augmentation. Figure 9a shows the alphabet classification results with and without data augmentation, and Figure 9b shows the pronunciation classification results with and without data augmentation. The results in Figure 9b are close because this task has only two classes; with a larger number of classes, the difference might be seen more clearly.
We can also see from the results given in the previous sections that AlexNet with transfer learning outperformed the other networks, although its architecture is similar to, but deeper than, the DCNN network. Because AlexNet is trained with 60 million parameters, and its spectrograms are augmented by mirroring and cropping the images, the variation in the training dataset increases. It uses overlapped pooling layers after some convolution layers, which improves the error rate compared to the other networks. DCNN is the second network in the lead; we constructed a small DCNN network as an array of layers, as seen in Figure 6, and then trained it from scratch. Due to over-fitting during training, its error rate is higher than that of the AlexNet network. The third network, BLSTM, has a lower accuracy than the other networks because it cannot extract features on its own, while both the DCNN and AlexNet networks extract numerous features on their own from the spectrograms. When using BLSTM, we first extract a fixed number of features and then train the BLSTM on them, which is why its results are less satisfactory than those of the other networks. The comparison of the networks with and without data augmentation for alphabet classification and alphabet pronunciation classification can be seen in Figure 9.

Conclusions
In this paper, we have proposed a framework for CA speech recognition using deep learning techniques, including DCNN, AlexNet, and BLSTM. We implemented these learning models and demonstrated their results on the Arabic alphabet audio dataset. Several experiments were performed using three different validation techniques: random splitting, 5-fold cross-validation, and hold-out validation. AlexNet outperformed the DCNN and BLSTM in the classification tasks. We performed two tasks, i.e., Arabic alphabet classification and Arabic alphabet pronunciation classification, using augmented and non-augmented datasets, and we achieved more promising results with data augmentation.
The first part of this research is Arabic alphabet classification, which is successfully performed by using AlexNet and yielded an accuracy of 98.41% with data augmentation.
The second part of this research is the Arabic alphabet pronunciation classification using the AlexNet model, with which we achieved an accuracy of 99.14% with data augmentation. As future work, we would like to extend the proposed method to incorporate more feature sets and increase the size of the dataset for word and sentence recognition. We would further like to investigate some newer network architectures, i.e., Xception, Inception, ResNet, and NASNet.

Institutional Review Board Statement: The study has been performed as per the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards, and it was approved by the Institutional Review Board (or Ethics Committee) of Quaid-i-Azam University Islamabad, Pakistan (7 July 2020).

Informed Consent Statement:
Informed consent was obtained from all individual participants included in this research work, and from the guardians of minor participants, while collecting the audio dataset.