AUDD: Audio Urdu Digits Dataset for Automatic Audio Urdu Digit Recognition

The ongoing development of audio datasets for numerous languages has spurred research activities towards designing smart speech recognition systems. A typical speech recognition system can be applied in many emerging applications, such as smartphone dialing, airline reservations, and automatic wheelchairs, among others. Urdu is the national language of Pakistan and is also widely spoken in many other South Asian countries (e.g., India, Afghanistan). Therefore, we present a comprehensive dataset of spoken Urdu digits ranging from 0 to 9. Our dataset has 25,518 sound samples collected from 740 participants. To test the proposed dataset, we apply different existing classification algorithms to it, including Support Vector Machine (SVM), Multilayer Perceptron (MLP), and flavors of EfficientNet. These algorithms serve as a baseline. Furthermore, we propose a convolutional neural network (CNN) for audio digit classification. We conduct experiments using these networks, and the results show that the proposed CNN is efficient and outperforms the baseline algorithms in terms of classification accuracy.


Introduction
Deep learning has been successful in multiple domains, including image classification [1-3], text classification [4,5], speech recognition [6-8], and many more [9]. In speech recognition, a tremendous amount of work has been done for multiple languages, ranging from speech recognition to speech mimicry. However, for all languages, data have been a key challenge for any deep learning task: data are difficult to find, and data collection is a time-consuming and tedious task. Without data, research and development for a language is not possible. To continue the robust and reliable development of speech recognition, it is necessary to have publicly available data so that researchers can continue to develop speech-related systems. However, for Urdu audio digit classification, there is no publicly available dataset. Motivated by the need for audio digit recognition, we release an Urdu audio digit dataset. Typically, a framework for automatic speech recognition [10] in Urdu can assist voice recognition in similar languages [11].
A few works [12,13] have considered speech processing schemes for the Urdu language. The field of computer science concerned with the interaction between computers and natural languages is known as Natural Language Processing (NLP). This work is related to Automatic Speech Recognition (ASR), an advanced research domain of NLP [12]. ASR is the transformation of an acoustic waveform into the text equivalent of the information conveyed by the spoken words. The main objective of ASR research is to enable computers to recognize continuous speech with an unlimited vocabulary. This aim remains unfulfilled as yet, although substantial progress has been achieved. Artificial neural networks (ANNs) were created to mimic the human brain. ANNs perform classification excellently and are used in a variety of pattern classification applications, which makes them a suitable choice for ASR systems. Most modern speech recognition systems use ANNs for speech signal classification [13]. Despite the importance of ASR, research on Urdu ASR is quite a new domain for two reasons: (1) the lack of Urdu language studies research and (2) the lack of a balanced speech dataset.
Some work has studied ASR for Urdu [13-20]. Additionally, numerous works [12,17,21-23] have addressed spoken digit recognition for Urdu. The abovementioned works have shown significant results. However, the considered datasets were not comprehensive; thus, the trained machine learning models can be further improved by using larger datasets. To address this issue, we propose a comprehensive Audio Urdu Digits Dataset (AUDD). The main contributions of our work are as follows.

•
We present AUDD, a dataset of spoken Urdu digits 0-9. We collected these voice samples from 740 people of different age groups, ranging from 5 to 89 years.
•
We perform extensive experiments with different networks to provide classification accuracies that serve as baselines.
•
We propose a convolutional neural network (CNN) for classification that shows impressive performance despite being a simple CNN, compared to the complex flavors of EfficientNet.
The remainder of this paper is organized as follows: Section 2 gives an overview of relevant previous work, Section 3 describes the data collection in detail, Section 4 presents the models used, Section 5 describes the details of the experiments and baseline results, and finally Section 6 concludes the overall contribution.

Related Work
There have been several studies on spoken digit recognition in different languages. Messer et al. [24] proposed the first complete audio-visual database of spoken digits, known as Extended XM2VTS, which is made up of recordings from 295 people of all ages and genders. Each speaker's audio recording consists of two continuous digit strings. Bailly et al. proposed the BANCA database [25], which has utterances in English, Spanish, Italian, and French, among other languages. The goal of this database was to test person verification in a variety of settings, including controlled, degraded, and adverse conditions. Jain et al. [26] used the Discrete Cosine Transform and Local Binary Patterns for an audiovisual digit recognition challenge. Brahme et al. [27] created an in-house database of Marathi spoken digits to explore lip-reading movement and digit classification. Wazir et al. [28] developed a speech recognition solution for Arabic numerals using an RNN: MFCCs were used to extract features from the audio files, and an LSTM was used to accomplish the recognition task. Their model is capable of classifying noisy spoken digits, and the results are promising, with 69 percent overall recognition accuracy. Dalsaniya et al. [29] presented a novel, publicly available audio dataset of spoken digits in the Gujarati language, along with some preliminary results, and also compared their classification algorithm with an English-language database. Many works [12,17,22] have considered Urdu voice data for various applications. Hasnain et al. [17] described a frequency analysis of spoken Urdu numbers from 0 to 9. In that work, the authors collected voice samples from 15 speakers who spoke the same digit set from 0 to 9; each recording was trimmed to 0.9 min. The data were recorded using a microphone connected to a Windows-based PC and saved in .wav file format.
The initial processing of the data was done using Simulink and MATLAB; Fourier descriptors and correlation calculations were performed using the same tools. The same words were spoken by the same and different people, and words were distinguished using correlation. A feed-forward neural network developed in MATLAB was used to classify the voices with 100 percent accuracy. Although the authors obtained maximum accuracy and found a trade-off between goal, epochs, and learning rate, the dataset used in this work was too small.
All of the above works [12,17,21,22] have used Urdu datasets with limited data (e.g., [17] used a dataset of 150 samples). In contrast, our dataset is comprehensive and diverse and consists of 25,518 samples, which is helpful for classification.

Data Collection
Urdu is widely spoken in South Asia and is the national language of Pakistan. Our basic motivation is to predict digits from speech and to make automatic spoken-digit classification efficient and robust. Moreover, no large dataset is publicly available; to the best of our knowledge, ours is the largest publicly available dataset and can be used for many research purposes.
People from different age groups participated in our data collection. There were a total of 740 participants, with ages ranging from 5 to 89 years. Most participants were 5 to 14 years old, as shown in Figure 1. Comparing male and female participants, males were slightly more numerous, as shown in Figure 2. We also checked the number of samples in each class: most classes contained more than 2000 samples, the overall number of samples was 25,518, and the class-wise distribution is shown in Figure 3. For diversity, we recorded each sample in different environments, i.e., participants recorded samples in a normal environment, a noisy environment, a closed-room environment, at different heights, and at different distances. Furthermore, we asked the participants to record samples with different characteristics, such as slow voice, fast voice, low loudness, and high loudness.

Visualization
We visualized the samples in different ways, i.e., waveform, spectrogram, and Mel spectrogram. To visualize the waveform of the samples of each class, we first loaded the samples using the Librosa library [30] and show each digit's audio as a waveform in Figure 4. With the samples already loaded, we then computed the Mel spectrogram using Librosa and show each sample's Mel spectrogram in Figure 5. Finally, we computed the log of the Mel spectrogram using the same library and visualized it in Figure 6. From these visualizations, we can see that each representation (waveform, spectrogram, or Mel spectrogram) of a single digit is unique. Based on this uniqueness, it is easy for a CNN to learn these distinguishing features; thus, we applied a CNN and other neural network models.

Models
In this section, we explain various models used for the classification of audio Urdu digits.

Support Vector Machine
SVM is a supervised machine learning approach that can be applied to classification and regression problems; however, it is mostly used for classification. In the SVM algorithm, each data item is plotted as a point in an n-dimensional space (where n is the number of features), with each feature value corresponding to a particular coordinate. Classification is performed by finding the hyperplane that best separates the different classes. The individual observations that lie closest to this hyperplane are known as support vectors [31].
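A minimal sketch of this classifier in Scikit-learn, using its bundled handwritten-digits dataset as a stand-in for spectrogram features (the dataset choice and split are illustrative assumptions, not the paper's setup):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small bundled 10-class dataset standing in for flattened spectrogram features
X, y = load_digits(return_X_y=True)  # 1797 samples, 64 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC()  # default parameters, as used for the baseline in this paper
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```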

Multilayer Perceptron
MLP is an artificial neural network that comprises multiple layers of neurons in a feed-forward fashion. A multilayer perceptron has three or more layers: an input layer, an output layer, and one or more hidden layers. The neurons in an MLP have a nonlinear activation function, and each layer is fully connected to the next. Several perceptrons are combined, using non-linear/linear activation functions, to generate a decision boundary, and each perceptron provides a non-linear mapping to a new dimension [32].

EfficientNet
EfficientNet is a model scaling method recently developed by Google [33] for scaling up CNNs using a simple but highly effective compound coefficient. Unlike traditional methods that scale individual network dimensions, such as width, depth, and resolution, EfficientNet scales each dimension uniformly with a fixed set of scaling coefficients. In practice, scaling individual dimensions improves model performance, but balancing all dimensions of the network with respect to the available resources improves overall performance more effectively.
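The compound scaling rule from the EfficientNet paper can be written down directly: depth, width, and resolution grow as α^φ, β^φ, and γ^φ for a single compound coefficient φ, with the constants constrained so that α·β²·γ² ≈ 2 (FLOPs roughly double per unit of φ). The constants below are those reported in the EfficientNet paper:

```python
# Constants found by grid search in the EfficientNet paper,
# subject to alpha * beta**2 * gamma**2 ~= 2
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):  # B0 corresponds to phi = 0; larger phi gives B1, B2, ...
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```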

Experiments
In this section, we explain the preprocessing, the training setup, and the baseline results produced by the different models.

Preprocessing
We perform a single preprocessing step on the audio samples, namely computing the Log-Mel spectrogram [34]. To compute it, a Fast Fourier Transform (FFT) is first applied to the windowed audio signal, as in Equation (1):

S_i(p) = ∑_{n=0}^{N−1} s_i(n) h(n) e^{−j2πpn/N},   p = 0, 1, …, N − 1,   (1)

where h(n) is the N-sample-long analysis window, s_i(n) are the time-domain samples of frame i, S_i(p) are the resulting frequency-domain samples, and N is the FFT size. After obtaining the frequency-domain samples, the next task is to map their amplitudes onto the Mel scale of perceptual excitation using a filter bank: a Mel filter bank converts the spectrum to a Mel spectrum. The Mel scale is based on human hearing frequencies [35] and is used for tone measurement; the Mel frequency corresponding to a frequency f (in Hz) is given by Equation (2):

mel(f) = 2595 log10(1 + f/700).   (2)
We performed the above calculation using the Librosa library. After computing the Mel spectrogram, we resized it to 32 × 32 using the resize function of OpenCV [36] and saved it as an image.
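As a sanity check, the pipeline can be sketched in plain NumPy; Librosa implements an optimized version of the same steps. The framing/FFT follows Equation (1), and the filter bank is built on the standard Mel scale mel(f) = 2595 log10(1 + f/700). All parameter values below (sampling rate, FFT size, hop, number of bands) are illustrative assumptions, and the final 32 × 32 resize is omitted:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=8000, n_fft=512, hop=128, n_mels=32):
    """Windowed FFT per frame, Mel filter bank, then log compression."""
    window = np.hanning(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # power spectrum per frame

    # Triangular Mel filter bank with n_mels bands spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)

    mel_power = power @ fbank.T
    return np.log(mel_power + 1e-10)  # shape: (n_frames, n_mels)

# Example: 1 second of a 440 Hz tone
sr = 8000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
S = log_mel_spectrogram(sig, sr=sr)
print(S.shape)
```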

Training Setup
We used multiple models. For SVM and MLP, we used Scikit-learn [37,38]. SVM was used with default parameters; MLP was used with two hidden layers of dimensions 512 and 256 and a maximum of 200 iterations.
We used the Keras library [39] for the built-in deep learning models; the last layer of each model was removed, and a new last layer with 10 classes was added, as our dataset has only ten classes. When loading the models, the weights argument was set to None, since we did not want pretrained models. For each model, we used a batch size of 256 and a learning rate of 0.01 with the Adam optimizer [40].
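A minimal sketch of this head-replacement setup in tf.keras; the choice of EfficientNetB0 (the smallest flavor) and the global-average-pooling layer are illustrative assumptions, and the same pattern applies to the other built-in models:

```python
import tensorflow as tf

# EfficientNetB0 body without its ImageNet head, randomly initialised (weights=None)
base = tf.keras.applications.EfficientNetB0(
    weights=None, include_top=False, input_shape=(32, 32, 3))

# New 10-class head for the ten Urdu digits
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
out = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(base.input, out)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
print(model.output_shape)
```

Training would then proceed with `model.fit(..., batch_size=256)` on the 32 × 32 spectrogram images.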
We also devised a simple 3-layer convolutional neural network (CNN). Interestingly, it has shown impressive performance despite being a simple network. Further architectural details are given in Table 1.
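Since Table 1 is not reproduced here, the following Keras sketch is only illustrative of the overall shape of such a network: three convolutional layers with small kernels and small pooling windows on 32 × 32 single-channel spectrogram inputs, ending in a 10-way softmax. The filter counts are hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative 3-conv-layer CNN; filter counts are assumptions, not Table 1.
model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                       # 32x32 log-Mel images
    layers.Conv2D(32, kernel_size=3, activation="relu"),   # small receptive field
    layers.MaxPooling2D(pool_size=2),                      # small pooling window
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(128, kernel_size=3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                # ten digit classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```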

Classification Results
We performed experiments with the different models. Each experiment was repeated three times, and the average accuracy with variance is reported in Table 2. In Table 2, the Support Vector Machine [31] and Multilayer Perceptron [13] are multi-class classifiers, CNN is our proposed architecture, and the other models are the flavors of EfficientNet ranging from B0 to B7. Accuracy, defined as

A = C / T,

where A is the accuracy, C is the number of correctly recognized samples, and T is the total number of samples, was used for evaluating the performance. Furthermore, we also show the accuracy at each epoch for all models in Figure 7. Among them, our proposed CNN model showed better performance as well as faster convergence, as seen in Figure 7. To evaluate the effectiveness of our proposed CNN on different languages, we performed experiments on Gujarati [29] and English [41,42] spoken digits. Our proposed CNN outperformed the Gujarati digit model by an absolute 22% accuracy. On the English digit dataset, the proposed CNN outperformed CNNDigit Reco [43], SVM [44], and random forest [44] by absolute margins of 19.3%, 7.3%, and 0.3%, respectively. The Gujarati and English digit model comparisons are given in Tables 3 and 4. Furthermore, we checked the effectiveness of the proposed CNN on the Urdu Corpus [11] dataset; it outperformed LDA, SVM, and RF by absolute accuracy margins of 34.53%, 24.53%, and 34.53%, respectively, as shown in Table 5.

Conclusions
To the best of our knowledge, this is the first study that provides the largest publicly available Audio Urdu Digits dataset with diverse characteristics. The dataset comprises 25,518 samples of 10 classes (0-9) collected from 740 participants of diverse age groups under different environmental conditions. The comprehensive data analysis in the form of waveform, spectrogram, and Mel spectrogram shows that a CNN with a small receptive field, a larger number of filters, and a small max-pooling window can improve the results. We have also provided baseline results of this novel study for the research community. Furthermore, to evaluate the effectiveness of the proposed CNN, it was also tested with two different language digit datasets, i.e., Gujarati and English, and promising results were obtained. Data Availability Statement: Our released dataset can be found at Urdu Audio Dataset (https://www.kaggle.com/zeroaishazero/urduaudiodigit). Code is available: github Code.
Acknowledgments: Apart from my own efforts, the success of this research work depends mainly on the encouragement and guidance of my supervisor Yao Shen. His positive outlook and confidence in my research inspired me and gave me confidence. I also want to express my greatest appreciation to Rodrigo Gantier and Hassam Ahmad; I cannot thank them enough for their tremendous support and help.

Conflicts of Interest:
The authors declare no conflict of interest.