Introducing Urdu Digits Dataset with Demonstration of an Efficient and Robust Noisy Decoder-Based Pseudo Example Generator

In the present work, we propose a novel method that uses only a decoder to generate pseudo-examples, and that has shown great success in image classification tasks. The proposed method is particularly useful when only a limited quantity of data is available, as in semi-supervised learning (SSL) or few-shot learning (FSL). While most previous works have used an autoencoder to improve classification performance for SSL, a single autoencoder may generate confusing pseudo-examples that degrade the classifier's performance. On the other hand, models that rely on a full encoder–decoder architecture for sample generation can significantly increase computational overhead. To address these issues, we propose an efficient means of generating pseudo-examples by training only the generator (decoder) network, separately for each class, which has proven effective for both SSL and FSL. In our approach, a decoder is trained on the samples of each class using random noise as input, and multiple samples are then generated with the trained decoder. Our generator-based approach outperforms previous state-of-the-art SSL and FSL approaches. In addition, we release the Urdu digits dataset, consisting of 10,000 images (8000 training and 2000 test) collected through three different methods for the sake of diversity. Furthermore, we explore the effectiveness of the proposed method on the Urdu digits dataset under both SSL and FSL, where it yields improvements of 3.04% and 1.50% in average accuracy, respectively, illustrating its superiority over current state-of-the-art models.

With the emergence of big data technology, unlabeled data are abundantly available on a large scale [32,33], whereas only a handful of labeled samples are available [34]. Labeling a large dataset can be expensive, time-consuming, and often unreliable [31,32,34–38]. In this regard, semi-supervised learning (SSL) helps to auto-label unlabeled datasets using a few labeled data samples. There are several ways to label unlabeled data; in the conventional method, a model is first trained on the labeled data, the trained model is then employed to assign pseudo-labels to the unlabeled data, and finally the initial labeled data and the pseudo-labeled data are merged. Thus, SSL can significantly reduce errors and human annotation effort. However, SSL can produce erroneous results if a significant distribution gap exists between the labeled and unlabeled data.
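The conventional pseudo-labeling pipeline described above can be sketched as follows. This is a minimal illustration: the nearest-centroid classifier and the function name `pseudo_label_merge` are stand-ins of our choosing, not the models used in the paper.

```python
import numpy as np

def pseudo_label_merge(x_labeled, y_labeled, x_unlabeled):
    """Conventional SSL pseudo-labeling: fit a simple model on the labeled
    set, label the unlabeled set, and merge both. A nearest-centroid
    classifier stands in here for whatever real model is trained."""
    classes = np.unique(y_labeled)
    # per-class mean of the labeled samples
    centroids = np.stack([x_labeled[y_labeled == c].mean(axis=0) for c in classes])
    # assign each unlabeled sample the class of its nearest centroid
    dists = np.linalg.norm(x_unlabeled[:, None, :] - centroids[None, :, :], axis=2)
    pseudo = classes[np.argmin(dists, axis=1)]
    # merge the original labeled data with the pseudo-labeled data
    x_all = np.concatenate([x_labeled, x_unlabeled])
    y_all = np.concatenate([y_labeled, pseudo])
    return x_all, y_all
```

If the labeled and unlabeled distributions diverge, the centroid (or any model fit on the labeled set alone) assigns wrong pseudo-labels, which is exactly the failure mode noted above.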
To resolve this issue, various data augmentation methods [2] have been applied to the few available labeled samples to match the diversity between labeled and unlabeled data. In recent years, several studies have pursued semi-supervised approaches such as manifold embedding over a pre-constructed graph of unlabeled data [2,35]. In a separate study, the latent representation of a variational autoencoder (VAE) was split into two parts, and the autoencoder was regularized by imposing a prior distribution that made the two parts independent [37]. More recently, a new approach was proposed that exploits the VAE by adding a classification layer on top of the encoder and merging it with the re-sampled latent layer of the decoder [38].
As mentioned earlier, in some scenarios only a handful of labeled data are available and no unlabeled data at all. Obtaining good performance from such a small labeled set is challenging. Few-shot learning (FSL) is an emerging technique applicable in such cases, and several FSL approaches have been proposed in recent years. Notably, in [34], a large network was first trained using a few samples; knowledge distillation, which transfers knowledge from the large model to a small model, was then optimized to generate pseudo-examples. Along similar lines, a large network was trained for each class separately and then distilled to a small network using a linear function for both networks [39]. In [40], both labeled and unlabeled data were trained simultaneously in a supervised manner: pseudo-labels were first assigned to the unlabeled data, after which a denoising autoencoder and dropout were utilized.
However, the aforementioned methods suffer from mediocre accuracy and robustness. To overcome this, in the present work we propose an efficient and robust model that combines FSL and semi-supervised learning in a unique way, significantly improving the accuracy of the model.
The key contributions of the present work can be summarized as follows:
• We propose an efficient way of generating pseudo-examples by using only the decoder network, trained separately for each class, which has proven effective for both SSL and FSL.
• In the proposed approach, the decoder is trained on the samples of each class using random noise, and multiple samples are generated using the trained decoder.
• Furthermore, we are the first to release a manually labeled Urdu digits dataset, consisting of 10,000 images in total, collected through various methods for diversity (https://www.kaggle.com/teerathkumar142/Urdudigits, accessed on 11 April 2022).
• A varied range of experiments was performed, specifically on the Urdu digits dataset, which elucidates the competitiveness and superiority of the proposed network over existing state-of-the-art models.
• Our generator-based approach outperforms previous state-of-the-art SSL and FSL approaches, obtaining absolute average accuracy improvements of 3.04% and 1.50%, respectively.

Semi-Supervised Learning
Semi-supervised learning (SSL) can be helpful when significantly fewer labeled data are available than large-scale unlabeled data. In recent years, there has been tremendous progress in SSL, so the relevant work is briefly reviewed here. Recently, an SSL-based encoder–decoder network was extended to a VAE that attaches the classification, mean, and standard deviation layers to the topmost encoder layer and combines them with the re-sampled latent layer of the decoder [38]. In this architecture, new samples are generated from Gaussian noise, using the mean and standard deviation, and fed to the classifier, which has shown impressive performance. In [41], a joint framework combining representation learning and supervised learning was proposed and applied to SSL; during training, both the encoder loss and the supervised classifier loss were minimized. In [37], the latent representation of an autoencoder was divided into two parts, one for content and the other for style, and it was concluded that the content part of the latent representation is beneficial for classification; that work demonstrated better performance than a vanilla autoencoder. Along similar lines, in [4], an encoder–decoder architecture was first trained for each class; the encoder was then removed, and noise was passed to the decoder several times to generate diverse samples. Our experimental results suggest that training only a decoder is an effective strategy for generating samples for each class, and that replacing the encoder–decoder of [4] with a decoder-only network significantly reduces training time and computational overhead.

Few-Shot Learning
Few-shot learning (FSL) can be effective when the availability of labeled data is limited and the model must learn from this small amount of data. Although numerous methods have been proposed for FSL, we cover only the relevant works for a fair comparison. In [34], a relatively large network was used as a reference model trained on a few labeled samples, and knowledge distillation from the large model to a small model was employed. In addition, pseudo-examples were generated and optimized using a high-fidelity optimization procedure. This method showed that a relatively small network trained on fewer labels can outperform the initially trained larger network. In [39], a linear predictor was trained for each class separately and simultaneously distilled to the target model. Subsequently, a bidirectional distillation method was employed, passing each sample to both the target and the reference model. During training, the predictor of the specific class was activated, trained, and distilled to the target model using the MSE loss. This linear distillation technique achieved a significant performance improvement. Various other SSL and FSL methods also exist that improve performance. In the current study, the proposed approach can be used for FSL by generating pseudo-examples. To this end, we designed a novel FSL technique to improve performance and achieve state-of-the-art results.

Proposed Approach
In the present work, we propose a novel pseudo-example generation technique to improve semi-supervised and few-shot learning performance. The schematic of the basic architecture of our approach is shown in Figure 1, where we train a decoder for a single class. Once it is trained, we pass noise drawn from a normal distribution through it to generate samples of that specific class, and we repeat this for all classes. As shown in Figure 2, the overall process thus employs a separate decoder for each class.

Decoder Architecture
As previously mentioned, we use only a decoder to generate the examples of each class. While a variety of decoder architectures exist, in the present work we chose a standard dense-layer decoder architecture, in which input noise of dimension d is mapped to examples of a specific class C_i. The decoder uses five layers with dimensions of 10, 2000, 500, 500, and 784, as shown in Figure 3. We trained the decoder using stochastic gradient descent (SGD) [42]. Each layer uses ReLU activation and kernel initialization [43] with a scale parameter of 1/3 and a normal distribution.
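A minimal NumPy sketch of this dense decoder follows. It assumes the first listed dimension (10) is the input noise size and that the scale-1/3 initialization divides the variance by the fan-in, variance-scaling style; the exact initializer of [43] may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
LAYER_SIZES = [10, 2000, 500, 500, 784]  # noise dim -> hidden layers -> 28x28 image

def init_weights(sizes, scale=1 / 3):
    """Normal-distribution kernel init with scale 1/3 (fan-in assumed)."""
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        std = np.sqrt(scale / fan_in)
        w = rng.normal(0.0, std, size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((w, b))
    return params

def decode(z, params):
    """Map noise z of shape (batch, 10) to images of shape (batch, 784)."""
    h = z
    for w, b in params:
        h = np.maximum(h @ w + b, 0.0)  # ReLU on every layer, as in the text
    return h
```

A usage example: `decode(rng.standard_normal((5, 10)), init_weights(LAYER_SIZES))` yields a batch of five 784-dimensional (28 × 28) outputs.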

Training
During training, we set the batch size to 5. Two learning rates are used, 0.1 for the MNIST dataset and 0.04 for Fashion-MNIST, with a momentum of 0.9. In all cases, the standard MSE loss of Equation (1) is minimized.
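These ingredients can be written out as follows. Equation (1) is assumed to be the standard mean-squared error, and `sgd_momentum_step` is an illustrative single-tensor update, not the paper's full training loop.

```python
import numpy as np

def mse(pred, target):
    """Standard mean-squared error, assumed to be Equation (1)."""
    return float(np.mean((pred - target) ** 2))

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD update with momentum 0.9 (lr 0.1 for MNIST, 0.04 for
    Fashion-MNIST in the text). Returns the updated weights and velocity."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

With zero initial velocity, a gradient of 2.0, and lr 0.1, a weight of 1.0 moves to 0.8 in one step.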

Work Flow
In this section, we describe the overall workflow and the corresponding algorithm of our proposed approach. Let X_n and Y_n be the limited original samples and their corresponding labels, respectively, and let X_c be the examples belonging to a specific class c. In the proposed workflow, described in Algorithm 1, we first train a decoder on X_c, with normal noise [44–46] as input, producing X_c^(i) as output (line 5 of Algorithm 1). Once training is complete, we pass normal noise to the trained decoder N times to obtain N examples of that particular class c (line 5 of Algorithm 1). To obtain the corresponding labels Y_c of class c for these generated examples, we append an N-dimensional vector filled with c to Y (line 6 of Algorithm 1). In this way, we generate N examples with their corresponding labels for a specific class c; the generated sets start out empty and are filled class by class. The whole process is then repeated for all classes, as shown in Figure 2 and in the loop of Algorithm 1. Finally, we obtain a large amount of labeled data in the form of X and Y, the examples and their corresponding labels (line 7 of Algorithm 1). In the end, FSL and SSL benefit from the data X and Y generated by the proposed pseudo-sample generation model.
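The workflow of Algorithm 1 can be sketched as below. The `train_decoder` stand-in simply memorizes the class mean so the sketch stays runnable; in the paper it is the dense decoder of Figure 3 trained with SGD on the class samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_decoder(x_class):
    """Stand-in for decoder training: this 'decoder' just remembers the
    class mean and perturbs it with the input noise."""
    mean = x_class.mean(axis=0)
    return lambda noise: mean + 0.01 * noise  # noise -> class-like sample

def generate_pseudo_examples(x, y, n_per_class, dim):
    """Algorithm 1 sketch: one decoder per class, N generated samples each."""
    xs, ys = [], []
    for c in np.unique(y):
        decoder = train_decoder(x[y == c])            # train on X_c
        noise = rng.standard_normal((n_per_class, dim))
        xs.append(decoder(noise))                     # N samples of class c
        ys.append(np.full(n_per_class, c))            # N labels equal to c
    return np.concatenate(xs), np.concatenate(ys)     # merged X and Y
```

The generated X and Y can then be fed to any downstream SSL or FSL classifier alongside the original labeled data.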

Dataset Motivation
The Urdu language is widely used in Asian countries, in particular Pakistan, India, Bangladesh, and Afghanistan [47], and is the national language of Pakistan. In addition, the Urdu, Arabic, Pashto, and Persian languages share various similarities. Owing to the many applications of Urdu numerals [48–50], chiefly the automated reading of postal codes and cheque numbers, digitization, and the preservation of old manuscripts, the acquisition and labeling of Urdu number datasets are of utmost importance and are the driving motivation of the current study. However, no prior work in the literature has been geared toward collecting and labeling a large Urdu digits dataset. In the present work, to the best of our knowledge, we are the first to release a manually labeled, extensive, and challenging Urdu digits dataset consisting of a total of 10,000 images, with 8000 training and 2000 test images. In the Urdu digits dataset, some digits, in particular 3 and 4, are almost symmetric in shape, and digits 7 and 8 are reflections of one another. Such partially or fully symmetric cases make the dataset additionally challenging for a neural network to learn.

Dataset Collection
We used three different methods to collect data: the Microsoft (MS) paint tool, online search, and paper-based collection from different participants to increase the variability in the dataset.

Microsoft (MS) Paint-Based Collection
In the MS Paint-based collection, we set up the MS Paint tool with the window fixed at 28 × 28 pixels and filled with a black background color. Numbers were written in this fixed window by five different people using various brush sizes. Following this approach, a set of 37,000 Urdu digit images was generated and collected. Some representative samples collected with the MS Paint tool are shown in Figure 4.

Online Data Collection
To add diversity to the dataset, we used a Python scraper to obtain images from the internet using the keyword "Urdu digit", and then asked ten different users to crop the Urdu numbers. In this way, we collected around 3000 additional images.

Paper-Based Data Collection
To further increase the diversity of the data, we asked ten different participants to write multiple numbers on an A4 page, photograph the page with a mobile camera, and then crop out the numbers. Using this setup, we collected around 60,000 images. Some representative samples from the paper-based collection procedure are depicted in Figure 5. Overall, the Urdu digits dataset contains 10,000 images, consisting of 8000 training and 2000 test images. After collecting data through the aforementioned procedures, we pre-processed the image data. First, for all collected images, we converted the digit to white and the background to black. We then resized each image to 28 × 28 pixels while maintaining the aspect ratio. After resizing, we normalized the images to the range 0 to 1 by dividing by 255. Following pre-processing steps similar to those of the MNIST and Fashion-MNIST datasets, we kept the Urdu digits dataset in grayscale and did not apply any mean centering to the collected images.
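The pre-processing steps can be sketched as follows. The nearest-neighbour resize and the assumption of a square, dark-digit-on-light input are our illustrative choices, since the text does not name the interpolation method.

```python
import numpy as np

def preprocess(img):
    """Pre-processing sketch: invert to a white digit on black, resize to
    28x28 (nearest-neighbour, square input assumed), and scale to [0, 1]."""
    img = 255 - img                      # white digit, black background
    h, w = img.shape
    rows = np.arange(28) * h // 28       # nearest-neighbour row indices
    cols = np.arange(28) * w // 28       # nearest-neighbour column indices
    img = img[rows][:, cols]             # resize to 28x28
    return img.astype(np.float32) / 255.0  # normalize by dividing by 255
```

For example, a 56 × 56 all-white page region (pixel value 0 after inversion would be 255) maps to a 28 × 28 array of ones.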

Experiment and Results
In this section, we report our experimental findings in order to demonstrate the performance of the proposed model. We followed various settings in terms of datasets, CNN architectures, and the model parameters, which are detailed in the subsequent sections. We use Equation (2) to calculate the entropy.
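Assuming Equation (2) is the standard Shannon entropy with a natural logarithm (the text does not reproduce the equation here), it can be computed as:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector, assumed to be Equation (2):
    H(p) = -sum_i p_i * log(p_i). eps guards against log(0)."""
    p = np.asarray(p, dtype=np.float64)
    return float(-np.sum(p * np.log(p + eps)))
```

A uniform distribution over k classes gives the maximum value log(k), while a one-hot distribution gives (approximately) zero.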

Datasets
To check the effectiveness of the proposed approach, we used the MNIST [51] and Fashion-MNIST [52] datasets under both semi-supervised and few-shot learning. The MNIST dataset has 60,000 training and 10,000 test samples of ten digit classes (0 to 9) as 28 × 28 grayscale images. The Fashion-MNIST dataset, which contains images of clothing and accessories, likewise has 60,000 training and 10,000 test samples of ten classes as 28 × 28 grayscale images. Representative samples of the MNIST and Fashion-MNIST datasets are shown in Figures 6 and 7, respectively. Additionally, we performed an extensive analysis on the newly introduced Urdu digits dataset.

Result from Semi-Supervised Learning
In this section, we report the results obtained with SSL. We implemented a CNN [4,38] for the SSL experiments, based on the autoencoder with mean and standard deviation layers. For the experiments, we used a total of 100 and 1000 labels from the MNIST and Fashion-MNIST datasets, respectively. For our Urdu digits dataset, we used various numbers of labels, i.e., 100, 200, 500, 1000, and 2000. Using our proposed method, we generated the data and subsequently applied the SSL model.
For the MNIST and Fashion-MNIST datasets, various state-of-the-art models, including CNNs [38], CNNs (MS) [38], and CNNs (AE) [4], were directly compared with the proposed model. In the tables, CNNs denotes the supervised model, CNNs (MS) the semi-supervised model based on the mean and standard deviation layers of the autoencoder, and CNNs (Our) our semi-supervised method using pseudo-examples. The accuracy values obtained from these models are presented in Tables 1 and 2 for MNIST and Fashion-MNIST, respectively. For the MNIST dataset, CNNs (MS) provides the best result with 100 labels, at 81.10 ± 6.16%, as shown in Table 1. However, when the number of labels increases to 1000, our proposed model achieves the best accuracy of 95.11 ± 2.30%, a 1.40% improvement over CNNs (MS). For the Fashion-MNIST dataset, our model provides the best result with 100 labels, achieving an accuracy of 74.52 ± 1.42%, whereas with 1000 labels, CNNs (MS) provides the best result with an accuracy of 83.67 ± 1.09%. In almost all cases, our approach improves accuracy by over 2%, except for the 1000-label MNIST case. Overall, for both datasets, the proposed model demonstrates its superiority by providing state-of-the-art results. Finally, for the Urdu digits dataset, accuracy values for various numbers of labels are presented in Table 3. Notably, the accuracy improves as the number of labels increases: with a relatively small number of labels, 20, the accuracy reaches 84.90%, whereas it attains an impressive 96.70% with 200 labels. In short, the proposed model demonstrates superior performance with a reasonable amount of labeled data for the Urdu digits dataset.

Table 1. Comparison of accuracy values (in %) between various state-of-the-art models and the proposed model evaluated on the MNIST dataset.

Results from Few-Shot Learning
In this section, we report the results obtained from FSL. For comparison, a large knowledge-distillation model [34] was considered and trained on a few labeled samples; pseudo-examples were generated and then optimized and selected using high-fidelity techniques. That method was shown to outperform the original large model using a relatively small model on a few labeled samples. In our approach, we utilize a relatively small CNN model [34], conduct experiments on various datasets, and compare the results between the two models. The results are presented in Tables 4–6 for the MNIST, Fashion-MNIST, and Urdu digits datasets, respectively. For the Fashion-MNIST dataset, as shown in Table 5, the proposed method with the FSL model outperforms all other current state-of-the-art models in terms of accuracy under the same network configuration and optimization scheme. Thus, our extensive experiments elucidate the superior accuracy of the proposed FSL method for various numbers of labels.

Parametric Study
In both the SSL and FSL experiments, the number of generated examples differs for different numbers of selected labels. The number of generated examples can thus be treated as a hyperparameter. We therefore conducted extensive experiments to determine the influence of the number of generated examples on the accuracy of the model. For the calculation of average accuracy, each experiment was repeated three times.

Conclusions
In summary, in the current study we proposed a novel approach that improves performance in generating pseudo-examples by addressing drawbacks in existing state-of-the-art approaches. The proposed model uses only a decoder network, which is easier and faster to train than a full encoder–decoder architecture. A further advantage of this strategy is that training a decoder on random noise and images can generate varied images of the same class, which is not possible with an encoder–decoder that only generates images corresponding to its training inputs. Furthermore, we are the first to release a manually labeled Urdu digits dataset collected through various methods. To show the efficacy of the proposed approach, we tested the model extensively on different datasets with various sample sizes using both SSL and FSL. The comparison of average classification accuracy demonstrates the superiority of the proposed model, which outperforms current state-of-the-art models for both SSL and FSL. Future work could be geared toward designing an efficient encoder–decoder model to replace the decoder-only model and toward building other valuable datasets.