In this section, we assess in detail the approaches used to recognize Urdu handwritten characters. We group the tasks and issues related to character-level analysis into two subsections: (i) Urdu handwritten character recognition and (ii) Urdu handwritten numeral recognition.
3.1. Urdu Handwritten Character Recognition
Handwritten text recognition at the character level is challenging because writing styles vary widely, even for a single author. The literature on character-level recognition in the Urdu script shows that the artificial neural network (ANN) and its variants are widely used. An ANN [
32] is a collection of nodes (also known as artificial neurons) linked with each other. The links transmit signals from one neuron to another within the network. Each neuron processes the signals it receives and propagates the result to the neurons connected in subsequent layers. The behavior of an ANN is shaped by the information flowing through it, because the network is usually trained on inputs paired with labeled outputs.
Developing a generic ICR that handles any language is challenging, since different languages exhibit different characteristic features, which makes such a system hard to generalize. To overcome this problem, a novel approach was proposed in [
33] exploring how the character set of any language can be represented by primitive geometrical strokes. One promising feature of the approach is that the recognizer (an artificial neural network) has to be trained only once. The character set is described in terms of geometrical strokes in an XML file, which is used to train the neural network once rather than retraining it for each word in the language.
Figure 2 shows a set of thirteen basic geometrical strokes. For evaluation, 25 handwritten Urdu text samples were tested, achieving a success rate of 75–
. One limitation of this approach is that it does not apply to words containing dots and diacritics.
Because the Urdu character (or alphabet) set is large, some major strokes are inherently similar, as shown in
Figure 2. This similarity is one of the main causes of incorrect recognition of Urdu handwritten text. With this in mind, in [
21], the authors divided the Urdu character set into four groups according to the number of strokes, as shown in
Figure 3. The authors performed an online Urdu ICR considering single-stroke characters only. Some novel features (shown in
Figure 4) were extracted and then fed to three different classifiers, namely the back propagation neural network (BPNN), the probabilistic neural network (PNN), and a correlation-based classifier. The proposed approach was tested on 85 instances of single-stroke characters taken from 35 writers of different age groups. The PNN classifier achieved the highest accuracy of the three, 95%. Unlike the BPNN, PNN-based classifiers require no initial training, which is the reason given for their higher accuracy.
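The remark that a PNN needs no initial training can be illustrated with a minimal sketch, assuming the common Parzen-window formulation of a PNN: the training samples are simply stored, and each class is scored by the average Gaussian kernel response to the query. The two-dimensional feature values, the class labels, and the smoothing parameter `sigma` below are hypothetical and are not taken from [21].

```python
import math


def pnn_classify(x, train_samples, train_labels, sigma=0.5):
    """Return the label whose Parzen-window density estimate at x is largest.

    No weights are fitted beforehand: the stored training samples act as the
    pattern layer, and classification is a single pass over them.
    """
    kernel_sums = {}
    counts = {}
    for sample, label in zip(train_samples, train_labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, sample))
        kernel = math.exp(-sq_dist / (2.0 * sigma ** 2))
        kernel_sums[label] = kernel_sums.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    # Average per class so that classes with more samples are not favored.
    scores = {label: kernel_sums[label] / counts[label] for label in kernel_sums}
    return max(scores, key=scores.get)


# Hypothetical 2-D stroke features for two single-stroke characters.
train_samples = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["alif", "alif", "bay", "bay"]

prediction = pnn_classify((0.95, 1.0), train_samples, train_labels)
```

In this formulation the only tunable quantity is `sigma`, whereas a BPNN has to fit its weights iteratively before it can classify anything.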
For isolated character recognition, the authors in [
22] proposed a technique that builds the feature vector by analyzing the primary and secondary strokes produced while writing Urdu characters in isolated form. The stroke features used to train the classifier included the following: the diagonal length of the bounding box; the sine-cosine angle ratio of the bounding box diagonal; the displacement between the first and last points of the stroke; the corresponding sine-cosine ratio of the angle between the first and last points; the total length (in pixels) of the primary stroke; and the total angle traversed. A linear classifier was applied to a dataset of five samples of each of 38 Urdu characters, i.e., a total of 190 characters, provided by two different writers who could write Urdu characters fluently. The classifier recognized the characters with an error rate of almost 6% because some characters share quite similar shapes (see
Figure 5) and were not correctly recognized.
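The stroke features listed above can be sketched for a single pen stroke as follows. This is an illustration under stated assumptions, not the implementation of [22]: the stroke is assumed to be a list of (x, y) pen positions, the "sine-cosine ratio" is represented here by the angle returned by `atan2`, and the "total angle traversed" is taken to be the sum of absolute turning angles between consecutive segments. The function name `stroke_features` is hypothetical.

```python
import math


def stroke_features(points):
    """Compute simple geometric features of one stroke given as (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)

    # Displacement vector from the first to the last sampled point.
    dx = points[-1][0] - points[0][0]
    dy = points[-1][1] - points[0][1]

    # Heading of every non-degenerate segment, used for the turning angle.
    headings = [
        math.atan2(q[1] - p[1], q[0] - p[0])
        for p, q in zip(points, points[1:])
        if p != q
    ]
    total_turn = 0.0
    for a, b in zip(headings, headings[1:]):
        # Wrap the heading difference into [-pi, pi) before accumulating.
        diff = (b - a + math.pi) % (2.0 * math.pi) - math.pi
        total_turn += abs(diff)

    return {
        "bbox_diagonal": math.hypot(width, height),
        "bbox_diagonal_angle": math.atan2(height, width),
        "end_displacement": math.hypot(dx, dy),
        "end_displacement_angle": math.atan2(dy, dx),
        "stroke_length": sum(math.dist(p, q) for p, q in zip(points, points[1:])),
        "total_turn": total_turn,
    }


# Hypothetical L-shaped stroke: right 3 units, then 4 units along y.
features = stroke_features([(0, 0), (3, 0), (3, 4)])
```

For this toy stroke the bounding-box diagonal and the end displacement coincide (both 5 units), while the stroke length is 7 units and the total turn is a quarter circle, which shows why the features are complementary.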
Similar work was reported in [
23] by considering the initial half forms of different Urdu characters. In this work, only those characters were considered that change their shape according to their position and context in a word.
Figure 6 depicts Urdu characters in their initial half forms, classified by the number of strokes. Almost 100 native Urdu writers and speakers were invited to write in Urdu script. The writers were given a stylus and a digitizing tablet, producing a dataset of 3600 instances of Urdu letters in the initial half form. A combination of multilevel one-dimensional wavelet analysis with the Daubechies wavelet [
34,
35,
36] was applied to extract features from these instances. Several neural networks with different configurations were trained for recognition purposes. Among these networks, BPNN provided a maximum recognition rate of 92%.
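Multilevel one-dimensional wavelet analysis can be sketched as repeated low-pass and high-pass filtering followed by downsampling. The sketch below assumes the four-tap Daubechies wavelet (D4), periodic boundary handling, and a one-dimensional input such as the sequence of pen x-coordinates; the source does not state the wavelet order, the number of levels, or the exact input signal, so all three are assumptions. The helper names `dwt_step` and `wavedec` are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Four-tap Daubechies (D4) low-pass filter and its quadrature-mirror high-pass filter.
LOW = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def dwt_step(signal):
    """One decomposition level: filter with periodic wrap-around, keep every 2nd output."""
    n = len(signal)
    approx, detail = [], []
    for k in range(0, n, 2):
        window = [signal[(k + i) % n] for i in range(4)]
        approx.append(sum(h * v for h, v in zip(LOW, window)))
        detail.append(sum(g * v for g, v in zip(HIGH, window)))
    return approx, detail


def wavedec(signal, levels):
    """Return [approx_L, detail_L, ..., detail_1] for the requested number of levels."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 4 or len(approx) % 2:
            raise ValueError("signal length must be even and at least 4 at every level")
        approx, detail = dwt_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


# Hypothetical pen x-coordinates of one stroke, resampled to 16 points.
x_coords = [math.sin(0.4 * i) for i in range(16)]
coeffs = wavedec(x_coords, levels=2)
# Concatenating all bands gives a fixed-length feature vector for a classifier.
feature_vector = [c for band in coeffs for c in band]
```

The coarse approximation band summarizes the overall shape of the stroke, while the detail bands capture progressively finer variations. In practice a wavelet library would typically replace the hand-written filters.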
The MDLSTM (multidimensional long short-term memory) neural network is an RNN (recurrent neural network) used for sequence learning and segmentation in multidimensional environments [
37,
38,
39]. This model was used for the first time in the work of [
26] for Urdu script recognition. One of the promising features of the model is that it can scan the input image in all four directions, thus reducing the chance of ambiguity. For evaluation purposes, the UPTI (Urdu Printed Text Image) dataset [
40] was used, which contains 10,000 scanned images of both Urdu handwritten and printed text. MDLSTM is a supervised technique; therefore, each input sample in the dataset is tagged and labeled with appropriate information. The dataset was divided as follows: 68% for training and 16% each for testing and validation. To evaluate the accuracy of the proposed approach, the Levenshtein edit distance [
41] was computed between the output text and baseline results and achieved an accuracy of
% as compared to the results reported in the works of [
42,
43], reporting
% and 89% accuracy, respectively.
Table 1 shows a comparison of the proposed approach on the UPTI dataset [
40] with other techniques.
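The Levenshtein edit distance used for evaluation counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch follows; the conversion from distance to a character-level accuracy (one minus the distance divided by the reference length) is an assumption made here for illustration, since the source does not state the exact formula used in [26].

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_accuracy(predicted, reference):
    """Assumed accuracy measure: 1 - distance / reference length, floored at 0."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, reference) / len(reference))


distance = levenshtein("kitten", "sitting")
accuracy = character_accuracy("abcd", "abcd")
```

Because Python strings are sequences of Unicode code points, the same functions apply unchanged to Urdu text, although combining marks are then counted as separate characters.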
Promising work was reported in [
46] in which Urdu handwritten text was recognized using the dataset UNHD (Urdu Nastaliq Handwritten Dataset) [
47]. This dataset is publicly available at https://sites.google.com/site/researchonUrdulanguage1/databases (UNHD Database). The dataset contains 312,000 words (including both Urdu script and Urdu numerals) written on a total of 10,000 lines by 500 writers of different age groups. The writers were directed to write on white A4 pages. Each was given six blank pages labeled with the author ID and the page number. A sample written page is shown in
Figure 7. Furthermore, to keep the data uniform, the writers were asked to copy a provided printed text.
To recognize the text, an approach based on a one-dimensional bidirectional long short-term memory (BLSTM) network was proposed; as a recurrent neural network (RNN), it can retain information about the preceding sequence. For evaluation, the dataset was split into 50% for training, 30% for validation, and 20% for testing. The approach achieved a 6–8% error rate, which the authors suggest could be improved using a two-dimensional BLSTM.
Table 2 gives the summary of the accuracy reported on common datasets in the Urdu domain.
In [
27], the authors proposed a novel approach for character-level recognition of Urdu text written in the Nastaliq font, combining a CNN (convolutional neural network) with MDLSTM. In the first phase, the CNN was used to extract characteristic features, which were then fed to the MDLSTM in the second phase. This approach outperformed the state-of-the-art systems on the UPTI dataset.
Table 3 shows the comparison of Urdu recognition results on the UPTI dataset.
3.2. Urdu Numeral Recognition
Recognizing handwritten numerals is easy for a human being, but a computer system needs an intelligent approach based on machine learning algorithms developed for this task. The stroke, length, width, orientation, and other geometrical features tend to change from one instance of a numeral to the next, even for the same author. These different writing styles introduce shape variations in Urdu numerals that may break the stroke primitives and change their topology. These issues make Urdu handwritten numeral recognition an active research area in the field of image processing. Unfortunately, there is no commercially available standard dataset of Urdu numerals. Because of this lack of resources, researchers have developed their own datasets and reported results on them. This section covers some notable work related to handwritten numeral recognition in the Urdu domain.
In [
51], different transformations of the Daubechies wavelet [
34,
35,
36] were applied for feature extraction from a dataset of about 2150 samples of handwritten Urdu numerals. For evaluation purposes, 2000 samples were used for training the neural network and 150 instances for testing. In order to decompose the images into different frequency bands, both the low-pass and high-pass filtering were applied at each phase of the Daubechies wavelet [
34,
35,
36] filtering. For classification purposes, BPNN was used and achieved an average recognition rate of
%, as shown in
Figure 8.
In [
52,
53], the authors presented the similarities and dissimilarities between Urdu and Arabic script with respect to the recognition of handwritten numeric data. A hybrid technique combining HMM and fuzzy rules was used to recognize the handwritten numerals of both Arabic and Urdu script. The dataset was prepared by inviting 30 trained users to write both the Urdu and Arabic numerals, yielding 900 samples in total. The system obtained 97%, 96%, and
% recognition rates using the fuzzy rule, HMM, and the hybrid approach, respectively. The authors also concluded that separating numerals from Urdu text in handwriting is still a challenging issue because of shape similarity; e.g., the first character of the Urdu script (Alif) and the Urdu numeral one have exactly the same shape. A new algorithm was proposed in [
54] to preprocess the complex input and preserve the shape of the actual input. Fuzzy association rules are used to link secondary strokes with their respective primary strokes. Different classifiers such as the hidden Markov model (HMM), fuzzy logic, the
k-nearest neighbor (KNN), hybrid fuzzy HMM, hybrid KNN fuzzy, and the convolutional neural network (CNN) were used for the classification. Statistical tests were applied to determine the significance of the classifiers’ results. Similarly, a newly developed OCR algorithm was introduced in the work reported in [
55] that used a semi-supervised multi-level clustering for categorization of the ligatures. Classification was performed using four machine learning techniques, i.e., decision trees, linear discriminant analysis, naive Bayes, and k-nearest neighbor (
k-NN). The system was implemented, and the results showed 62, 61, 73, and 9% accuracy for the decision tree, linear discriminant analysis, naive Bayes, and
k-NN, respectively.
In a very recent work [
56], the authors presented a simple and robust line segmentation algorithm for Urdu handwritten and printed text. The proposed algorithm uses a modified header and baseline detection method. The technique relies purely on a pixel-counting approach, which efficiently segments Urdu handwritten and printed text lines and also detects skew. The handwritten and printed Urdu text datasets were generated manually for evaluating the algorithm. The handwritten dataset consisted of 80 pages with 687 handwritten Urdu text lines, and the printed dataset consisted of 48 pages with 495 printed text lines. The algorithm performed significantly well on printed documents and on handwritten Urdu text documents with well-separated lines, and moderately well on documents containing overlapping words.
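The pixel-counting idea behind such line segmentation can be sketched with a horizontal projection profile: the ink pixels in each row are counted, and consecutive rows with ink are grouped into one text line. This is only the core counting step under simplifying assumptions (a binarized, deskewed page with 1 for ink and 0 for background); it does not reproduce the modified header and baseline detection or the skew handling of [56]. The function name `segment_lines` and the threshold `min_ink` are hypothetical.

```python
def segment_lines(image, min_ink=1):
    """Return (first_row, last_row) pairs, inclusive, for each run of inked rows.

    image is a list of rows, each a list of 0/1 values where 1 marks an ink pixel.
    """
    lines = []
    start = None
    for r, row in enumerate(image):
        ink = sum(row)  # number of ink pixels in this row
        if ink >= min_ink and start is None:
            start = r  # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, r - 1))  # the text line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(image) - 1))  # line runs to the bottom of the page
    return lines


# Hypothetical 6x4 binary page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
text_lines = segment_lines(page)
```

A plain projection profile fails when descenders of one line touch ascenders of the next, which is consistent with the weaker results reported above for documents containing overlapping words.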
The literature on Urdu text recognition at the character level indicates that ANNs have outperformed other machine learning approaches. ANN-based character recognition has proved applicable not only to Latin script but also to handwritten cursive characters of Arabic-based scripts. We present a novel CNN approach to recognizing Urdu handwritten characters that embeds both pixel-based and geometrical features. The geometrical features were extracted for each text image using a hybrid of connected-components labeling [
57] and the upper-lower profile [
58]. The upper-lower profile works by dividing the image into four columns, detecting the positions of the first and last black pixels in each column, and returning the bounding box that covers the area of interest. The extracted features are then combined with pixel-based features into a feature vector, which is processed by our proposed model (discussed in the subsequent section) to recognize and classify characters using a variable-size test set and an invariant font.
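The upper-lower profile described above can be sketched as follows, assuming a binarized image stored as rows of 0/1 values (1 for a black pixel), four vertical strips of roughly equal width, and "first and last black pixels" read as the topmost and bottommost inked rows in each strip. The function names and the toy glyph are hypothetical; this is an illustration of the idea rather than the exact procedure of [58].

```python
def upper_lower_profile(image, strips=4):
    """For each vertical strip, return (top_row, bottom_row) of its ink, or None if empty."""
    height = len(image)
    width = len(image[0])
    profile = []
    for s in range(strips):
        x0 = s * width // strips
        x1 = (s + 1) * width // strips
        inked_rows = [r for r in range(height) if any(image[r][x0:x1])]
        profile.append((inked_rows[0], inked_rows[-1]) if inked_rows else None)
    return profile


def bounding_box(image):
    """Return (top, left, bottom, right), inclusive, of all ink, or None for a blank image."""
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return (rows[0], cols[0], rows[-1], cols[-1])


# Hypothetical 4x8 binary glyph.
glyph = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
profile = upper_lower_profile(glyph)
box = bounding_box(glyph)
```

The per-strip extents and the bounding box could then be appended to the pixel-based features to form the combined feature vector.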