In the literature, research addressing the issue of synthetic text generation can be classified into two main categories. Top-down approaches are typically based on physical models that simulate the writing process itself [8,9]. Here, script trajectories are seen as the result of character key points, writing speed, character size, and inertia, which together determine the curvature of the handwriting. These approaches focus more on the physical aspects of writing than on the actual synthesis outcome [10].
Bottom-up approaches, on the contrary, model the shape (and possibly the texture) of the handwriting itself. Hence, bottom-up approaches are preferred in the context of image processing tasks such as segmentation or handwriting recognition. They can be further categorized into the generation of new samples of the same level and the concatenation into more complex outcomes, such as words composed of characters or glyphs [11,12]. Some synthesis approaches are restricted to one technique; however, higher synthesis variation and flexibility can be achieved by combining both. A common generation technique is data perturbation, which is performed by adding noise to online or offline samples [13]. Another generation technique is sample fusion, which blends two or more samples to produce new hybrid ones [14,15]. Better statistical relevance can be achieved using model-based generation [16,17]. This initially requires the creation of deformable models from a sufficient number of samples, usually at character level, from which new character representations are generated.
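To make the perturbation technique concrete, the following minimal sketch (our own illustration, not the specific method of [13]) adds smoothed Gaussian noise to the points of an online stroke; smoothing the noise along the trajectory bends the stroke coherently instead of producing jagged point-wise jitter:

```python
import numpy as np

def perturb_stroke(points, sigma=0.8, kernel=5, seed=None):
    """Create a new stroke variant by adding smoothed Gaussian noise
    to the (x, y) points of an online handwriting sample.

    points : (N, 2) array of pen coordinates
    sigma  : noise amplitude in coordinate units
    kernel : moving-average window; smoothing the noise along the
             stroke bends the trajectory instead of jittering it
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=points.shape)
    window = np.ones(kernel) / kernel
    for d in range(2):
        noise[:, d] = np.convolve(noise[:, d], window, mode="same")
    return points + noise

# Three perturbed variants of one toy stroke
stroke = np.array([[0, 0], [1, 2], [2, 3], [4, 3], [5, 1]], float)
variants = [perturb_stroke(stroke, seed=s) for s in range(3)]
```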
Concatenation of handwriting samples into units of higher levels can be done without connecting the samples in the case of Latin-based scripts [11]; proper simulation of cursive handwriting, though, requires at least partial connection. For concatenation, there are approaches that connect offline samples directly [20], and those that use polynomial [21], spline [22], or probabilistic models [23]. Due to the semi-cursive style, connecting is mandatory in the case of Arabic script.
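The spline-based connection idea can be sketched as follows. This is an assumption-laden simplification (a single cubic Bézier segment with an arbitrarily chosen control-point scale of 0.3), not the exact model of [21] or [22]:

```python
import numpy as np

def bezier_connector(p_exit, d_exit, p_entry, d_entry, n=20):
    """Cubic Bezier segment joining two glyphs.

    p_exit, p_entry : end point of the left glyph / entry point of
                      the right glyph, as (x, y) pairs
    d_exit, d_entry : unit tangents at those points; the control
                      points extend them so the connector leaves and
                      enters the glyphs smoothly
    """
    p0, p3 = np.asarray(p_exit, float), np.asarray(p_entry, float)
    span = np.linalg.norm(p3 - p0)
    p1 = p0 + 0.3 * span * np.asarray(d_exit, float)
    p2 = p3 - 0.3 * span * np.asarray(d_entry, float)
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Join a glyph ending at (10, 2) heading right-down to a glyph
# entered from the left at (14, 1).
curve = bezier_connector((10, 2), (1, -0.2), (14, 1), (1, 0))
```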
Systems using the described techniques to synthesize handwriting have been built for different scripts and purposes. Wang et al. [24] proposed a learning-based approach to synthesize cursive handwriting by combining shape and physical models. Thomas et al. [25] proposed a synthetic handwriting method for the generation of CAPTCHA (completely automated public Turing test to tell computers and humans apart) text lines. Gaur et al. [26] synthesized handwritten Hindi, Bengali, and Telugu numerals as well as Hindi words for character and word recognition. Multiple Latin-specific approaches based on polynomial merging functions and Bézier curves are documented in [22].
As for the problem of automatic synthesis of offline handwritten Arabic text, Elarian et al. [4,5] published, to the best of our knowledge, the first research work addressing this problem. They propose a straightforward approach to composing arbitrary Arabic words. The approach starts by generating a finite set of letter images from two different writers, manually segmented from the IFN/ENIT database; two kinds of simple features (a width and a direction feature) are then extracted to serve later as metrics in the concatenation step. Saabni and El-Sana proposed a system to synthesize Pieces of Arabic Words (PAWs) without diacritics from online samples [28]. We proposed a system to generate Arabic letter shapes by Active Shape Models (ASMs) built from offline samples. Subsequently, we developed an approach to render images of Arabic handwritten words, concatenating samples based on online glyphs and using transformations at word level as an optional generation step [29].
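As a rough illustration of such concatenation metrics (our own simplification; the exact features of [4,5] differ), a width feature and a direction feature could be computed from a binarized letter image as follows:

```python
import numpy as np

def width_feature(letter):
    """Width of the ink in a binary letter image (ink pixels = 1)."""
    cols = np.flatnonzero(letter.any(axis=0))
    return int(cols[-1] - cols[0] + 1) if cols.size else 0

def direction_feature(letter, edge_cols=3):
    """Approximate stroke direction at the right (connection) edge:
    the angle of a line fitted through the ink pixels of the last
    few columns (assumes the edge ink is not perfectly vertical)."""
    ys, xs = np.nonzero(letter[:, -edge_cols:])
    if xs.size < 2:
        return 0.0
    slope = np.polyfit(xs, ys, 1)[0]   # dy/dx of the edge ink
    return float(np.arctan(slope))     # direction in radians

# Toy example: a diagonal stroke in a 5x5 binary image
img = np.eye(5, dtype=int)
print(width_feature(img), direction_feature(img))
```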
Handwriting Recognition
Handwriting recognition approaches can be assigned to three main categories with respect to the use of segmentation. The first category contains all approaches that avoid segmentation entirely; such methods are called "holistic" [30,31] (Latin), [1] (Arabic). Under the second category fall all approaches that apply an over-segmentation to the PAW and then follow a merging strategy in order to detect the optimal merging path [32,33]. As an example of those approaches, Ding and Hailong [34] proposed a method in which a tentative over-segmentation is performed on PAWs; the resulting fragments, called "graphemes", are differentiated into three types (main, above, and under graphemes). The segmentation decisions are confirmed by the recognition results of the merged neighboring graphemes; if recognition fails, another merge is tried until recognition succeeds. Alternatively, a Hidden Markov Model (HMM) can be trained to handle segmentation [35]. The disadvantage of such approaches is the possibility of sequence errors and classification faults resulting from the shape similarity between letters and fragments of letters. The third category is what is called "explicit segmentation", in which the exact border of each character in a PAW is to be found. The main features used to identify a character's border are minima points near or above the baseline. Shaik and Ahmed [36] proposed an approach that uses heuristic rules calculated on the vertical histogram of the word image. Though the authors report success with printed text, they report failure cases when a PAW contains problematic letters like Sin (س).
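The recognition-driven merging loop of this second category can be summarized by the following sketch; over_segment and recognize are hypothetical callables, the acceptance threshold is assumed, and the merge helper is a simplification of how real systems re-join graphemes along the baseline:

```python
import numpy as np

def merge(graphemes):
    """Merge neighboring grapheme images by horizontal concatenation
    (a simplification; real systems re-join along the baseline)."""
    return np.hstack(graphemes)

def recognize_paw(paw_image, over_segment, recognize, max_merge=3):
    """Over-segmentation followed by recognition-driven merging,
    in the spirit of the second category above.

    over_segment(paw_image) -> list of grapheme images
    recognize(image)        -> (label, confidence)
    """
    graphemes = over_segment(paw_image)
    labels, i = [], 0
    while i < len(graphemes):
        # Try the single grapheme first, then successively wider merges.
        for j in range(i + 1, min(i + max_merge, len(graphemes)) + 1):
            label, conf = recognize(merge(graphemes[i:j]))
            if conf > 0.5:          # acceptance threshold (assumed)
                labels.append(label)
                i = j
                break
        else:
            labels.append(None)     # rejection marker; skip one grapheme
            i += 1
    return labels
```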
Our segmentation approach also falls under this last category, since we use topological features to identify the character borders. The main problem with this category of segmentation is the variation of shape and topology within the individual classes of handwritten Arabic letters. Feature extraction from the segmented letters [37] is the second important step of any recognition system. A common method to recognize words using explicit segmentation is the Support Vector Machine (SVM) [38,39].
Features for character recognition are often gradient or moment based and calculated for sub-images, resulting in feature vectors with a length of around 100. Shanbehzadeh et al. [37] used moment-like features, calculated for all columns of cells, resulting in a feature vector of length 34. In the approach of Parkins and Nandi [40], a histogram of the 8-neighborhood is used to extract features from each cell. Chergui et al. [41] extract SIFT key points from five cells and perform recognition by key point matching based on the Euclidean distance.
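In the spirit of these cell-based descriptors, the following sketch (assuming binary images; not the exact feature of [40]) splits a character image into a grid of cells and, per cell, histograms which of the 8 neighbor positions around each ink pixel also contain ink:

```python
import numpy as np

# Fixed order of the 8 neighbor offsets defines the histogram bins.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]

def cell_features(img, grid=(4, 4)):
    """Per cell of a binary character image, count over all ink
    pixels which of the 8 neighbor positions also contain ink."""
    img = np.pad(img.astype(bool), 1)       # pad to avoid border checks
    h, w = img.shape[0] - 2, img.shape[1] - 2
    gh, gw = grid
    feats = np.zeros((gh, gw, 8))
    for y in range(h):
        for x in range(w):
            if not img[y + 1, x + 1]:
                continue
            cy, cx = y * gh // h, x * gw // w   # cell of this pixel
            for b, (dy, dx) in enumerate(NEIGHBORS):
                feats[cy, cx, b] += img[y + 1 + dy, x + 1 + dx]
    return feats.ravel()                     # 4 * 4 * 8 = 128 values
```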
A straightforward, well-known approach to interpreting such feature vectors is the k-Nearest Neighbor (k-NN) classifier [42]. k-NNs require no training, but the distance of a sample's features to all training samples has to be computed in order to assign the sample to the class most frequently represented among the k closest training samples. As a consequence, the classification step is quite costly if many training samples are used; hence, clustering techniques may be used to reduce the number of comparisons [43].
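A minimal k-NN sketch over such feature vectors, using plain NumPy and the Euclidean distance (all names and the toy data are ours):

```python
import numpy as np
from collections import Counter

def knn_classify(sample, train_feats, train_labels, k=5):
    """Assign the sample to the class most represented among its
    k nearest training samples (Euclidean distance, no training)."""
    dists = np.linalg.norm(train_feats - sample, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Toy usage: 200 random 100-dimensional feature vectors, two classes
rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 100)), np.tile([0, 1], 100)
print(knn_classify(X[0], X[1:], y[1:], k=5))
```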
SVMs allow classification by separating the feature space with hyperplanes [44]. A regular SVM separates one specific class from the rest; however, solutions such as LIBSVM that handle multiple classes might be more effective. Good character recognition results have also been achieved with Artificial Neural Networks (ANNs). Cireşan et al. use deep convolutional neural networks (DNNs) to recognize digits and characters [45]. Unlike common ANNs such as Multi-Layer Perceptrons (MLPs), the design of DNNs is close to a biological model and therefore complex. Hence, DNNs are optimized for modern GPUs to speed up training. DNNs outperform other ANN-based character classifiers and can further be used for general, image-based classification problems [45].
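For illustration, a multi-class SVM can be set up in a few lines with scikit-learn's SVC, which wraps LIBSVM and resolves the multi-class case internally (via one-vs-one); the data and parameters below are toy assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: three character classes with 100-dimensional features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 100))
               for c in range(3)])
y = np.repeat([0, 1, 2], 50)

# SVC wraps LIBSVM; no manual one-vs-rest setup is needed.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict(X[:5]))        # labels of the first five samples
```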
Isolated-word error correction basically involves the successive steps of error detection, generation of candidate corrections, and ranking of those candidates [46]. Error detection is often solved with n-grams, checking whether a word contains any invalid combinations of n consecutive characters. Comparing the word with a vocabulary, which also defines the candidate corrections, is common, too. The ranking is often performed by a probabilistic estimation of the likelihood of each correction; however, the decision on the final correction is sometimes left to the user.
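A minimal sketch of n-gram-based detection with vocabulary-defined candidates; the ranking below uses a simple string-similarity score as a stand-in for a probabilistic likelihood estimate, and the vocabulary is a toy example:

```python
import difflib

def valid_ngrams(vocabulary, n=2):
    """All length-n character combinations occurring in the vocabulary."""
    return {w[i:i + n] for w in vocabulary for i in range(len(w) - n + 1)}

def detect_and_correct(word, vocabulary, ngrams, n=2):
    """Flag a word containing an n-gram never seen in the vocabulary
    and propose ranked candidate corrections."""
    suspicious = any(word[i:i + n] not in ngrams
                     for i in range(len(word) - n + 1))
    if not suspicious and word in vocabulary:
        return word, []
    # difflib's similarity ratio stands in for a probabilistic
    # likelihood estimate when ranking the candidates.
    return None, difflib.get_close_matches(word, vocabulary, n=3)

vocab = ["kitab", "kataba", "maktab", "maktaba"]    # toy vocabulary
bigrams = valid_ngrams(vocab, n=2)
print(detect_and_correct("kitap", vocab, bigrams))  # (None, ['kitab'])
```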
The rest of the paper is organized as follows. In Section 2, handwriting synthesis is discussed, as shown in Figure 1. We outline the necessary data acquisition steps, including the basic mathematical background of ASMs. Then, we summarize our methods to synthesize Arabic handwriting and present the resulting extension of our IESK-arDB database. In Section 3, a segmentation-based approach for the recognition of handwritten Arabic words is proposed. Thereafter, experimental results are discussed in Section 4, where we use synthetic databases to validate word recognition. Conclusions and future work are presented in Section 5.