Amharic OCR: An End-to-End Learning

In this paper, we introduce an end-to-end Amharic text-line image recognition approach based on recurrent neural networks. Amharic is an indigenous Ethiopic script which follows a unique syllabic writing system adopted from the ancient Geez script. This script uses 34 consonant characters with the seven vowel variants of each (called basic characters) and other labialized characters derived by adding diacritical marks and/or removing parts of the basic characters. These diacritics on basic characters are relatively small in size, visually similar, and challenging to distinguish from the derived characters. Motivated by the recent success of end-to-end learning in pattern recognition, we propose a model which integrates a feature extractor, a sequence learner, and a transcriber in a unified framework.


Introduction
Amharic is the second most widely spoken Semitic language in the world after Arabic [1]. It is the official working language of the Federal Democratic Republic of Ethiopia and is spoken by more than 50 million people as their mother tongue and by over 100 million as a second language in the country [2,3]. In addition to Ethiopia, it is also spoken in other countries such as Eritrea, the USA, Israel, Sweden, Somalia, and Djibouti [1,4].
Dating back to the 12th century, many historical and literary documents in Ethiopia have been written and preserved in the Amharic script. Amharic follows a syllabic writing system derived from an ancient script called Geez, and it has been used extensively in all government and non-government sectors in Ethiopia to this day. Amharic took all of the symbols in Geez and added new ones that represent sounds not found in Geez [5].
In the Amharic script, there are about 317 different characters, including 238 core characters, 50 labialized characters, 9 punctuation marks, and 20 numerals, which are written and read, as in English, from left to right [1,6]. All vowel and labialized characters in the Amharic script are derived, with a small change, from the 34 consonant characters. The change involves modifying the structure of these characters by adding a straight line or by shortening and/or elongating one of their main legs. It also includes the addition of small diacritics, such as strokes or loops/circles, to the right, left, top, or bottom of the character.
Due to these small modifications to the consonants, Amharic characters have similar shapes, which may make the recognition task hard for machines as well as humans [7]. These features are particularly interesting in character recognition research because a small change in basic physical features may affect the orthographic identity of a letter. The shapes and structural formations of consonant Amharic characters with their corresponding vowel and labialized variants are depicted in Figure 1.

Optical Character Recognition (OCR) applications have been widely used for decades to digitize various historical and modern documents, including books, newspapers, magazines, and cultural and religious archives written in different scripts. Intensive work on many scripts has been done in the area of document image analysis with good recognition accuracy; most of these scripts even have commercial off-the-shelf OCR applications. As a result, many researchers consider the OCR challenge solved. However, OCR gives good results only for very specific use cases, and there are still multiple indigenous scripts, like Amharic, which are underrepresented in natural language processing (NLP) and document image analysis [8]. Until recently, OCR for the Amharic script remained relatively unexplored, and it is still challenging [9,10].
Nowadays, Amharic document processing and preservation receives much attention from researchers in the fields of computing, linguistics, and social science [3,11-13]. In recent years, various models and algorithms for image pattern recognition have been proposed, and there has been rapid advancement of solutions demonstrating ground-breaking performance [14].
The first work on Amharic OCR was done by Worku in 1997 [15]. Since then, attempts at Amharic OCR have been made employing different machine learning techniques. Here, we cover the various techniques considered by different researchers. A tree classification scheme using the topological features of a character was used by Worku [15]; it was only able to recognize Amharic characters written in the Washera font at a 12-point size.
Then, other attempts were made, such as typewritten [16], machine-printed [1], and Amharic braille document image recognition [17], Geez script written on vellum [18], Amharic document image recognition and retrieval [19], numeral recognition [20], and handwritten recognition [21]. However, all of these works applied statistical machine learning techniques to limited private datasets. In addition, recognition was done at the character level, which is time-consuming and may not achieve good recognition accuracy, since each character image depends directly on the nature of the segmentation algorithm [22].
Following the success of deep learning, other attempts have been made at Amharic character recognition employing Convolutional Neural Networks (CNNs) [6,8,23]. Nearly all of these attempts performed segmentation of text images down to the character level, which directly affects OCR performance. The only exceptions are Assabie [21] and recently published works [9,24]. Assabie [21] proposed a Hidden Markov Model (HMM)-based model for offline handwritten Amharic word recognition without character segmentation, using the structural features of characters as building blocks of the recognition system, while Belay [9] and Addis [24] proposed Long Short-Term Memory (LSTM) networks together with Connectionist Temporal Classification (CTC) for Amharic text image recognition. In the literature, attempts at Amharic OCR have neither shown results on large datasets nor considered all possible characters used in the Amharic writing system [9].
There are many effective off-the-shelf commercial and open-source OCR applications for many languages, including functionality for generating ground truth from existing printed texts in order to train further models [25]. The effectiveness of various open-source OCR engines was evaluated on 19th-century Fraktur scripts; the evaluation shows that open-source OCR engines can outperform commercial OCR applications [26].
Based on deep neural networks, many segmentation-based [27] and segmentation-free OCR techniques have been studied. Impressive progress has been made for many Latin and non-Latin scripts, ranging from historical handwritten documents to modern machine-printed texts.
However, OCR systems for many scripts, especially those indigenous to the African continent, such as Amharic, remain under-researched, and no researchers have yet taken advantage of deep learning techniques, such as the end-to-end learning used for many languages, to develop Amharic OCR. Therefore, in this paper, we propose an end-to-end trainable neural network which integrates a Convolutional Neural Network (CNN), a Bidirectional LSTM (BLSTM), and Connectionist Temporal Classification (CTC) in a unified framework for Amharic text-line image recognition.
This paper is an extension of our previous work [9] with the following contributions: (1) To extract features automatically from text-line images, we propose a CNN-based feature extractor module. (2) To reduce the computational cost, we adjust the images to a smaller size. (3) We adopt an end-to-end trainable neural network for Amharic text-line image recognition that achieves state-of-the-art results. (4) We use an additional private Amharic dataset to tune the parameters of the feature extractor module. (5) Based on the experimental results obtained, a detailed analysis of the dataset is presented.

Material and Methods
A general OCR application starts with dataset preparation, followed by model training. In this section, we present the dataset used for training and model evaluation, the proposed model architecture, and the training scheme.

Datasets
The shortage of datasets is a main challenge in pattern recognition, and it is one of the limiting factors in developing reliable systems for Amharic OCR. Few databases are used in the various works on Amharic OCR reported in the literature. As reported in [16], the authors considered the most frequently used Amharic characters. A later work by Million et al. [1] uses 76,800 character images with different font types and sizes belonging to 231 classes.
Other works on Amharic OCR [15,20,21] reported the use of private databases, but none of them were made publicly available for research purposes. Promising work on Amharic OCR is reported in [9], where the authors employed an LSTM-based neural network for Amharic text-line image recognition; they also introduced a dataset called ADOCR, which is public and freely accessible at http://www.dfki.uni-kl.de/~belay/.
In this paper, we use the ADOCR database introduced in [9]. This dataset contains 337,337 Amharic text-line images written with the Visual Geez and Power Geez fonts using 280 unique Amharic characters and punctuation marks. All images are greyscale and normalized to 48 by 128 pixels, while the maximum string length of the ground-truth text is 32 characters, including true blank spaces.
Of the total text-line images in the dataset, 40,929 are printed text-line images written with the Power Geez font, while 197,484 and 98,924 are synthetic text-line images generated with different levels of degradation using the Power Geez and Visual Geez fonts, respectively. All characters that exist in the test dataset also exist in the training set, but some words in the test dataset do not exist in the training dataset. Sample printed and synthetically generated text-line images from the database are shown in Figure 2. The details of the dataset (text-line images, unique Amharic characters, and punctuation marks) used in this experiment are summarized in Table 1 and Figure 3.
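As a quick sanity check, the three subset sizes quoted above sum exactly to the stated dataset total:

```python
# Subset sizes of the ADOCR dataset as quoted in the text.
printed_power_geez = 40_929      # printed text-lines, Power Geez font
synthetic_power_geez = 197_484   # synthetic text-lines, Power Geez font
synthetic_visual_geez = 98_924   # synthetic text-lines, Visual Geez font

total = printed_power_geez + synthetic_power_geez + synthetic_visual_geez
print(total)  # 337337, matching the stated dataset size
```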

Proposed Model
For sequence recognition problems, the most suitable neural networks are recurrent neural networks (RNNs) [39], while for image-based problems, the most suitable are convolutional neural networks (CNNs) [40]. For OCR problems, it is therefore natural to combine CNNs and RNNs. The overall framework of the proposed approach is shown in Figure 4. In this framework, we employ three modules: the feature extractor, the sequence learner, and the transcriber. All three modules are integrated in a single framework and trained in an end-to-end fashion. The feature extractor module consists of seven convolutional layers, each with a kernel size of 3 × 3 except the topmost one, which has a 2 × 2 kernel; each uses the rectified linear unit (ReLU) activation. It also contains four max-pooling layers, with a pool size of 2 × 2 for the first pooling layer and 2 × 1 for the remaining ones. Strides are fixed to one, and 'same' padding is used in all convolutional layers. The number of feature maps is gradually increased from 64 to 512, while the image size is quickly reduced by spatial pooling to further increase the depth of the network. For an input size N, kernel size K, padding P, and stride S, the output size M of each convolutional layer can be computed as M = (N − K + 2P)/S + 1. In addition, to normalize the output of the previous activation layer and accelerate the training process, batch normalization is applied after the fifth and sixth convolutional layers.
We also use a reshape function to make the output of the convolutional layers compatible with the LSTM layers. The sequence learner module is the middle part of our framework and predicts a sequential output per time-step. It consists of two bidirectional LSTM layers, each with a hidden size of 128 and a dropout rate of 0.25, with a soft-max function on top. The sequential output of each time-step from the LSTM layers is fed into the soft-max layer to obtain a probability distribution over the C possible characters. Finally, transcription to the equivalent characters is done by the CTC layer. The network parameters and configuration of the proposed model are detailed in Tables 2 and 3.
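The layer shapes implied by this configuration can be traced with the convolution output formula M = (N − K + 2P)/S + 1. The sketch below is a plain-Python trace, not the authors' implementation; the exact placement of the 2 × 1 pooling layers and the reshape convention (image width becomes the LSTM time axis) are assumptions based on the description above:

```python
def conv_out(n, k=3, p=1, s=1):
    """Output size of a convolution: M = (N - K + 2P)/S + 1.
    With 'same' padding (p = (k-1)//2) and stride 1 the size is preserved."""
    return (n - k + 2 * p) // s + 1

# Input text-line image: height 32, width 128 (greyscale).
h, w = 32, 128

# 'same' convolutions keep the spatial size, so only pooling changes it.
assert conv_out(32, k=3, p=1, s=1) == 32

# Assumed pooling schedule: one 2x2 pool halves both dimensions,
# then three 2x1 pools halve only the height.
h, w = h // 2, w // 2        # 2x2 pool -> 16 x 64
for _ in range(3):
    h = h // 2               # 2x1 pools -> 8x64, 4x64, 2x64

feature_maps = 512
# Reshaping for the LSTM: width becomes the time axis,
# height x channels becomes the per-step feature vector.
time_steps, features = w, h * feature_maps
print(time_steps, features)  # 64 1024
```

Under these assumptions, a 32 × 128 image yields 64 time-steps, comfortably covering the maximum label length of 32.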

During training, the image passes through the convolutional layers, in which several filters extract features from the input images. After passing through the convolutional layers in sequence, the output is reshaped and connected to a bidirectional LSTM. The output of the LSTM is fed into a soft-max function with n + 1 nodes, where each node corresponds to a label, i.e., a unique character in the ground truth, plus one blank character which takes care of consecutive occurrences of the same character. In our case, since there are 280 unique characters, the soft-max outputs 281 probabilities, including the blank character, at each time-step. We employed a checkpoint strategy which saves the model weights to the same file each time an improvement in validation loss is observed.
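The per-time-step prediction described above is a soft-max over n + 1 = 281 classes (280 characters plus the CTC blank). A minimal, framework-free illustration (the uniform dummy logits are for demonstration only):

```python
import math

NUM_CHARS = 280                 # unique Amharic characters in the dataset
NUM_CLASSES = NUM_CHARS + 1     # plus one CTC blank label

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Dummy logits for one time-step; a real model emits one such vector per step.
logits = [0.0] * NUM_CLASSES
probs = softmax(logits)
assert len(probs) == 281            # one probability per character + blank
assert abs(sum(probs) - 1.0) < 1e-9  # a valid distribution
```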
As explained in the work of Graves et al. [41], CTC is an objective function which adopts a dynamic programming algorithm to directly learn the alignment between input and output sequences. The CTC loss function is then used to train the network. During training, CTC only requires an input sequence and the corresponding transcription. For given training data D, CTC minimizes the negative log-likelihood loss, formulated as Equation (1),
where x = (x_1, x_2, ..., x_T) is the input sequence with length T, z = (z_1, z_2, ..., z_C) is the corresponding target with C ≤ T, and p(z|x) is computed by multiplying the probabilities of the labels along a path π over all time-steps t, as shown in Equation (2),
where t is the time-step and π_t is the label of path π at time t. A target label sequence is obtained from a path π via the reduction function B, which converts a sequence of per-frame soft-max outputs into a label sequence by removing repeated labels and blank (φ) tokens. Taking an example from [9], for a given observation sequence o of length eighteen, o = φaaφmmφhφaaφrrφicφ, the path is mapped to the label sequence l = B(o): collapsing the repeated tokens first gives φaφmφhφaφrφicφ, and removing the blanks then gives 'amharic'.
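The reduction mapping B described above (collapse consecutive repeats, then drop blanks) can be sketched directly; the function name below is illustrative:

```python
BLANK = "φ"  # the CTC blank token, written φ in the text

def reduce_b(path):
    """CTC reduction B: collapse consecutive repeated tokens, then remove blanks."""
    collapsed = []
    prev = None
    for token in path:
        if token != prev:       # keep only the first of a run of repeats
            collapsed.append(token)
        prev = token
    return "".join(t for t in collapsed if t != BLANK)

# The example from the text: an 18-token path reduces to 'amharic'.
print(reduce_b("φaaφmmφhφaaφrrφicφ"))  # amharic
```

Note that blanks between repeated characters (e.g. "aφa") are what allow genuine doubled letters to survive the collapse step.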
The probability of a target sequence z given an input sequence x is the sum of the probabilities of all paths that reduce to this label sequence under B, as formulated in Equation (3).
Once the probability of the label sequence z given an input sequence x is obtained with the CTC forward-backward algorithm proposed in [41], we employ best-path decoding, which is fast and simple: at every time-step, the character with the highest score is selected from the outputs, and the final recognized text is generated using B without segmentation of the input sequence; this is formulated as Equation (4).
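In standard CTC notation, following Graves et al. [41], Equations (1)–(4) take the forms below, where y^t_k denotes the soft-max output for label k at time-step t:

```latex
% (1) CTC objective: negative log-likelihood over the training data D
\mathcal{L}(D) = -\sum_{(x,z)\in D} \ln p(z \mid x)

% (2) Probability of a single alignment path \pi over T time-steps
p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t}

% (3) Label-sequence probability: sum over all paths that reduce to z via B
p(z \mid x) = \sum_{\pi \in B^{-1}(z)} p(\pi \mid x)

% (4) Best-path decoding: most probable label at each time-step, then reduce
\hat{z} = B\!\left(\operatorname*{arg\,max}_{\pi} \prod_{t=1}^{T} y^{t}_{\pi_t}\right)
```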
In all experiments and results reported below, we used the same network architecture, and the performance of the proposed model is described in terms of the Character Error Rate (CER), which is computed by counting the number of characters inserted, substituted, and deleted in each sequence and then dividing by the total number of characters in the ground truth; this is formulated as Equation (5),
where q is the total number of target character labels in the ground truth, P and T are the predicted and ground-truth labels, and D(n, m) is the edit distance between sequences n and m.
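Equation (5) amounts to a standard edit-distance computation. The sketch below is a generic implementation, not the authors' code:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cer(predicted, ground_truth):
    """Character Error Rate: edit distance divided by ground-truth length q."""
    return edit_distance(predicted, ground_truth) / len(ground_truth)

print(round(cer("amhari", "amharic"), 3))  # one missing char over 7 -> 0.143
```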

Experimental Results
Experiments were conducted using the ADOCR database [9], a public and freely available dataset containing both printed and synthetically generated Amharic text-line images. Following the network architecture and experimental setup described in Section 2, we implemented our model with the Keras Application Programming Interface (API) on a TensorFlow backend, and the model was trained on a GeForce GTX 1070 GPU.
To select suitable network parameters, different values were considered and tuned during experimentation. The results reported in this paper were obtained using the Adam optimizer, a convolutional network whose feature maps increase from 64 to 512, a BLSTM network with two hidden layers of size 128 each, and a learning rate of 0.001.
Based on the nature of the dataset, we conducted three experiments. After training our network with the synthetic and some of the printed text-line images, the performance of the model was tested on three different test datasets. In the first and second experiments, the model was evaluated on synthetic Amharic text-line images generated with the Power Geez and Visual Geez fonts, respectively. The third experiment used a printed test dataset written in the Power Geez font.
In the original dataset, the images were 48 by 128 pixels. Following similar work in the area, and to reduce computational costs during training, we resized the images to 32 by 128 pixels. For validation, we used a randomly selected 7% of the training dataset, as proposed in the original paper [9]. The network was trained for 10 epochs with a batch size of 200.
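The 7% random validation hold-out described above can be reproduced with a simple index shuffle; the function and variable names are illustrative, not from the original implementation:

```python
import random

def train_val_split(num_samples, val_fraction=0.07, seed=42):
    """Randomly hold out a fraction of sample indices for validation."""
    indices = list(range(num_samples))
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(indices)
    n_val = int(num_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(10_000)
print(len(train_idx), len(val_idx))  # 9300 700
```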
During testing of the proposed model, character error rates of 1.05% and 3.73% were recorded on the two test datasets generated synthetically with the Visual Geez and Power Geez fonts, respectively. The model was also tested on the third test dataset, containing printed text-line images written in the Power Geez font, and a character error rate of 1.59% was obtained. The results recorded during experimentation are summarized in Table 4.

Discussion and Analysis of the Results
We performed repeated evaluations of our method using the benchmark datasets from the original paper [9], and we also compared it with state-of-the-art methods on both printed and synthetically generated datasets. The details of the analysis and comparisons are presented in the following sections.

Analysis of the Dataset and Results
In this section, a detailed analysis and description of the dataset are presented, followed by the results obtained during experimentation. As depicted in Figure 5, we observed that some of the printed text-line images are not properly aligned with the ground truth due to extra blank spaces between words and/or words merging together during printing. In addition, in some synthetic text-line images, a character at the beginning and/or end of a word is missing, which results in misalignment with the ground truth. To improve recognition performance, it is important to annotate the data manually or to use better data annotation tools.
Among several factors, wrongly annotated Amharic text-line images and characters, as depicted in Figure 5, are the major causes of recognition errors. In general, recognition errors may occur due to misspelled characters, spurious symbols, or lost characters.
The proposed model works better on the synthetic test dataset generated with the Visual Geez font than on the Power Geez synthetic and printed test data. Its generalization on the synthetic dataset generated with the Power Geez font is not as good as on the Visual Geez one. This is mainly because of the significantly larger number of text-line images in that test set and the nature of the training samples: the text-line images and the ground truth are not properly aligned, owing to deformed characters and characters missing at the beginning and/or end of the text-line images during data generation but present in the ground truth. In addition, text-line images generated with the Power Geez font are relatively blurred, resulting in poorer recognition accuracy.
In Figure 6, sample Amharic text-line images and predicted texts containing different types of character errors are shown. For example, as illustrated at the top of Figure 6a, the character marked by the green rectangle in the input image is substituted by another character in the predicted text, while the two characters marked with the red rectangle are deleted. Other character errors are depicted at the top of Figure 6b: the characters from the input image marked with red rectangles are deleted, while the characters in the predicted text written in blue are inserted.

Performance Comparison
As depicted in Figure 7 and Table 5, the performance of the proposed model is improved. Compared to the original paper, the proposed model achieved better recognition performance with a smaller number of epochs. However, the proposed model took longer to train. This is due to the nature of the end-to-end learning approach [42], which incorporates multiple, diverse network layers in a single unified framework. Training time may therefore be further improved by following other concepts such as decomposition [43] or a divide-and-conquer approach. Comparisons between the proposed approach and other attempts on the ADOCR database [9] are listed in Table 5; the proposed model shows better results on all three test datasets. All results are reported as character error rates (CERs), computed from the insertions, deletions, and substitutions of characters in the predicted output using Equation (5).

Conclusions
In this paper, we propose a method for Amharic text-line image recognition. Amharic is an old Semitic language with its own indigenous script and a wealth of historical printed and handwritten documents. However, its script is underrepresented in Natural Language Processing (NLP) due to the lack of extensive research in the area and of annotated datasets. Therefore, in this paper, we present an end-to-end trainable neural network architecture which consists of a CNN (the feature extractor), LSTM (the sequence learner), and CTC (the transcriber) in a unified framework. The proposed model is evaluated on a publicly available Amharic database called ADOCR, and it outperforms the state-of-the-art methods employed on this benchmark dataset [9] by a large margin.
As part of future work, the proposed method will be extended to handwritten Amharic document image recognition. We also plan to develop an OCR system that can recognize more complex Amharic documents, such as historical documents and scene text images.

Figure 1 .
Figure 1. The shapes and structural formations of sample Amharic characters: (a) Basic Amharic characters with the orders of consonant-vowel variants (34 × 7), including the recently introduced Amharic character (ª). Characters in the first column are consonants, and the others are derived variants (vowels) formed from each consonant by adding diacritics and/or removing part of the character, as marked with the red box. (b) Derived/labialized characters. Labialized characters marked with circles are also derived from consonant characters.

Figure 2 .
Figure 2. Sample Amharic text-line images normalized to a size of 32 by 128 pixels: (a) Printed text-line images written with the Power Geez font type. (b) Synthetically generated text-line images with the Visual Geez font type. (c) Synthetically generated text-line images with the Power Geez font type.

Figure 4 .
Figure 4. The proposed model. The network consists of three components. The first is the convolutional layer, which acts as the feature extractor; the second is the recurrent layer, which acts as the sequence learner, plus a soft-max function to predict a label distribution at each time-step. The third is a Connectionist Temporal Classification (CTC) layer, which takes the soft-max prediction at each time-step as input and transcribes these predictions into the final label sequences. All input text-line images are normalized to a fixed size of 32 by 128 pixels, and labels are padded with zeros until they reach the maximum sequence length of 32.

Figure 5 .
Figure 5. Samples with wrongly annotated images and ground truth (GT) from the test dataset. (a) Synthetic text-line image; the word marked with the yellow rectangle is a sample mislabeled word where the first character (µ) in the GT, marked with a red circle, is missing in the input image but exists in the GT. (b) Printed text-line image; a punctuation mark called a full stop/period, bounded by a purple rectangle in the input image, is incorrectly labeled as two other punctuation marks called word separators, indicated by the green and red text colors in the GT.


Figure 6 .
Figure 6. Sample mis-recognized text-line images: (a) Printed text-line images. (b) Synthetically generated text-line images. The small green and red rectangles mark the substituted and deleted characters, respectively, while the blue color indicates inserted characters.

Figure 7 .
Figure 7. Learning loss comparison: (a) The training and validation CTC losses in the original paper [9], recorded for 50 epochs. (b) The training and validation CTC losses of the proposed model, recorded for 10 epochs.

Table 1 .
Details of the text-line images in the dataset [9].
Figure 3. Unique Amharic characters and punctuation marks present in the ground truth of the Amharic OCR database [9].

Table 2 .
Convolutional network layers of the proposed model and their corresponding parameter values for an input image of size 32 × 128 × 1.

Table 3 .
The recurrent network layers of the proposed model with their corresponding parameter values. The input size of the Long Short-Term Memory (LSTM) is the squeezed output of the convolutional layers, which is depicted in Table 2.
* Denotes methods tested on different datasets.