Amharic is the second most widely spoken Semitic language in the world after Arabic [1]. It is an official working language of the Federal Democratic Republic of Ethiopia and is spoken by more than 50 million people as their mother tongue and by over 100 million as a second language in the country [2]. In addition to Ethiopia, it is also spoken in other countries, such as Eritrea, the USA, Israel, Sweden, Somalia, and Djibouti [1].
Dating back to the 12th century, many historical and literary documents in Ethiopia have been written and preserved in Amharic script. Amharic uses a syllabic writing system derived from the ancient Geez script, and it has been used extensively in government and non-government sectors in Ethiopia to this day. Amharic retained all of the Geez symbols and added new ones that represent sounds not found in Geez [5].
The Amharic script contains about 317 distinct symbols, including 238 core characters, 50 labialized characters, 9 punctuation marks, and 20 numerals, which are written and read from left to right, as in English [1]. All vowel and labialized characters in the Amharic script are derived, with small changes, from the 34 consonant characters. The change involves modifying the structure of a character by adding a straight line or by shortening and/or elongating one of its main legs. It also includes the addition of small diacritics, such as strokes or loops/circles, to the right, left, top, or bottom of the character.
Because of these small modifications to the consonants, many Amharic characters have similar shapes, which can make recognition difficult for machines as well as humans [7]. These features are particularly interesting for research on character recognition because a small change in the basic physical features may affect the orthographic identity of a letter. The shapes and structural formations of consonant Amharic characters, with their corresponding vowel and labialized variants, are depicted in Figure 1.
Optical Character Recognition (OCR) applications have been used for decades to digitize historical and modern documents, including books, newspapers, magazines, and cultural and religious archives written in different scripts. Intensive work on document image analysis has been carried out for many scripts with good recognition accuracy, and most of these scripts even have commercial off-the-shelf OCR applications. As a result, many researchers consider the OCR problem solved. However, OCR gives good results only for very specific use cases, and there are still many indigenous scripts, such as Amharic, that are underrepresented in natural language processing (NLP) and document image analysis [8]. Until recently, OCR for Amharic script remained relatively unexplored, and it is still challenging [9].
Nowadays, Amharic document processing and preservation receives much attention from researchers in computing, linguistics, and social science [3]. In recent years, various models and algorithms for image pattern recognition have been proposed, and solutions demonstrating ground-breaking performance have advanced rapidly [14].
The first work on Amharic OCR was done by Worku in 1997 [15]. Since then, attempts at Amharic OCR have been made using different machine learning techniques; here, we cover the main techniques considered by different researchers. Worku [15] used a tree classification scheme based on the topological features of a character, but it was only able to recognize Amharic characters written in the Washera font at a 12-point size.
Other attempts followed, covering typewritten [16], machine-printed [1], and Amharic braille document image recognition [17], Geez script written on vellum [18], Amharic document image recognition and retrieval [19], numeral recognition [20], and handwriting recognition [21]. However, all of these works applied statistical machine learning techniques to limited private datasets. In addition, recognition was performed at the character level, which is time consuming and may not achieve good recognition accuracy, since each character image depends directly on the behavior of the segmentation algorithm [22].
Following the success of deep learning, further attempts at Amharic character recognition have been made using Convolutional Neural Networks (CNNs) [6]. Nearly all of these attempts segmented text images down to the character level, which directly affects OCR performance. The only exceptions are Assabie [21] and recently published works [9]. Assabie [21] proposed a Hidden Markov Model (HMM)-based model for offline handwritten Amharic word recognition without character segmentation, using the structural features of characters as the building blocks of the recognition system, while Belay [9] and Addis [24] proposed Long Short-Term Memory (LSTM) networks together with Connectionist Temporal Classification (CTC) for Amharic text image recognition. In the literature, attempts at Amharic OCR have neither reported results on large datasets nor considered all the characters used in the Amharic writing system [9].
There are many effective off-the-shelf commercial and open-source OCR applications for many languages, including the functionality of generating ground truth from existing printed texts in order to train further models [25]. The effectiveness of various open-source OCR engines was evaluated on 19th-century Fraktur scripts; the evaluation showed that open-source OCR engines can outperform commercial OCR applications [26].
Based on deep neural networks, many segmentation-based [27] and segmentation-free OCR techniques have been studied, and impressive progress has been made for many Latin and non-Latin scripts, ranging from historical handwritten documents to modern machine-printed texts.
For example, bidirectional Recurrent Neural Networks with Long Short-Term Memory (LSTM) architecture for online handwriting recognition [28], Convolutional Recurrent Neural Networks (CRNNs) for Japanese handwriting recognition [29], Urdu Nastaleeq script recognition using bidirectional LSTM [30], segmentation-free Chinese handwritten text recognition [31], a hybrid Convolutional LSTM network (CLSTM) for text image recognition [32], Multidimensional LSTM (MDLSTM) for Chinese handwriting recognition [33], Connectionist Temporal Classification (CTC) combined with bidirectional LSTM (BLSTM) for unconstrained online handwriting recognition [34], MDLSTM for handwriting recognition [35], online handwritten mathematical expression recognition using a Gated Recurrent Unit (GRU)-based attention mechanism [36], a combination of Convolutional Neural Networks and multidimensional RNNs for Khmer historical handwritten text recognition [37], and a multi-stage HMM-based text recognition system for handwritten Arabic [38] have all been studied.
However, OCR systems for many scripts, especially those indigenous to the African continent, such as Amharic, remain under-researched, and none of the existing work has taken advantage of deep learning techniques, such as end-to-end learning, that have been applied to many other languages. Therefore, in this paper, we propose an end-to-end trainable neural network that combines a Convolutional Neural Network (CNN), a bidirectional LSTM (BLSTM), and Connectionist Temporal Classification (CTC) in a unified framework for Amharic text-line image recognition.
This paper is an extension of our previous work [9], with the following contributions: (1) we propose a CNN-based feature extractor module to automatically extract features from text-line images; (2) we resize the images to a smaller size to reduce the computational cost; (3) we adopt an end-to-end trainable neural network for Amharic text-line image recognition that achieves state-of-the-art results; (4) we use an additional private Amharic dataset to tune the parameters of the feature extractor module; and (5) we present a detailed analysis of the dataset based on the experimental results obtained.
3. Experimental Results
Experiments were conducted using the ADOCR database [9], a public and freely available dataset that contains both printed and synthetically generated Amharic text-line images. Following the network architecture and experimental setup described in Section 2, we implemented our model with the Keras Application Programming Interface (API) on a TensorFlow backend, and the model was trained on a GeForce GTX 1070 GPU.
To select suitable network parameters, different values were considered and tuned during experimentation. The results reported in this paper were obtained using the Adam optimizer with a learning rate of 0.001, a convolutional feature extractor whose feature maps grow from 64 to 512, and a BLSTM network with two hidden layers of 128 units each.
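A minimal Keras sketch of such a configuration is given below. Only the 64-to-512 feature-map range, the two 128-unit BLSTM layers, and the Adam optimizer with a learning rate of 0.001 are taken from the text; the number of convolutional blocks, kernel sizes, pooling scheme, and class count are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 281  # assumption: size of the Amharic label set + 1 CTC blank

def build_model(height=32, width=128):
    image = keras.Input(shape=(height, width, 1), name="image")
    x = image
    # CNN feature extractor: feature maps grow from 64 to 512, as in the
    # text; block count, kernel sizes, and pooling are assumptions.
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 1))(x)  # shrink height only
    # Treat the width axis as the time axis for the recurrent layers.
    x = layers.Permute((2, 1, 3))(x)          # -> (width, height', channels)
    x = layers.Reshape((width, -1))(x)        # -> (time, features)
    # Two BLSTM layers with 128 hidden units each.
    for _ in range(2):
        x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-time-step class probabilities, trained and decoded with CTC.
    probs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(image, probs)

model = build_model()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001))
model.summary()
```

Pooling only along the height axis preserves the width (time) resolution, so the BLSTM sees one feature column per horizontal position of the text line.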
Based on the nature of the dataset, we conducted three experiments. After training our network on the synthetic and some of the printed text-line images, we tested the model on three different test sets. In the first and second experiments, the model was evaluated on synthetic Amharic text-line images generated with the Power Geez and Visual Geez fonts, respectively. The third experiment used a printed test set written in the Power Geez font.
In the original dataset, the images were 48 by 128 pixels. In line with similar work in the area, and to reduce the computational cost of training, we resized the images to 32 by 128 pixels. For validation, we used a randomly selected 7% of the training dataset, as proposed in the original paper [9]. The network was trained for 10 epochs with a batch size of 200.
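The following sketch shows this preprocessing and split; the array names and placeholder data are illustrative, not taken from the ADOCR loader.

```python
import numpy as np
import tensorflow as tf

# Resize the original 48x128 line images to 32x128 to cut compute.
images = np.zeros((1000, 48, 128, 1), dtype=np.float32)   # placeholder data
images = tf.image.resize(images, [32, 128]).numpy()       # -> 32x128

# Keras' validation_split takes the *last* fraction of the data, so we
# shuffle indices ourselves to obtain a random 7% validation set.
idx = np.random.permutation(len(images))
n_val = int(0.07 * len(images))
val_idx, train_idx = idx[:n_val], idx[n_val:]
# model.fit(images[train_idx], labels[train_idx],
#           validation_data=(images[val_idx], labels[val_idx]),
#           epochs=10, batch_size=200)
```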
During testing of the proposed model, character error rates of 1.05% and 3.73% were recorded on the two test sets generated synthetically with the Visual Geez and Power Geez fonts, respectively. The model was also tested on the third test set, consisting of printed text-line images written in the Power Geez font, where a character error rate of 1.59% was obtained. The results recorded during experimentation are summarized in Table 4.
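The paper reports the character error rate (CER) without spelling out its computation; conventionally, CER is the Levenshtein edit distance between the predicted and ground-truth transcriptions divided by the total ground-truth length, as in this sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(predictions, references):
    """Total edit distance over total reference characters, in percent."""
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return 100.0 * edits / chars
```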