Unconstrained Bilingual Scene Text Reading Using Octave as a Feature Extractor
Appl. Sci. 2020, 10, 4474

Featured Application: Potential applications of scene text reading include indexing large picture and video databases by their textual content (for example, Bing Maps, Apple Maps, and Google Street View) and supporting visually impaired people.

Abstract: Reading text from natural images, that is, unified text detection and recognition, is among the most challenging applications in computer vision and document analysis. Previously proposed end-to-end scene text reading methods do not consider the frequency of input images during feature extraction, which slows down the system, requires more memory, and leads to inaccurate recognition. In this paper, we propose an octave convolution (OctConv) feature extractor and a time-restricted attention encoder-decoder module for end-to-end scene text reading. OctConv extracts features by factorizing the input according to its frequency components. It is a direct replacement for vanilla convolutions, orthogonal and complementary to existing methods for reducing redundancy, and it speeds up text reading while lowering memory requirements. In the text reading process, features are first extracted from the input image using a Feature Pyramid Network (FPN) with an OctConv Residual Network of depth 50 (ResNet50). Then, a Region Proposal Network (RPN) is applied to predict the location of the text areas from the extracted features. Finally, a time-restricted attention encoder-decoder module is applied after Region of Interest (RoI) pooling is performed. A bilingual real and synthetic scene text dataset is prepared for training and testing the proposed model. Additionally, well-known datasets including ICDAR2013, ICDAR2015, and Total-Text are used for fine-tuning and for comparing its performance with previously proposed state-of-the-art methods. The proposed model shows promising results on both regular and irregular (curved) text detection and reading tasks.


Introduction
Currently, reading text from natural images is one of the most active research topics in computer vision and document processing. It has many applications, including indexing large picture and video databases by their textual content, as in Bing Maps, Apple Maps, and Google Street View. Moreover, it enables image mining, office automation, and support for the visually impaired. Scene text reading is therefore important for providing consistent services throughout the world. However, reading text from natural images poses several challenges, because text appears in different fonts (color, type, and size) and may be written in more than one script. Moreover, imperfect imaging conditions distort the text, and complex, interfering backgrounds make its appearance unpredictable. As a result, reading or spotting text in a natural image is a challenging task.
The major contributions of the article are summarized as follows:
1. Following [22], we prepare a large synthetically generated bilingual (English and Amharic) scene text dataset. Additionally, we collect a real dataset containing text of different shapes written in the two scripts.
2. Our proposed model extracts features by factorizing them based on their frequencies (low and high), which helps to reduce both storage and computation costs. This also helps each layer gain a larger receptive field to capture more contextual information.
3. The proposed system can detect and read text of arbitrary shape in an image, including oriented, horizontal, and curved text.
4. The performance of the time-restricted attention encoder-decoder module is examined for predicting words from the extracted and segmented features.
5. Using the prepared dataset and well-known datasets, we perform several experiments, and our model shows promising results.
The rest of the paper is organized as follows. Related works are presented in Section 2. In Section 3, we discuss the proposed bilingual end-to-end scene text reading methodology. A short description of the Ethiopic script and of the datasets used for training and evaluating the proposed model is given in Section 4. The experimental set-up and results are discussed in Section 5. Finally, a conclusion is drawn in Section 6.

Related Work
Reading text from a natural image is currently an active field of investigation in computer vision and document analysis. In this section, we introduce related works, including scene text detection, scene text recognition, and text spotting (combining detection and recognition) techniques.

Scene Text Detection
Both traditional machine-learning and deep-learning methods are used to detect text in natural images. In [1,3,23-25], scene text detection methods have been presented to detect and bound text areas in a natural image, but these approaches involve manual computation steps. Lee et al. [25] presented a sliding-window-based method that shifts a window over the image and determines the presence of text from local image features. In [26,27], connected component analysis methods were presented to detect scene text using the Stroke Width Transform (SWT) and Maximally Stable Extremal Regions (MSER), respectively. However, these approaches are limited when it comes to detecting text regions in distorted images.
Recently, deep-learning techniques have advanced several machine-learning problems, including scene text detection and recognition. Tian et al. [1] presented a Connectionist Text Proposal Network (CTPN), which uses a vertical anchor mechanism that jointly predicts the location and text/no-text score of each fixed-width proposal. Shi et al. [14] introduced Segment Linking (SegLink), an oriented scene text detection method that detects segments and then links them into complete instances using a linkage prediction. Ma et al. [28] presented a novel rotation-based framework to detect arbitrarily oriented text in natural images by proposing a rotation region proposal network (RPN) and rotation RoI pooling. A deep direct regression-based method for detecting multi-oriented scene text has been presented in [29]. The Efficient and Accurate Scene Text detector (EAST) [5] has been introduced to effectively detect words or text lines using a single neural network.

Scene Text Recognition
In the text-reading pipeline for natural images, text recognition is the second phase, after scene text detection. It can be implemented independently or after the scene text detection phase. In the scene text recognition phase, cropped text regions are fed either from the scene text detection phase or from a prepared input dataset, and the corresponding label sequences are decoded. Earlier attempts detected individual characters and then refined misclassified characters. Such methods require training a strong character detector to accurately detect and crop each character out of the original word. These methods are more difficult to apply to Ethiopic script due to its complexity. Apart from character-level methods, word recognition [12], sequence-to-label [30], and sequence-to-sequence [31] methods have been presented. Liu et al. [32] and Shi et al. [15] presented spatial attention mechanisms that transform a distorted text region from an irregular input image into a canonical pose suitable for recognition. However, the performance of both the detection and recognition tasks depends on the extracted features. Previously proposed feature extraction methods for scene text detection and recognition, whether based on deep learning or on conventional machine learning, do not consider the frequency of the input image. Following [21], in this paper, we propose an OctConv ResNet-50 feature extractor, which extracts features by factorizing them based on their frequencies.

Scene Text Spotting
Recently, several end-to-end scene text spotting methods have been introduced and have shown remarkable results compared to independent scene text detection and recognition approaches. For instance, Li et al. [10] introduced an end-to-end text spotting technique for natural images using an RPN as a text detector and an attention Long Short-Term Memory (LSTM) network as a text recognizer. Liao et al. [8] presented an end-to-end scene text-reading method using the Single Shot Detector (SSD) [33] and a convolutional recurrent neural network (CRNN) for scene text detection and recognition, respectively. Liu et al. [34] introduced a unified network to detect and recognize multi-oriented scene text in natural images. Lundgren et al. [35] introduced an octave-based fully convolutional neural network with fewer layers and parameters to precisely detect multilingual scene text. The most recently proposed scene text-reading models are summarized in Table 1. Improving the feature extraction and recognition networks improves scene text detection, recognition, and text spotting. In [21], an OctConv feature extraction method was proposed for object detection and improved its performance. Octave convolution addresses spatial redundancy, which was not addressed in previously proposed methods. OctConv does not change the connectivity between feature maps, which distinguishes it from inception multi-path designs [36,37]. In our proposed bilingual text-reading method, we replace the vanilla convolutions of ResNet-50 with OctConv, which operates quickly and produces accurate results in feature extraction. As stated in [38], the limitations of Connectionist Temporal Classification (CTC), attention encoder-decoder, and hybrid (CTC and attention) methods are alleviated by a time-restricted self-attention method for automatic speech recognition.
In our proposed method, we integrate a time-restricted self-attention encoder-decoder module for recognition with feature extraction and bounding box detection layers.

Methodology
In this section, the details of the proposed bilingual scene text-reading model are presented. The architecture of the model, shown in Figure 1, is trained in an end-to-end manner that concurrently detects and recognizes words from a natural image.

Overall View of the Architecture
Our proposed architecture follows the architectures presented in [9,21]. It has three functional components: a feature-extraction layer, a text/non-text detection layer, and a recognition layer. In the feature-extraction layer, features are extracted from input natural images using an FPN [20] with ResNet-50 [39], in which the vanilla convolutions are replaced with octave convolutions, and passed to the next layer. Then, using the features extracted in the first layer as input, a region proposal network (RPN) [40] predicts text/non-text areas and the bounding box of each text area. Finally, by applying RoI to the outputs of the second layer, text segmentation and word prediction are performed using the time-restricted self-attention encoder-decoder module. Details of each layer are presented below.

Feature Extraction Layer
Feature extraction is one of the crucial steps in machine learning problems. In the deep learning era, several automatic feature extraction methods have been proposed, including [40-43]. These feature extraction methods were applied to several problem domains and produced good results. Recently, Chen et al. [21] proposed an OctConv method that extracts features based on their frequencies. We use Chen et al.'s feature extraction method to detect text/non-text regions. Text found in natural images varies in size, orientation, shape, and color. This makes it difficult to detect text/non-text regions accurately, which directly affects the performance of the recognition task. To overcome this challenge, we build high-level semantic feature maps using an FPN with ResNet-50. Different from [9], in our proposed feature extraction layer, we replace vanilla convolutions with OctConv. This factorizes the mixed feature map tensor into high- and low-frequency maps, where the high-frequency feature map tensors encode fine details and the low-frequency feature map tensors encode global structures. Compared to vanilla convolution, OctConv reduces spatial redundancy, memory cost, and computation cost.
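For given spatial dimensions $w$ and $h$ and $c$ feature maps, the input feature tensor of a convolution layer is $X \in \mathbb{R}^{c \times h \times w}$. In OctConv, the input tensor $X$ is factorized along the channel dimension into a high-frequency feature map $X^H$ and a low-frequency feature map $X^L$. As stated in [21], the factorization into high- and low-frequency feature map tensors is computed as follows:

$$X = \{X^H, X^L\}, \qquad X^H \in \mathbb{R}^{(1-\alpha)c \times h \times w}, \qquad X^L \in \mathbb{R}^{\alpha c \times \frac{h}{2} \times \frac{w}{2}},$$

where $\alpha \in [0, 1)$ is the ratio of channels allocated to the low-frequency part. In the factorization process, fine details are captured by the high-frequency feature maps, whereas the low-frequency feature maps capture information that varies slowly across spatial locations. This process compacts the features and replaces spatially redundant feature maps with maps of different resolutions. A vanilla convolution cannot operate on these feature maps directly, because the high- and low-frequency maps have different resolutions, so an octave convolution is applied instead. The octave convolution enables efficient inter-frequency communication and operates effectively on the low- and high-frequency tensors. For the factorized high ($X^H$) and low ($X^L$) feature tensors, there are corresponding output feature tensors $Y^H$ and $Y^L$, respectively. To obtain each output feature tensor, inter-frequency ($Y^{H \to L}$, $Y^{L \to H}$) and intra-frequency ($Y^{L \to L}$, $Y^{H \to H}$) convolution updates are performed, so that $Y^H = Y^{H \to H} + Y^{L \to H}$ and $Y^L = Y^{L \to L} + Y^{H \to L}$. Each output feature map at location $(p, q)$ is computed using the appropriate kernels ($W^L$ and $W^H$), applying a regular convolution for the intra-frequency update and folding the up-/down-sampling into the index computation for the inter-frequency communication, which removes the need to explicitly compute and store the up-/down-sampled feature maps. Following [21], this reads:

$$Y^H_{p,q} = \sum_{i,j \in \mathcal{N}_k} {W^{H \to H}_{i+\frac{k-1}{2},\, j+\frac{k-1}{2}}}^{\top} X^H_{p+i,\, q+j} + \sum_{i,j \in \mathcal{N}_k} {W^{L \to H}_{i+\frac{k-1}{2},\, j+\frac{k-1}{2}}}^{\top} X^L_{\lfloor p/2 \rfloor + i,\, \lfloor q/2 \rfloor + j},$$

$$Y^L_{p,q} = \sum_{i,j \in \mathcal{N}_k} {W^{L \to L}_{i+\frac{k-1}{2},\, j+\frac{k-1}{2}}}^{\top} X^L_{p+i,\, q+j} + \sum_{i,j \in \mathcal{N}_k} {W^{H \to L}_{i+\frac{k-1}{2},\, j+\frac{k-1}{2}}}^{\top} X^H_{2p+0.5+i,\, 2q+0.5+j},$$

where $k \times k$ is the kernel size, $\mathcal{N}_k = \{(i, j) : i, j \in \{-\frac{k-1}{2}, \ldots, \frac{k-1}{2}\}\}$ is the local neighborhood, $W^H = [W^{H \to H}, W^{L \to H}]$, and $W^L = [W^{L \to L}, W^{H \to L}]$. The recognition performance of the model is improved because OctConv provides a larger receptive field for the low-frequency feature maps. Most commonly, text found in natural images has low frequencies. Compared to vanilla convolution, OctConv effectively enlarges the receptive field of the low-frequency feature maps by a factor of 2.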
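As a small worked example of the factorization described above, the following sketch computes the shapes of the high- and low-frequency tensors and the resulting activation storage relative to a vanilla layer. The function names `octave_split` and `storage_ratio` and the example sizes (256 channels, 64 × 64 maps) are illustrative assumptions, not part of the original implementation; only the split rule and α = 0.25 come from the text.

```python
def octave_split(c, h, w, alpha=0.25):
    """Return the shapes of the high- and low-frequency tensors (X^H, X^L).

    alpha is the fraction of channels assigned to the low-frequency group;
    the low-frequency maps are stored at half the spatial resolution.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    c_low = int(alpha * c)            # channels in the low-frequency group
    c_high = c - c_low                # remaining channels stay at full resolution
    return (c_high, h, w), (c_low, h // 2, w // 2)


def storage_ratio(c, h, w, alpha=0.25):
    """Activation storage of the octave representation relative to c*h*w."""
    high, low = octave_split(c, h, w, alpha)
    count = lambda shape: shape[0] * shape[1] * shape[2]
    return (count(high) + count(low)) / (c * h * w)


# Hypothetical layer size, chosen only for illustration.
high_shape, low_shape = octave_split(256, 64, 64, alpha=0.25)
print(high_shape, low_shape)              # (192, 64, 64) (64, 32, 32)
print(storage_ratio(256, 64, 64, 0.25))   # 0.8125
```

With α = 0.25, a quarter of the channels are kept at a quarter of the spatial size, so the stored activations shrink to 1 − 3α/4 = 81.25% of the vanilla layer in this example.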

Text Region Detection Layer
Using the RPN and taking the extracted feature maps as input, text/non-text regions are detected. Following [9] and [20], we assign anchors at five stages $\{P_2, P_3, P_4, P_5, P_6\}$ with anchor areas $\{32^2, 64^2, 128^2, 256^2, 512^2\}$, respectively. In addition, to handle text of different sizes, aspect ratios $\{0.5, 1, 2\}$ are used at each stage. In this way, text proposal features are generated. These features are further extracted using RoI align [41], which preserves more accurate locations than RoI pooling. Finally, the Fast R-CNN [41] generates precise bounding boxes for the text found in the input natural image. Using soft non-maximum suppression (soft-NMS) [42], we select one bounding box for each text instance that has more than one bounding box.
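The anchor configuration above can be made concrete with a short sketch that enumerates the anchor widths and heights for each pyramid stage. It assumes that the aspect ratio is defined as height divided by width, which the text does not state; the names `ANCHOR_AREAS` and `anchor_sizes` are hypothetical.

```python
import math

# Anchor areas per pyramid stage and aspect ratios, as listed in the text.
ANCHOR_AREAS = {"P2": 32 ** 2, "P3": 64 ** 2, "P4": 128 ** 2,
                "P5": 256 ** 2, "P6": 512 ** 2}
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_sizes(area, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs that all have the given area.

    Assumption: ratio = height / width.
    """
    sizes = []
    for ratio in ratios:
        width = math.sqrt(area / ratio)
        height = width * ratio
        sizes.append((width, height))
    return sizes


anchors = {stage: anchor_sizes(area) for stage, area in ANCHOR_AREAS.items()}
for stage, sizes in anchors.items():
    print(stage, [(round(w, 1), round(h, 1)) for w, h in sizes])
```

This yields three anchors per stage (15 in total), each preserving the area assigned to its stage.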

Segmentation and Recognition Layer
After text is detected at the detection layer, text segmentation and word recognition are performed. Text instance regions are segmented by applying four consecutive convolution layers with 3 × 3 filters and deconvolution layers with 2 × 2 filters and strides to the RoI align features of the previous layer, together with the predicted bounding boxes. Finally, the segmented text instance features $x = (x_1, x_2, \ldots, x_T)$ are fed to a time-restricted self-attention encoder-decoder module.
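In [43], a time-restricted (attention window) self-attention encoder-decoder module is presented for automatic speech recognition; it achieves state-of-the-art results by addressing the limitations of CTC (i.e., the hard alignment problem and the conditional independence constraint) and of the attention encoder-decoder module. Unlike [9], we use a time-restricted self-attention module with a bidirectional Gated Recurrent Unit (GRU) as the encoder and a GRU as the decoder. From the extracted and segmented features, the bidirectional encoder computes the hidden feature vector $h_t$ as follows (written here in the standard GRU form):

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $z_t$, $r_t$, $\tilde{h}_t$, and $h_t$ are the update gate, reset gate, current memory, and final memory at the current time step, respectively. $W$, $U$, and $b$ are parameter matrices and vectors; $\sigma$ and $\tanh$ stand for the sigmoid and hyperbolic tangent functions, respectively; and $\odot$ denotes element-wise multiplication. Using the embedding matrix $W_{emb}$, the hidden vector $h_t$ is converted to the embedding vector $b_t$ as follows:

$$b_t = W_{emb} h_t.$$

By applying linear projections to the embedded vector $b_t$, the query ($q_t$), key ($k_t$), and value ($v_t$) vectors are computed as follows:

$$q_t = Q b_t, \qquad k_t = K b_t, \qquad v_t = V b_t,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively. Based on these results, the attention weight $a_u$ and the attention result $c_u$ are derived as follows (given here in the usual scaled dot-product form, with $\tau$ denoting the half-width of the attention window and $d_k$ the dimension of the key vectors):

$$a_{u,t} = \frac{\exp\left(q_u^{\top} k_t / \sqrt{d_k}\right)}{\sum_{t'=u-\tau}^{u+\tau} \exp\left(q_u^{\top} k_{t'} / \sqrt{d_k}\right)}, \qquad c_u = \sum_{t=u-\tau}^{u+\tau} a_{u,t}\, v_t.$$

To address the conditional independence assumption in CTC, an attention layer is placed before the CTC projection layer. The attention layer output $c_u$, which carries context information, serves as the input of the CTC projection layer at the current time $u$, and the projection $ph_u$ transforms it to a dimension equal to the number of CTC output labels:

$$ph_u = W_{proj}\, c_u + b,$$

where $W_{proj}$ and $b$ are the weight matrix and bias of the CTC projection layer, respectively. Finally, the projected output is optimized with the CTC objective as follows:

$$\mathcal{L}_{CTC} = -\log P(y \mid x) = -\log \sum_{\pi \in B^{-1}(y)} P(\pi \mid x),$$

where $y$ denotes the output label sequence and $\pi$ a frame-level label path. A many-to-one mapping $B$ is defined to determine the correspondence between a set of paths and the output label sequences. The self-attention layer connects all positions with a constant number of sequentially executed operations.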
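To illustrate the two mechanisms described above, the following minimal sketch implements attention restricted to a window around each time step and the many-to-one CTC mapping B. It is not the authors' implementation: the scaled dot-product scoring, the window half-width `tau`, the blank index 0, and the toy input vectors are all assumptions made for the example.

```python
import math


def time_restricted_attention(queries, keys, values, tau):
    """For each step u, attend only to steps t with |t - u| <= tau.

    Returns the context vectors c_u and the attention weights a_u.
    """
    steps = len(queries)
    d_k = len(keys[0])
    contexts, weights = [], []
    for u in range(steps):
        window = range(max(0, u - tau), min(steps - 1, u + tau) + 1)
        scores = [sum(q * k for q, k in zip(queries[u], keys[t])) / math.sqrt(d_k)
                  for t in window]
        top = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        a_u = [e / total for e in exps]         # softmax over the window only
        c_u = [sum(a * values[t][j] for a, t in zip(a_u, window))
               for j in range(len(values[0]))]
        weights.append(a_u)
        contexts.append(c_u)
    return contexts, weights


def ctc_collapse(path, blank=0):
    """Many-to-one mapping B: merge repeated labels, then drop blanks."""
    output, previous = [], None
    for label in path:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


# Toy embedded vectors b_t, used here directly as queries, keys, and values.
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
contexts, weights = time_restricted_attention(b, b, b, tau=1)
print([len(w) for w in weights])                 # [2, 3, 3, 3, 2]
print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))    # [1, 1, 2]
```

The window keeps the cost per output step constant in the sequence length, and the collapse function shows why several frame-level paths map to the same label sequence in the CTC objective.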

Ethiopic Script
Ethiopic script, which is derived from Geez, is one of the most ancient scripts in the world. It is used as a writing system for more than 43 languages, including Amharic, Geez, and Tigrigna. The script has largely been used for Geez and Amharic, which are the liturgical and official languages of Ethiopia, respectively. Amharic is the second most widely spoken Semitic language after Arabic. The script is laid out in a tabular format in which the first column contains the base characters and the other columns contain the vowel forms derived from them by slightly deforming or modifying the base characters. The script has a total of 466 characters, of which 20 are digits, 9 are punctuation marks, and the remaining 437 are parts of the alphabet. Developing a scene text recognition system for Ethiopic script is challenging due to the visually similar characters, especially between base characters and their derived vowel forms, and due to the large number of characters in the script. Furthermore, the lack of training and testing datasets is another limitation in the development of a scene text reading system for Ethiopic script. In this paper, we propose an end-to-end trainable bilingual scene text reading model using FPN, RPN, and time-restricted self-attention CTC.
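The tabular structure of the script is mirrored in its Unicode encoding, which can be used to separate Amharic from English words and to relate a derived form to its base character. The sketch below is an illustration that relies on the layout of the basic Ethiopic Unicode block (U+1200 to U+137F), where each regular consonant row occupies eight consecutive code points; this is a property of Unicode, not something stated in the paper, and the function names are hypothetical. The rule holds for the regular consonant rows; labialized rows use their columns differently.

```python
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F   # basic Ethiopic Unicode block
LAST_ROW_END = 0x1357                           # end of the last regular 8-column row


def is_ethiopic(ch):
    """True if the character lies in the basic Ethiopic block."""
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def base_and_order(ch):
    """Return (base character, zero-based column) of an Ethiopic syllable."""
    code_point = ord(ch)
    if not ETHIOPIC_START <= code_point <= LAST_ROW_END:
        raise ValueError("not a syllable in a regular Ethiopic row")
    order = (code_point - ETHIOPIC_START) % 8   # column within the row
    return chr(code_point - order), order


def script_of(word):
    """Coarse script label for a word in a bilingual Amharic/English dataset."""
    return "ethiopic" if any(is_ethiopic(ch) for ch in word) else "latin/other"


# U+120B (fourth form in the "l" row) maps back to its base U+1208.
print(base_and_order("\u120b") == ("\u1208", 3))   # True
print(script_of("\u1230\u120b\u121d"), script_of("hello"))
```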

Dataset Collection
In any machine learning technique, the dataset plays an important role in training and obtaining a better model. In particular, deep learning methods are more data-hungry than traditional machine learning algorithms. However, preparing a large dataset is challenging, particularly for under-resourced languages. In this paper, we use a synthetically generated scene text dataset and a real scene text dataset for training and testing the proposed model, respectively. Following [12], a bilingual scene text dataset is prepared. A detailed description of synthetic dataset generation and real scene text dataset preparation is provided in the following sections.

Synthetic Scene Text Dataset
To train the proposed model, we use a bilingual scene text dataset, which is generated with a simple modification of the scene text dataset generation technique presented in [12]. The generated scene text images resemble real scene images. This technique is important for obtaining more training data for scripts that do not have prepared real scene text datasets. As far as we know, there is no prepared real scene text dataset for Ethiopic script. Moreover, most text found in natural images is written in two languages (Amharic and English). For this reason, we prepare 500,000 bilingual training samples from 54,735 words (825,080 characters), which were collected from social, political, and governmental websites written in Amharic and English. In the dataset generation process, 72 freely available Ethiopic Unicode fonts and different background images are used, and the font size, rotation along the horizontal line, skew, and thickness parameters are tuned. A sample generated scene image and the statistics of the generated dataset are presented in Figure 2 and Table 2, respectively.

Real Scene Text Dataset
In addition to the synthetic dataset, we collected 1200 benchmark bilingual real scene text images using a photo camera and Google image search. The images were captured from local markets, navigation and traffic signs, banners, billboards, and governmental offices. We also incorporated several office logos, most of which were written in both Amharic and English with curved shapes. In addition to our prepared dataset, we used the Synthetic [22] dataset together with our synthetic dataset to pre-train the proposed model. To refine the pre-trained model and compare its performance with state-of-the-art models, we used the ICDAR2013 [44], ICDAR2015 [40], and Total-Text [45] datasets. The datasets used for the proposed model are summarized in Table 2. Additionally, sample images from the collected datasets are depicted in Figure 3.

Table 2 (only the rows that survive in the source are shown).
Dataset            Language   Total     Training   Testing   Text type
ICDAR2015 [40]     English    1500      1000       500       Regular
Synthetic [22]     English    600,000   -          -         Regular
Total-Text [45]    English    1555      1255       300       Irregular

Experiments and Discussions
The effectiveness of the proposed model was evaluated and compared with state-of-the-art methods by pre-training the proposed model using our synthetically generated dataset and a Synthetic dataset. Finally, the pre-trained model was refined by merging the above-mentioned datasets.

Implementation Details
The proposed model was first pre-trained using our synthetically generated bilingual dataset and Synthetic [22], then fine-tuned using the union of the other real-world datasets indicated in Section 4.2.2. Because real sample images were scarce in the fine-tuning stage, data augmentation and multi-scale training were applied by randomly modifying the brightness, hue, and contrast of the image and by rotating it by an angle between −30 and 30 degrees. Following [9], for multi-scale training, the shorter sides of the input images were randomly resized to one of five scales (600, 800, 1000, 1200, 1400). We used Adam [46] (base learning rate = 0.0001, β1 = 0.9, β2 = 0.999, weight decay = 0) as the optimizer. Following the result of [21], we set α = 0.25, which denotes the ratio of the low-frequency part.
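The multi-scale resizing and rotation sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the training code: the rotation range is taken to be in degrees, the helper names are hypothetical, and the dictionary simply collects the reported Adam settings under the keyword names that `torch.optim.Adam` accepts.

```python
import random

SCALES = (600, 800, 1000, 1200, 1400)   # shorter-side targets from the text
MAX_ROTATION = 30.0                     # rotation range; degrees are assumed

# Reported optimizer settings, keyed like torch.optim.Adam keyword arguments.
ADAM_HYPERPARAMETERS = {"lr": 1e-4, "betas": (0.9, 0.999), "weight_decay": 0.0}


def resize_short_side(width, height, target):
    """Scale (width, height) so that the shorter side equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def sample_training_geometry(width, height, rng=random):
    """Draw one random scale and rotation angle for a training image."""
    target = rng.choice(SCALES)
    angle = rng.uniform(-MAX_ROTATION, MAX_ROTATION)
    return resize_short_side(width, height, target), angle


print(resize_short_side(1280, 720, 600))   # (1067, 600)
size, angle = sample_training_geometry(1280, 720, random.Random(0))
print(size, round(angle, 1))
```

Resizing by the shorter side keeps the aspect ratio of the scene image, so text proportions are preserved across the five scales.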
The experiments on the proposed bilingual scene text reading model were conducted on an Ubuntu machine with an Intel Core i7-7700 (3.60 GHz) CPU, 64 GB of RAM, and a GeForce GTX 1080 Ti GPU with 11,176 MiB of memory. For the implementation, we used Python 3.7 and PyTorch 1.2.

Experiment Results
Throughout our experimental analysis, we evaluated a single model trained in a multilingual setup as explained in Section 3. To improve the performance of the model, we first pre-trained it using the Synthetic dataset [22] and our synthetically generated bilingual dataset, which has a total of 430 characters. Then, we fine-tuned the pre-trained model by combining the above-mentioned real scene text datasets. The text recognition results are reported in an unconstrained setup, that is, without using any predefined lexicon (set of words).
The performance of the trained model was verified using our prepared testing dataset and the well-known ICDAR datasets. As discussed in Section 4.2, the collected images in our dataset contain horizontal, arbitrary, and curved text. Both the detection and recognition results were promising for horizontal, arbitrary, and curved text. The evaluation of scene text detection on our prepared real scene text dataset showed 88.3% Precision (P), 82.4% Recall (R), and 85.25% F1-score (F). On the other hand, the end-to-end scene text-reading experiment showed 80.88% P, 49.01% R, and 61.04% F. The scene text detection performance of the proposed method does not differ much between English and Amharic words. However, in the end-to-end scene text reading task, 63.4% of the errors occurred in the recognition of Amharic words. Some of the incorrectly recognized characters did not have sufficient samples in the real and Synthetic datasets. Sample detection and recognition results are depicted in Figure 4. Most of the detection errors of our proposed method came from false detections in non-text background areas.
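The reported F1-scores follow directly from the reported precision and recall values, since F is the harmonic mean of P and R. The short check below recomputes them; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


detection_f1 = f1_score(88.3, 82.4)       # scene text detection
end_to_end_f1 = f1_score(80.88, 49.01)    # end-to-end text reading
print(round(detection_f1, 2), round(end_to_end_f1, 2))   # 85.25 61.04
```

The large gap between precision and recall in the end-to-end setting is what pulls its F1-score down to 61.04%.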
Appl. Sci. 2020, 10, x FOR PEER REVIEW 10 of 14 Throughout our experimental analysis, we evaluated a single model trained in a multilingual setup as explained in Section 3. To improve the performance of the model, we first pre-trained it using Synthetic dataset [22] and our synthetically generated bilingual dataset which has a total of 430 characters. Then, we fine-tuned the pre-trained model by combining the above-mentioned real scene text datasets. The text recognition results were reported in an unconstrained setup, that is, without using any predefined lexicon (set of words).
In addition to our testing dataset, we evaluated the performance of our proposed model using the ICDAR2013, ICDAR2015, and Total-Text testing datasets, which contain only English texts. The model was fine-tuned jointly for English and Amharic as a single model, not separately for each language. The results of our proposed method and previously proposed methods are shown in Table 3. The experiment showed that our proposed method had a better recognition result on the ICDAR2013 and Total-Text datasets. However, the scene text detection result of our proposed method was close to that of the recently proposed Mask Text Spotter [9]. We used their architecture and implementation code with small modifications to the feature extraction and recognition layers.
From the MaskTextSpotter implementation, we replaced the ResNet-50 feature extractor with an octave-based ResNet-50 feature extractor and replaced the text recognition part with a self-attention encoder-decoder model, whereas the preprocessing and RPN implementations were taken from MaskTextSpotter. In Table 4, we compare the scene text detection results of our proposed method with those of previously proposed methods on the ICDAR2013, ICDAR2015, and Total-Text datasets.

Table 3. F1-score experimental results of the proposed unconstrained scene text reading system compared with previous methods.

| Method | ICDAR2013 | ICDAR2015 | Total-Text |
|---|---|---|---|
| TextProposals+DicNet * [47] | 68.54% | 47.18% | - |
| DeepTextSpotter * [19] | 77.0% | 47.0% | - |
| FOTS * [34] | 84.77% | 65.33% | - |
| TextBoxes * [8] | 84.65% | 51.9% | - |
| E2E-MLT ** [48] | - | 71.4% | - |
| Mask Text Spotter ** [9] | 86 | | |

In the experiments, the proposed bilingual scene text reading method had limitations with small-font scene texts and severely distorted images. Furthermore, because of the large number of characters, the similarities among them, and the limited number of training samples for certain Ethiopic characters, recognition errors occurred at test time. To improve the recognition performance and the scene text reading system in general, it is necessary to prepare more training data containing enough samples for every character.

Conclusions
This paper introduced an end-to-end trainable bilingual (English and Ethiopic) scene text reading system using octave convolution and a time-restricted attention encoder-decoder module. The proposed model has three layers. In the first layer, FPN with ResNet-50 is used as a feature extractor, with vanilla convolution replaced by OctConv. In the second layer, bounding box prediction and text detection are performed using RPN. Finally, text areas are segmented based on the bounding boxes detected in the second layer and recognized using a time-restricted attention encoder-decoder network. To measure the effectiveness of the proposed model, we collected and synthetically generated a bilingual dataset. Additionally, we used the well-known ICDAR2013, ICDAR2015, and Total-Text datasets. On the prepared bilingual dataset, the proposed method achieves F1-scores of 61.04% and 85.25% on scene text reading and scene text detection, respectively. Compared with state-of-the-art methods, the proposed model shows promising results, and it achieves state-of-the-art end-to-end text reading results on ICDAR2013 and Total-Text. However, because of the large number of characters, the similarities among them, and the limited number of training samples for certain Ethiopic characters, recognition errors occurred at test time. To improve the recognition performance of the system, future work should prepare more training data containing enough samples for every character. After the publication of the paper, the implementation code and a link to the prepared dataset will be freely available to researchers at https://github.com/direselign/amh_eng.