Scene Text Detection with Polygon Offsetting and Border Augmentation

Scene text localization is a crucial step in scene text recognition. The major challenges, such as varying sizes and shapes, unpredictable orientations, a wide range of colors and styles, occlusion, and local and global illumination variations, make the problem different from generic object detection. Unlike existing scene text localization methods, we present a segmentation-based text detector that can detect arbitrarily shaped scene text by using polygon offsetting combined with border augmentation. This technique better distinguishes contiguous and arbitrarily shaped text instances from nearby non-text regions. Quantitative experimental results on the public benchmarks ICDAR2015, ICDAR2017-MLT, ICDAR2019-MLT, and Total-Text demonstrate the performance and robustness of the proposed method compared to previous approaches.


Introduction
Automatic scene text localization is a key component of many practical applications, such as instant language translation, autonomous driving, image retrieval, scene text understanding, and scene parsing. Despite its similarity to traditional OCR on scanned documents, scene text is much more challenging due to large variations in text styles, sizes, and orientations, and a wide range of complex backgrounds, which, together with occlusions, make it difficult to locate scene text in images. Therefore, accurate and robust scene text detection remains an interesting research challenge.
With the great success of convolutional neural networks (CNNs) in object detection, instance segmentation, and semantic segmentation, many scene text detectors based on object detection [1][2][3][4][5][6][7] and instance segmentation [8,9] have recently shown promising results. However, these methods can fail in complex cases, such as arbitrarily shaped and curved text, which is difficult to represent with the single rectangle or quadrangle used in generic object detectors, as shown in Figure 1.
As recent developments in pixel labelling problems have gained interest, in this paper we present a semantic segmentation-based text detector which can detect text in various shapes. While semantic segmentation can label text regions, it may fail to distinguish text instances that lie very close together, merging them into a single text instance, as shown in Figure 2. To deal with this problem, in addition to representing text instances with text pixel masks alone, our proposed method also learns the text's outer border and offset masks. The outer border masks represent each text instance boundary, while the offset masks represent the distance between the shrunk text instance polygon border and its original shape. Both greatly help to separate adjacent text instances. In this paper, we present a pipeline semantic segmentation-based text detector with an extended text representation. We first used ResNet-50 [10] combined with the feature pyramid network (FPN) [11] as a backbone to extract features from input images. The features at each scale were combined and up-sampled to the input image's original size. Instead of a single direct upsampling to the original size, we applied consecutive upsampling modules, which improved the overall training stability and output segmentation results. Text instances were independently predicted at each scale using simple connected component analysis. The text border information was also used to ensure a clear cut between text instances. Finally, a simple polygon non-maximum suppression over all detected text instances yielded the final text locations. The experimental results are promising in terms of detection accuracy on the standard test benchmarks, including ICDAR2015 [12], ICDAR2017-MLT [13], ICDAR2019-MLT [14], and Total-Text [15].
The contributions of this paper can be summarized as follows: 1. In addition to the text pixel masks, we employ offset masks and text instance borders to represent text instances, which improves the separation of contiguous text instances. 2. We propose a post-processing pipeline for predicting text instance locations that yields higher accuracy with only a slight impact on inference time. 3. Experimental results show that our proposed method achieves competitive accuracy on standard benchmarks.
The remainder of this paper is organized as follows: Section 2 discusses the previous text detection methods. In Section 3, the proposed method is described, including the text representation, network structure, loss function, and text instance inference details. Section 4 discusses the quantitative experimental results on standard benchmark datasets and the effect of border augmentation. Section 5 draws final conclusions and directions for future work.

Related Work
Text detection is still a popular and active research area in the computer vision field. In this section, we introduce existing scene text detection methods, which can be grouped into three main categories: connected component-based [16][17][18], detection-based [3,4,7,19,20], and semantic segmentation-based methods [9,21,22].
Connected component-based methods: Previous works in scene text detection have been dominated by bottom-up methods, which are usually built on stroke or character detection. Individual characters are detected based on observed scene text characteristics, such as colors, stroke width, enclosure contours, and geometric properties. These properties lead to classic text detection features, such as the Stroke Width Transform [16] and Maximally Stable Extremal Regions (MSER) [17]. The detected characters are then grouped into words or text lines, usually by applying heuristic or learned rules to remove false detections. Nevertheless, connected component-based methods might not be robust in complex scenarios due to uncertain scene conditions, in terms of text distortion, orientation, occlusion, reflection, and noise.
Detection-based methods: Convolutional neural networks (CNNs) have demonstrated strong capability in object detection. Many recent object detection frameworks, such as proposal-based detectors (Faster-RCNN [23], Mask-RCNN [24]) and regression-based detectors (Single Shot MultiBox Detector, SSD [25]; You Only Look Once, YOLO [26]), have shown impressive performance in various practical applications. Both proposal- and regression-based methods produce impressive results in terms of speed and accuracy on many well-known object detection benchmark datasets. However, scene text has a different context from generic objects. More specifically, text differs from generic objects in many respects, such as its varied aspect ratios and non-axis-aligned orientations. These characteristics make it difficult to apply existing object detection algorithms directly.
To handle multi-oriented scene text, R2CNN [3] employed rotatable anchors based on Faster-RCNN. TextBoxes++ [7] modified the convolution kernel shapes and SSD anchor boxes to effectively handle various text aspect ratios, especially long text. LOMO [19] and SPCNET [20] formulated text detection as an instance segmentation problem by using Mask-RCNN as a base to generate both the axis-aligned bounding box and a text segmentation mask, which was able to deal with arbitrary text shapes. EAST [4] directly regressed text quadrangles from CNN features without using the anchor box mechanism.
Semantic segmentation-based methods: Instead of detecting text with character- or word-level bounding boxes, methods in this group aim to label text and non-text regions at the pixel level. PixelLink [9] represents text instances with eight connected text pixel maps and directly infers word-level boxes using simple connected component analysis. TextSnake [21] presented an arbitrary-shape text detector using a text region-based center line together with geometric attribute representations; text lines were reconstructed by a striding algorithm from lists of central text line points. PSENet [22] utilized polygon offsetting and multi-scaled text segmentation maps to separate and detect text. These methods usually vary in the way they express text blobs and the method used to distinguish between text instances.

Proposed Method
This section presents the details of the proposed text detector, including the text representation, network structure, loss function, and text instance inference.

Text Representation
In many previous works, scene texts are typically represented by bounding boxes, which are 2D rectangles containing texts. Some use axis-aligned bounding boxes, aligned with the axes of the coordinate system, whereas others use oriented bounding boxes, which are arbitrarily oriented rectangles. To make the bounding box fit the text regions more accurately, quadrangles are used. However, in some difficult cases they are still not capable of precisely capturing text instances, such as the texts in Figure 2, which are aligned along a curve. To cope with this limitation, some methods have used text masks to represent arbitrarily shaped text instances. Nonetheless, we found that this might not be able to separate very close text instances. Thus, in this work, instead of using only shrunk text masks, we combine the shrunk text masks with offset masks, where the polygon offsetting can be either inward or outward. In addition, to make the network able to capture different text sizes, each original text polygon is offset into multiple scaled polygons based on its area and perimeter. Given a text instance t_i and polygon scaling factor α, the polygon offsetting ratio d_i can be calculated as d_i = A(t_i)(1 − α²)/L(t_i), where A(t_i) and L(t_i) denote the area and perimeter of t_i; α < 1 shrinks the polygon inward (d_i > 0), while α > 1 expands it outward (d_i < 0). The ground truth for each image consists of three components: text masks g_tm, which are filled offset polygons; offset masks g_om, where each polygon area is filled with the value d_i; and outer border masks g_bm, which represent each text instance border. This text representation is illustrated in Figure 3.
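As a concrete illustration, the offsetting ratio can be computed directly from a polygon's vertices. The sketch below is a minimal pure-Python example (the function names are ours, not from the paper) that uses the shoelace formula for the area and assumes the PSENet-style shrink ratio d_i = A(1 − α²)/L described above:

```python
def polygon_area_perimeter(pts):
    """Shoelace area and perimeter of a closed polygon given as [(x, y), ...]."""
    area, perim = 0.0, 0.0
    n = len(pts)
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        area += x1 * y2 - x2 * y1                      # signed shoelace term
        perim += ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    return abs(area) / 2.0, perim

def offset_ratio(pts, alpha):
    """Offset distance d_i = A * (1 - alpha^2) / L; positive shrinks, negative expands."""
    area, perim = polygon_area_perimeter(pts)
    return area * (1.0 - alpha ** 2) / perim
```

For a 100 × 100 square, α = 0.6 gives d_i = 10000 × 0.64 / 400 = 16 pixels of inward offset, while α = 1.25 gives a negative value, i.e., an outward offset. In practice, the offset polygon itself would be generated with a clipping library such as pyclipper.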

Network Structure
In this work, a fully convolutional neural network based on ResNet-50 was used as our core network. To avoid loss of spatial information, we utilized the feature pyramid network (FPN) [11], which has demonstrated a significant ability to handle multi-scaled semantic information in many recent works. Lateral connections are built between deep and shallow feature maps to generate high-quality feature maps from both low-level and high-level semantic features. Since deconvolution can cause checkerboard artifacts on the output text masks, we instead used bilinear interpolation to upsample feature maps to the desired input size. The output from the network contains three branches: text masks, offset masks, and border masks. The overall network structure is shown in Figure 4.
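To illustrate the upsampling choice, the following sketch is a simplified single-channel NumPy version (function names are ours; align-corners sampling is assumed for brevity) that upsamples a feature map with consecutive ×2 bilinear steps rather than one large jump:

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation of a 2-D feature map to (out_h, out_w)."""
    in_h, in_w = x.shape
    # Sample positions in input coordinates (align-corners convention).
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(rows).astype(int); r1 = np.minimum(r0 + 1, in_h - 1)
    c0 = np.floor(cols).astype(int); c1 = np.minimum(c0 + 1, in_w - 1)
    wr = (rows - r0)[:, None]   # vertical interpolation weights
    wc = (cols - c0)[None, :]   # horizontal interpolation weights
    top = x[np.ix_(r0, c0)] * (1 - wc) + x[np.ix_(r0, c1)] * wc
    bot = x[np.ix_(r1, c0)] * (1 - wc) + x[np.ix_(r1, c1)] * wc
    return top * (1 - wr) + bot * wr

def consecutive_upsample(x, target, step=2):
    """Upsample by repeated x2 bilinear steps instead of one direct jump."""
    h, w = x.shape
    while h < target[0] or w < target[1]:
        h, w = min(h * step, target[0]), min(w * step, target[1])
        x = bilinear_resize(x, h, w)
    return x
```

Interpolation is smooth by construction, so it avoids the uneven-overlap pattern that strided deconvolution kernels can produce.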

Loss Function
The output from the network consists of three components: text, offset, and border masks. For the text and border masks, the ratio between text, non-text, and especially border pixels in scene text images is greatly imbalanced, so with a standard binary cross-entropy loss the network tends to put more emphasis on non-text pixels, yielding false detections. Since our problem aims to maximize the overlapping region between the ground truth and the predicted mask, many region-based losses could be applied to cope with this problem, such as weighted cross-entropy [27], Tversky loss [28], and focal loss [29]. However, these losses require parameter tuning to be effective. To address this problem, we utilized the dice loss, which is a non-parametric loss, for both the text mask loss L_tm and the border mask loss L_bm, formulated as:

L_tm = 1 − (2 Σ_{x,y} o_tm(x, y) g_tm(x, y)) / (Σ_{x,y} o_tm(x, y)² + Σ_{x,y} g_tm(x, y)²),
L_bm = 1 − (2 Σ_{x,y} o_bm(x, y) g_bm(x, y)) / (Σ_{x,y} o_bm(x, y)² + Σ_{x,y} g_bm(x, y)²),

where o_tm(x, y), o_bm(x, y), g_tm(x, y), and g_bm(x, y) represent the pixel value at (x, y) on the output text and border masks and their ground truth masks, respectively. For the offset masks, to ensure good training stability, we employed the smooth L1 loss:

L_om = Σ_{x,y} smooth_L1(o_om(x, y) − g_om(x, y)),

where smooth_L1(z) = 0.5z² if |z| < 1, and |z| − 0.5 otherwise. We combined all losses into a multi-task loss L, defined as:

L = λ1 L_tm + λ2 L_bm + λ3 L_om,

where λ1, λ2, and λ3 weigh the importance of the text, border, and offset masks, respectively.
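The loss terms described above can be sketched as follows. This is a minimal NumPy illustration with our own function names; the ε smoothing constant is a standard numerical safeguard, not a value specified in the paper:

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-6):
    """Dice loss: 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2)); robust to class imbalance."""
    inter = np.sum(pred * gt)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred ** 2) + np.sum(gt ** 2) + eps)

def smooth_l1(pred, gt):
    """Smooth L1 (Huber-style) loss, averaged over all pixels."""
    d = np.abs(pred - gt)
    return float(np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def total_loss(o_tm, g_tm, o_bm, g_bm, o_om, g_om, lambdas=(1.0, 1.0, 0.1)):
    """Multi-task loss L = lambda1*L_tm + lambda2*L_bm + lambda3*L_om."""
    return (lambdas[0] * dice_loss(o_tm, g_tm)
            + lambdas[1] * dice_loss(o_bm, g_bm)
            + lambdas[2] * smooth_l1(o_om, g_om))
```

Note that the dice loss depends only on the overlap ratio, so it needs no class-weighting parameter, which is why it suits the heavily imbalanced border masks.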

Text Instance Inference
After the forward pass, the network outputs multi-scaled text, border, and offset masks. We applied thresholding to both the text and border masks. To ensure a clear cut between text instances, border augmentation was used in combination with standard connected component analysis to detect and separate each component.
Border augmentation is a simple and fast operation between the corresponding output text masks o_tm and border masks o_bm, which can be defined as:

o_ba(x, y) = o_tm(x, y) ∧ ¬o_bm(x, y),

where o_ba represents the output border-augmented text masks, i.e., pixels that belong to the thresholded text mask but not to the border mask. We then calculated the text instance score using polygon scoring, defined as:

P(t_i) = (1/N) Σ_{(x,y)∈t_i} o_tm(x, y),

where P and N represent the polygon score and the number of pixels in text instance t_i, respectively. Each text component is restored to its original size by using the offset value v(t_i), calculated from the output offset masks o_om as:

v(t_i) = (1/N) Σ_{(x,y)∈t_i} o_om(x, y).

Given such text polygon candidates with their associated scores, we performed polygon non-maximum suppression to discard overlapping detections, thus obtaining the final set of text instances.
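The inference steps (border augmentation, polygon scoring, and offset averaging) can be sketched as follows. This is a NumPy illustration with our own function names; the 0.5 thresholds are assumptions, as the paper does not specify them:

```python
import numpy as np

def border_augment(o_tm, o_bm, text_thr=0.5, border_thr=0.5):
    """Remove border pixels from the text mask: o_ba = text AND NOT border."""
    return (o_tm >= text_thr) & ~(o_bm >= border_thr)

def polygon_score(o_tm, component_mask):
    """Mean text probability over the pixels of one detected component."""
    n = component_mask.sum()
    return float(o_tm[component_mask].sum() / n) if n else 0.0

def offset_value(o_om, component_mask):
    """Average predicted offset over a component, used to restore its full size."""
    n = component_mask.sum()
    return float(o_om[component_mask].sum() / n) if n else 0.0
```

In the full pipeline, each connected component of the border-augmented mask would then be dilated outward by its offset value and passed to polygon non-maximum suppression.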

Experiments
In order to evaluate the performance of the proposed method, we conducted quantitative tests on standard benchmarks for scene text detection and compared it with existing methods.

Datasets
SynthText [30] is a large-scale, computer-generated dataset containing about 800,000 images. The images were created by fusing natural background images with rendered text. To make the text look more realistic, artificial transformations were applied, such as random fonts, sizes, colors, and orientations. Text instances were annotated at both word and character levels. We utilized this dataset to pre-train our model.

ICDAR2015 [12] first appeared in the 2015 incidental scene text detection robust reading competition. The images in this dataset were taken by Google Glass without taking image quality and viewpoint into consideration, so the dataset contains small, blurred, and multi-oriented text instances. There are 1500 images in total, split into 1000 training and 500 testing images. The text instances are labeled with word-level quadrangles.
ICDAR2017-MLT [13] is a large, multi-lingual scene text dataset. This dataset included 7200 training, 1800 validation, and 9000 testing images, containing text from nine languages. The text instances from this dataset were annotated at word level by using four vertices quadrangles.
ICDAR2019-MLT [14] is the latest multi-lingual scene text dataset. This real-world dataset consisted of 10,000 training and 10,000 testing images containing text from 10 languages. The text instances from this dataset were annotated at word level by using four vertices quadrangles, as in ICDAR2017-MLT.
Total-Text [15] is a dataset which contains both horizontal and multi-oriented text instances. The dataset specially features curved text, which is only occasionally present in other benchmarks. It is split into training and testing sets with 1255 and 300 images, respectively.

Implementation Details
We first trained our model on the SynthText dataset for 1 epoch, then continued to train and fine-tune on the benchmark datasets until the model converged. Stochastic gradient descent (SGD) with momentum was used, with the momentum and weight decay set to 0.9 and 5 × 10^−4, respectively. During training on SynthText, the learning rate was initially set to 10^−3 and then decayed to 10^−4 until the loss was stable. At the beginning, the batch size was set to 1, then increased to 4; we believe this adaptive training batch size can slightly boost model accuracy. From the experiments, the polygon scaling factors α were set to [0.6, 0.75, 0.9, 1.25]. The weights λ1, λ2, and λ3, which balance the importance of the text, border, and offset masks, were set to 1, 1, and 0.1, respectively.
After obtaining the pre-trained weights from SynthText, the model was fine-tuned on the standard benchmarks ICDAR2015, ICDAR2017-MLT, ICDAR2019-MLT, and Total-Text. The learning rate was initially set to 10^−4 and decreased by a factor of 10 every 250 epochs. We set the batch size to 4 and trained the models on 4 × NVIDIA Tesla K80. For the ICDAR datasets, non-readable text regions, labeled as "###", were not used during training. To correct the imbalanced ratio between text and non-text pixels, we adopted Online Hard Example Mining (OHEM) [31] with a ratio of 1:3 for each text scale mask. As data augmentation is crucial to the robustness of the algorithm, we applied photometric distortion, as described in [32]. We also introduced a mosaic data augmentation technique that randomly combines multiple image patches into a new training image, which increases the amount of training data and the robustness of the algorithm. A sample of input and output images is shown in Figure 5. The output representation for each dataset depends on its ground truth representation: for the ICDAR datasets, which represent text instances as quadrangles, we calculated the minimal-area rectangle using the standard OpenCV function to obtain the four-point output; for the arbitrary-shape text dataset, the text polygons and masks were taken as outputs.
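The mosaic augmentation described above can be sketched as follows. This is a minimal NumPy version under our own assumptions (four source images, a random centre split, and a random crop from each source), not the exact implementation used in the experiments:

```python
import numpy as np

def mosaic(images, out_size=640, seed=None):
    """Combine four training images into one mosaic around a random centre point."""
    rng = np.random.default_rng(seed)
    # Pick a split point away from the edges so every quadrant is non-trivial.
    cy, cx = rng.integers(out_size // 4, 3 * out_size // 4, size=2)
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # Quadrant top-left corners and sizes defined by the centre split.
    regions = [(0, 0, cy, cx), (0, cx, cy, out_size - cx),
               (cy, 0, out_size - cy, cx), (cy, cx, out_size - cy, out_size - cx)]
    for img, (y, x, h, w) in zip(images, regions):
        # Crop a random h x w patch from each source image (assumed large enough).
        ih, iw = img.shape[:2]
        y0 = rng.integers(0, ih - h + 1)
        x0 = rng.integers(0, iw - w + 1)
        canvas[y:y + h, x:x + w] = img[y0:y0 + h, x0:x0 + w]
    return canvas
```

In a real training loop the text polygon annotations would be cropped and shifted with the same offsets so the ground truth stays aligned with the mosaic.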

Results
The results were evaluated with the standard evaluation protocol corresponding to each dataset.

Multi-Oriented English Text
We compared our proposed method with previous works on the ICDAR2015 dataset. The model was fine-tuned from the SynthText pre-trained weights for a further 200 epochs. In the testing stage, we scaled the longer side of the image to 1280 pixels while preserving the aspect ratio, using only single-scale testing. From the quantitative results listed in Table 1, our method gives a competitive result in terms of the f-measure. Samples of detection results are shown in Figure 6.

Multi-Oriented and Multi-Language Text
To verify the robustness of our proposed method on multi-language scene text detection, we conducted experiments on the ICDAR2017-MLT and ICDAR2019-MLT datasets. The model weights from SynthText were fine-tuned for 300 epochs on the ICDAR2017-MLT training set and 450 epochs on ICDAR2019-MLT. Since the image sizes in these datasets vary, we resized the longer side of each image to 1280 pixels while preserving the aspect ratio, and tested using only a single scale. The experimental results are shown in Table 1. Our method shows respectable performance compared to other state-of-the-art methods. Samples of detection results on the ICDAR2017-MLT and ICDAR2019-MLT datasets are shown in Figures 7 and 8, respectively.

Multi-Oriented and Curved English Text
We tested our method's ability to detect curved and arbitrarily oriented text on the Total-Text dataset. As in the ICDAR2017-MLT and ICDAR2019-MLT experiments, we started from the SynthText pre-trained weights and fine-tuned on Total-Text for 150 epochs. The experimental results show that our method surpasses other methods in precision, with respectable recall and f-measure. Detailed results are shown in Table 1. Figure 9 shows that our method can detect curved text in various styles, shapes, and orientations.

Speed Analysis
We also conducted a comparative experiment in terms of text detection speed. All experiments were run on an NVIDIA GTX 1080 Ti and an Intel i7-4770K.
As shown in Table 2, our proposed method gives a good balance between detection speed and accuracy. In the experiment, ResNet-50 and ResNet-34 were considered as the feature extraction backbone to trade off speed and accuracy. With the ResNet-34 backbone, our proposed method reaches nearly real-time detection speed.

Border Augmentation
To analyze the adjacent text instance separation capability of our method, we removed the entire border augmentation part and conducted the experiment under the same configurations. As shown in Table 1, border augmentation improves the result on all datasets. A sample result is shown in Figure 10, which compares detections (a) without and (b) with border augmentation, with (c) a close-up of adjacent text instances. As the images show, border augmentation provides clear-cut and accurate text instances.

Conclusions
In this paper, we presented a semantic segmentation-based method for localizing arbitrarily oriented text in natural scene images. By using shared, multi-scaled convolutional features to learn text and offset masks, we were able to effectively pinpoint the exact locations of text instances. The border augmentation mechanism further helps distinguish adjacent text components. Numerical results on standard scene text benchmarks show advantages in speed while preserving acceptable accuracy compared to previously proposed text detectors.
In the future, we will investigate the causes of failed detections and the possibility of building a single, lightweight network for end-to-end scene text localization and recognition.