Towards Accurate Scene Text Detection with Bidirectional Feature Pyramid Network

Scene text detection, the task of detecting text in real-world images, is a hot research topic in the machine vision community. Most current research is based on anchor boxes; these methods are complex in model design and time-consuming to train. In this paper, we propose a new text detection method based on the Fully Convolutional One-Stage Object Detector (FCOS) that can robustly detect multioriented and multilingual text in natural scene images in a per-pixel prediction fashion. Unlike state-of-the-art text detectors, our proposed detector is anchor-free and does not rely on predefined anchor boxes. To enhance the feature representation ability of FCOS for text detection, we integrate the Bidirectional Feature Pyramid Network (BiFPN), enhancing the model's learning capacity and increasing its receptive field. We demonstrate the superior performance of our method on multioriented (ICDAR-2015, ICDAR-2017 MLT) and horizontal (ICDAR-2013) text detection benchmarks. Our method achieves f-measures of 88.65 and 86.32 on the benchmark datasets ICDAR 2013 and ICDAR 2015, respectively, and 80.75 on the ICDAR-2017 MLT dataset.


Introduction
Scene text detection is both fundamental and challenging in the field of machine vision, and plays a critical role in subsequent text recognition tasks. Current mainstream text detection methods [1][2][3][4][5] rely on predefined anchor boxes to extract high-quality word candidate regions. Despite the success of these methods, they are associated with the following limitations: (1) Text detection results are highly sensitive to the size, orientation, and number of predefined anchor boxes. (2) Due to the variable size, shape, and orientation of text in natural scenes, it is difficult to capture all text instances with predefined anchor boxes. (3) A large number of anchor boxes is required to improve text detection performance, resulting in complex and time-consuming computation. For example, in order to improve the accuracy of DeepText [1], Zhong et al. empirically designed four scales and six aspect ratios, resulting in 24 prior bounding boxes at each sliding position. This number of anchors is 2.6 times more than that of Faster R-CNN [6].
Fully convolutional networks (FCNs) [7] have recently been very successful in dense prediction tasks, including semantic segmentation [8][9][10], keypoint detection [11,12], and depth estimation [13,14]. Several scene text detection methods [15][16][17][18][19] treat text detection as a semantic segmentation problem and use an FCN for pixel-level text prediction. Liao et al. [20] proposed a novel binarization module called differentiable binarization (DB), which enables the segmentation network to set the binarization threshold adaptively, greatly improving text detection performance. Such methods typically move from anchor boxes to anchor-free frameworks by using corner/center points. This improves computational efficiency and generally outperforms anchor box-based text detectors. However, since only coarse text blocks can be detected from the saliency map, a complex postprocessing step is required to extract precise bounding boxes [21]. Recently, Tian et al. [22] proposed the fully convolutional one-stage object detector (FCOS) pipeline to solve this issue and achieved state-of-the-art results in general object detection tasks.
In this paper, we design a simple yet efficient anchor-free method based on FCOS: a one-stage fully convolutional text detection framework with a weighted bidirectional feature pyramid network (BiFPN) for scene text detection in natural images. Our proposed method's architecture is shown in Figure 1. In order to enhance the feature representation ability, we employ EfficientNet as a new backbone network. Experiments demonstrate the superior performance of our proposed approach compared to state-of-the-art methods on the benchmark text detection datasets (ICDAR-2013, ICDAR-2015, ICDAR-2017 MLT).

Related Work
Anchor-based text detector. Text detection methods based on region proposals use a general object detection framework and often employ regression of text boxes to obtain text region information [23]. For example, in [1], the GoogleNet [24] inception structure was employed to improve Faster R-CNN [25]. The resulting inception region proposal network (Inception-RPN) acquired text candidate regions, removed background regions using a text detection network, and voted on the detected overlapping regions to determine the optimal result. Jiang et al. [26] proposed a rotational region convolutional network (R2CNN) to detect arbitrarily oriented text in scene images. A novel connectionist text proposal network (CTPN) was proposed in [27] in order to locate text lines in scene images. In [28], a vertically regressed proposal network (VRPN) was proposed to match text regions using multiple neighboring small anchors. Meanwhile, in [29], Ma et al. presented the rotation region proposal network (RRPN) to detect arbitrarily oriented text. RRPN generates tilted proposals carrying angle information about the text orientation; this angle information is then used to adjust the proposals, and bounding box regression is performed so that the proposals fit the orientation of the text region more accurately.
Previous research has adopted bounding boxes or quadrangles as the text description. For example, the approach presented in [3] was based on the single shot MultiBox detector (SSD) [30] object detection framework, which used a quadrilateral or rotated-rectangle representation to replace the rectangular box. Reference [5] proposed an end-to-end two-stage scene text detection network architecture, named the quadrilateral region proposal network (QRPN), that can accurately locate scene texts with quadrilateral boundaries. In [31], the authors proposed the rotation-sensitive regression detector (RRD) framework to perform classification and regression on different features extracted by two differently designed network branches. Deng et al. [32] proposed a new two-stage algorithm. In the first stage, the method predicts text instance locations by detecting and linking corners instead of traditional anchor points. In the second stage, the authors designed a pooling layer called dual-RoI pooling, which embeds data augmentation inside a regional subnetwork.
Anchor-free text detector. Anchor-free approaches treat text as a distinct object and leverage efficient object detection architectures (e.g., YOLOv1 [33], SSD [30], CornerNet [34], and DenseBox [35]) to detect words or text lines directly from natural images. YOLOv1 [33] does not use anchor boxes, but rather predicts bounding boxes at points close to the center of the object, resulting in low recall. CornerNet [34] is a recently introduced single-stage anchor-free detector that detects bounding box corner pairs and groups them together to form the final detected bounding box. However, CornerNet needs a complex postprocessing procedure to cluster corner pairs that belong to the same instance, and an additional distance metric must be learned for grouping. Another family of anchor-free detectors is based on DenseBox [35] (e.g., UnitBox [36]). These detectors are considered unsuitable for generic object detection due to difficulties in handling overlapping bounding boxes and relatively low recall. FCOS [22] is a single-stage anchor-free detector recently proposed to obtain detection accuracy comparable to traditional anchor-based detectors. Unlike YOLOv1, FCOS utilizes all points in the ground-truth bounding box to predict the bounding box, while detected low-quality bounding boxes are suppressed by the proposed "centerness" branch. In this paper, we introduce a method based on FCOS and integrate the Bidirectional Feature Pyramid Network (BiFPN) into the FCOS framework. Experiments demonstrate the ability of BiFPN to enhance the model learning capacity and increase the receptive field.

Bidirectional Feature Pyramid Network
Mainstream text detection architectures employ pyramid feature combination steps (e.g., the feature pyramid network (FPN)) to enrich features with high-level semantic information. The traditional FPN enriches the feature maps from the final output of a single-path architecture in a top-down manner. Despite its great success, this design is limited by several factors: (1) The top-down path does not combine high-level context with lower-level features in a way that retains both spatial detail and semantic information along the network path. (2) Input features vary in resolution, resulting in inconsistent contributions to the output feature. Tan et al. [37] recently proposed a bidirectional feature pyramid network (BiFPN) that fuses multiscale features for object detection. The framework contains two key modules: cross-scale connections and weighted feature fusion. Unlike the one-way information flow of the traditional top-down FPN, BiFPN includes a bottom-up path aggregation network and an additional edge from the original input to the output node. We employed five levels of feature maps defined as {P3, P4, P5, P6, P7} (Figure 2), with feature levels P3, P4, P5, P6, and P7 having strides of 8, 16, 32, 64, and 128, respectively.
The fast feature fusion of the BiFPN is computed as follows (shown here for the nth level):

P_n^td = Conv( (w1 · P_n^in + w2 · Resize(P_(n+1)^td)) / (w1 + w2 + ε) )

P_n^out = Conv( (w1′ · P_n^in + w2′ · P_n^td + w3′ · Resize(P_(n−1)^out)) / (w1′ + w2′ + w3′ + ε) )

where P_n^td is the intermediate result of the nth level on the top-down path; P_n^out is the output of the nth level on the bottom-up path; Resize is an upsampling or downsampling operation for resolution matching; Conv is a depthwise separable convolution [38] for feature fusion, and here we use weighted normalized feature fusion [37]; w_i ≥ 0 is guaranteed by applying a ReLU to each w_i; and ε = 0.0001.
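As a concrete illustration of the fast normalized fusion rule above (a minimal NumPy sketch, not the authors' implementation; the `Resize` and `Conv` steps are omitted so only the weighted normalization is shown):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-resolution feature maps with learnable non-negative weights.

    Each weight is clamped at zero (the effect of the ReLU in the paper,
    guaranteeing w_i >= 0), then the weighted sum is divided by the total
    weight plus a small eps, per the BiFPN fast fusion rule.
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)
    num = sum(wi * f for wi, f in zip(w, features))
    return num / (w.sum() + eps)

# Two 4x4 feature maps fused with weights (2.0, 1.0):
a = np.ones((4, 4))
b = np.zeros((4, 4))
fused = fast_normalized_fusion([a, b], [2.0, 1.0])  # each entry ~ 2/3
```

In the full BiFPN, this fusion is applied at every node of the top-down and bottom-up paths, followed by a depthwise separable convolution.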

FCOS for Text Detection
The majority of state-of-the-art text detectors, such as DeepText [1], TextBoxes [2], TextBoxes++ [3], and ABCNet [39], use predefined anchor boxes, which require elaborate parameter tuning and complex box IoU calculations during training. With no anchor box, FCOS [22] directly predicts a 4D vector and a class label for each spatial location on the feature map. The 4D vector describes the relative offsets (l, t, r, and b) from a location to the four sides (left, top, right, and bottom) of the bounding box (Figure 3). Each ground-truth bounding box Bi is described by the coordinates of its left-top and right-bottom corners. We can map each location p(x,y) on feature map Fi with stride s back onto the input image as (⌊s/2⌋ + xs, ⌊s/2⌋ + ys), which is close to the center of the receptive field of location p. Anchor-based text detectors consider the location on the input image as the anchor box center and regress the target bounding box with these anchor boxes as references. In contrast, following [19], we treated the location itself as a training sample instead of an anchor box and regressed the target bounding box at the location. In our framework, p was treated as a positive sample if it fell into any ground-truth box. In addition to the classification label, we also defined the 4D real vector t* = (l*, t*, r*, b*) as the regression target at the location, where l*, t*, r*, b* are the distances from the location to the four edges of the bounding box (as shown in Figure 3). We simply selected the bounding box with the minimal area as the regression target. More specifically, if location (x,y) was associated with bounding box Bi, the training regression targets for the location can be determined as Equation (2).
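The per-location target assignment described above can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the function and variable names are our own):

```python
def regression_targets(x, y, box):
    """FCOS-style 4D regression target (l*, t*, r*, b*) for a location.

    `box` is a ground-truth box (x0, y0, x1, y1); the targets are the
    distances from location (x, y) to the box's four edges.
    """
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)

def map_to_input(xf, yf, s):
    """Map feature-map location (xf, yf) at stride s back to the input
    image, landing near the center of the location's receptive field."""
    return (s // 2 + xf * s, s // 2 + yf * s)
```

A location is a positive sample only when all four targets are non-negative, i.e., when it falls inside the ground-truth box; among overlapping boxes, the one with minimal area is chosen as the target.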
After the feature extraction backbone, the anchor-free text detection head predicted text locations in natural images. Figure 4 presents the network architecture of the text box detection network. Similar to [40], the features from the backbone network were fed into three convolutional layers for the final text/nontext classification and quadrilateral bounding box regression branches. Note that the proposed method has at least 9× fewer network output variables than popular anchor-based text detectors [1,3] that use preset anchor boxes. Following [22], we also employed the centerness branch to suppress low-quality predicted text bounding boxes.
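The centerness score used to suppress low-quality boxes can be written directly from the regression targets, as defined in FCOS [22] (a minimal sketch for illustration):

```python
import math

def centerness(l, t, r, b):
    """FCOS centerness for regression targets (l*, t*, r*, b*).

    Equals 1 when the location sits at the box center and decays toward 0
    near the edges; at inference it down-weights classification scores of
    boxes predicted far from the center.
    """
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

Multiplying this score by the classification score before non-maximum suppression filters out off-center, low-quality detections.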

Datasets
We evaluated our proposed method on several standard benchmarks, including ICDAR 2013 (IC13) [41] for horizontal text detection, ICDAR 2015 (IC15) [42] for multioriented text detection, and ICDAR 2017 MLT (MLT17) [43] for multilingual text detection. IC13 [41] inherits from ICDAR 2003 [44], with 229 and 233 natural images for training and testing, respectively. IC15 [42], the first incidental scene text dataset, was built for the Incidental Scene Text challenge of the ICDAR-2015 Robust Reading Competition and contains 1000 images for training and 500 images for testing. Its 17,548 text instances (annotated by the 4 vertices of a quadrangle) are usually skewed or blurred, since they were acquired without prior preference or intention. IC15 provides word-level English annotations. MLT17 [43] consists of 18,000 images containing text in 9 languages: Arabic, Bengali, Chinese, English, French, German, Italian, Japanese, and Korean. A total of 9000 images were used for training the model (7200 for training and 1800 for validation), and the other half for testing.
In order to compare with the state-of-the-art methods, we performed the comparison on three popular public datasets. Specifically, we used the official evaluation tools for ICDAR 2013 and ICDAR 2015, while for ICDAR 2017 we used the evaluation tools provided by the authors [45].

Implementation Details
We used EfficientNet-B1 [46] as the backbone network for our proposed model, with the hyperparameters following those of EfficientDet [37]. In particular, our network was trained using stochastic gradient descent (SGD) for 80 K iterations with an initial learning rate of 0.01 and a minibatch of 16 images. The learning rate was reduced by a factor of 10 at iterations 50 K and 70 K. Furthermore, the weight decay and momentum were set to 0.0005 and 0.9, respectively. We used weights pretrained on ImageNet [47] to initialize our backbone network, while the newly added layers were initialized with random weights drawn from a Gaussian distribution with mean 0 and standard deviation 0.01. For the ICDAR-2017 MLT dataset, we used the training and validation data (i.e., 9000 training images), while for both ICDAR 2013 and ICDAR 2015, we employed the model pretrained on ICDAR-2017 MLT and finetuned it on the provided training images. We implemented our method in PyTorch and performed the training on a TITAN RTX GPU system.
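The step learning-rate schedule described above (0.01, divided by 10 at 50 K and 70 K iterations) can be sketched as a small helper (our own illustration; in PyTorch this corresponds to a MultiStepLR-style schedule):

```python
def learning_rate(iteration, base_lr=0.01, steps=(50_000, 70_000), gamma=0.1):
    """Step schedule: start at base_lr, multiply by gamma at each step
    boundary that has been passed."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma
    return lr
```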
Loss Function: The loss function employed during training is defined as follows:

L = L_cls + L_box + L_center

where the classification loss L_cls, box regression loss L_box, and centerness loss L_center are the same as those in [22].

Results and Comparison
We compared the performance of our approach with that of the state-of-the-art methods on the ICDAR 2013 (IC13), ICDAR 2015 (IC15), and ICDAR 2017 MLT (MLT17) benchmark datasets (Tables 1-3). Our proposed method achieved better results than the other methods on these three challenging text detection benchmarks. Figure 5 depicts the qualitative detection results. Our framework had the following advantages.
• Scene text detection was designed as a proposal-free and anchor-free pipeline, which did not require the manual design or heuristic adjustment of anchor boxes, reducing the number of parameters and simplifying the training process.
• Compared to anchor box-based methods, our one-stage text detector avoided the use of RPN networks and IoU-based proposal filtering, greatly reducing the computation.
• As a result, our text detection framework was simpler and more efficient. Our framework could be easily extended to other vision tasks, providing a new solution for the detection stage of scene text recognition.
BiFPN is better than FPN. We compared BiFPN with FPN on the three benchmark datasets. As shown in Tables 1-3, integrating BiFPN into the proposed approach improved the F-measure by 2.7%, 2.2%, and 2.5% over the same approach with the traditional FPN. In our experiments, we observed that BiFPN had better feature extraction ability than FPN. The main reason is BiFPN's two-way feature fusion, which better preserves text features.

Conclusions
In the current paper, we proposed a new FCOS-based text detection approach: an anchor-free and proposal-free one-stage text detector. Our method can robustly detect text in natural scene images and is simpler, more efficient, and more scalable than anchor-based methods. Moreover, we demonstrated that integrating the bidirectional feature pyramid network (BiFPN) into FCOS significantly enhances its feature representation, effectively improving text detection in natural scene images. Our proposed method achieved better results than other state-of-the-art methods on three challenging text detection benchmarks (ICDAR 2013, ICDAR 2015, and ICDAR 2017 MLT). In the future, we will continue to focus on feature fusion methods to further improve detection capability. In addition, due to the simplicity of our framework, we are interested in extending it to scene text recognition tasks.