A Straightforward and Efficient Instance-Aware Curved Text Detector

A challenging aspect of scene text detection is to handle curved texts. In order to avoid the tedious manual annotations for training curve text detector, and to overcome the limitation of regression-based text detectors to irregular text, we introduce straightforward and efficient instance-aware curved scene text detector, namely, look more than twice (LOMT), which makes the regression-based text detection results gradually change from loosely bounded box to compact polygon. LOMT mainly composes of curve text shape approximation module and component merging network. The shape approximation module uses a particle swarm optimization-based text shape approximation method (called PSO-TSA) to fine-tune the quadrilateral text detection results to fit the curved text. The component merging network merges incomplete text sub-parts of text instances into more complete polygon through instance awareness, called ICMN. Experiments on five text datasets demonstrate that our method not only achieves excellent performance but also has relatively high speed. Ablation experiments show that PSO-TSA can solve the text’s shape optimization problem efficiently, and ICMN has a satisfactory merger effect.


Introduction
As a crucial premise of text recognition, scene text detection has largely attracted the attention of many academics and industrial researchers, and many promising results have been achieved in recent decades. However, due to existing large differences in size, aspect ratio, direction and shape, as well as the presence of distortion and occlusion, success detection of scene text is still a very challenging task. To deal with these challenges, traditional text detection pipelines [1,2] usually focused on two subtasks: text detection and non-text removal, however, they are limited by hand-crafted features and usually involve heavy post-processing.
In recent years, with the renaissance of convolutional neural networks (CNNs), many deep learning-based methods  have achieved remarkable achievements in text detection, and these methods can be divided into top-down and bottom-up methods. The top-down methods [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20], also commonly referred to as regression-based methods, usually adopt popular object detection pipelines to first detect text on the block level and then break a block into the word or line level if necessary. However, because of the structural limitations of the corresponding CNN models, these methods cannot efficiently handle long text and arbitrarily shaped texts. The bottom-up methods [21][22][23][24][25][26][27][28][29][30][31][32][33][34] first detect text components with a CNN and then group these components into text instances. According to the granularity of the text components, the bottom-up methods are mainly divided into two categories: pixel-level methods and part-level methods. Borrowing from the idea of semantic segmentation, pixel-based methods [21][22][23][24][25][26][27][28][29] produce a text saliency map for text detection by employing an FCN [30] to perform pixel-level text/nontext prediction. By regarding a text region as a group of text components, part-level methods [31][32][33][34] usually first detect individual characters and then group them into words or text lines. Compared with the top-down methods, the bottom-up methods have more flexibility and better detection performance. Unfortunately, however, the training model requires a large number of sample annotations and complex network design, which limits their generalization.
For taking advantage of the high efficiency of top-down methods while avoiding the bottom-up method for large-scale labeling of samples, we introduce an efficient and straightforward curved text detection method, namely, look more than twice (LOMT), which progressively localizes a complete text instance multiple times without heavy learning. First, the text proposals are located in the original image by adopting a direct regression network, and then the candidate characters are extracted from each text proposal by using maximally stable extremal regions (MSER). Next, a particle swarm optimization-based text shape approximator (PSO-TSA) gradually approximates an arbitrary text shape by a specific PSO technique. Finally, one complete text instance is generated by merging two adjacent or intersecting text proposals by an instance-aware component merging network (ICMN).
Our main contributions can be summarized as follows: (1) We propose a text detector, which can accurately locate curved text appearing in the scene through several consecutive straightforward and effective steps. (2) To fine-tune the result of the regression-based text detector, we propose a particle swarm optimization-based text shape approximator called PSO-TSA, which can quickly approach text shapes without heavy pre-training or pre-learning in advance. (3) To improve the text instance completeness, an instance-aware component merging network (ICMN) is designed to merge adjacent text subparts, which can be flexibly adapted to text detection results of any shape. (4) Although the entire pipeline is not an end-to-end mechanism, the experiments on five text benchmark datasets show that our method not only achieves excellent harmonic mean (H-mean) performance but also has relatively high speed.
The rest of this paper is organized as follows. We briefly review some related works on scene text detection in Section 2, followed by describing the details of the proposed method in Section 3. Then, we present experimental results in Section 4. Finally, we conclude and give some perspectives on future work in Section 5.

Related Works
Recently, numerous approaches have been proposed to address different challenges in text detection, such as low resolution, perspective distortion, low contrast, complex background, multi-lingual, multi-view, multi-direction, and arbitrary shape and size. In early works, connected component analysis (CCA) was mainly utilized to detect text in scene images by first extracting candidate text fragments and then filtering non-text candidates. Representative connected-component-based approaches such as the stroke width transform (SWT) [1] and MSER [2] have achieved outstanding performance on various test datasets, particularly for well-focused texts, e.g., ICDAR13 [35]. However, these methods fall short for the more challenging ICDAR15 [36] and MSRA-TD500 [37] due to the limitation of hand-crafted features. With the advance of CNNs, different CNN models have been exploited for scene text detection tasks. In general, CNN-based methods can be roughly formulated into two categories: top-down methods and bottom-up methods.
Top-down methods first detect text on the block level and break a block into the word or line level if necessary, and these methods are often based on general object detection frameworks [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. The Faster Region-based CNN (R-CNN) [3] is the most representative and accurate generic object detection method and has also achieved promising results on text detection tasks [4]. TextBoxes++ [5] adopts irregular 1 × 5 convolutional filters instead of the standard 3 × 3 filters and leverages recognition results to refine the detection results. ITN [6], E2E-MLT [11] and FOTS [18] are end-to-end text instance networks. Liu et al. applied feature pyramid networks (FPNs) in the CNN part to extract features of multiscales, and used the bidirectional long short-term memory (Bi-LSTM) network to generate text proposals [7]. By connecting a region proposal network (RPN) and a regression module (TLOC), Liu et al. [8] proposed a method named the curve text detector (CTD) to detect curved texts. EAST [9], deep regression [20], DeconvNet [10] and E2E-MLT [11] all directly use a fully convolutional network (FCN) to predict the score map for each pixel. Inspired by EAST, Zhang et al. [12] presented a text detector named look more than once (LOMO) that progressively localized text multiple times. SegLink++ [13] detects dense and arbitrarily shaped scene text by instance-aware component grouping. By combining deep learning and histogram-oriented gradient features, NguyenVan et al. [14] designed a pooling-based scene text proposal technique and integrated it into an end-to-end scene text reading system. He et al. [15] proposed multi-scale scene text detection with a scalebased RPN. Zhu et al. [16] presented a shape robust detector with coordinate regression. Wang et al. [17] proposed a progressive scale expansion network (PSENet) to detect text instances with arbitrary shapes. Wang et al. [19] proposed a text region proposal network (text-RPN) and verified and refined it through a refinement network.
The above state-of-the-art regression-based methods have achieved promising results on standard benchmarks. However, most of these methods need a comprehensive design with complex anchors and cumbersome multistage network structure, and suboptimal performance is achieved by exhaustive tuning. Unsatisfactory results often occur when dealing with long text.
Bottom-up methods usually first detect text at the pixel or part level and then segment these components into desired word-or line-level text instances. Borrowing from the idea of semantic segmentation, pixel-based methods [21][22][23][24][25][26][27][28][29] produce a text saliency map by employing an FCN [30] to perform pixel-level text/non-text prediction. Wu et al. [21] proposed self-organized text detection with minimal post-processing via border learning. He et al. [22] presented an end-to-end scene text detection by combining a multiscale FCN with a cascade-style instance segmentation. PixelLink [23] generates an 8-direction margin to separate text lines. Xue et al. [24] exploited bootstrapping and text border semantic techniques for long text localization. Zhang et al. [25] first used an FCN to extract text blocks and then search for character candidates from these blocks with MSER [2]. By fusing the proposed generator technology with the FCN, a framework named FAST [26] was proposed to reduce the number of text proposals while maintaining a high recall in scene text images. Xu et al. [27] presented a text detector named TextField, which directly regresses the direction field. Pixel-level methods mostly need complex post-processing to extract accurate bounding boxes of words or text lines. Liu et al. propose the Mask Tightness Text Detector (Mask TTD) [29], which uses a tightness prior and text frontier learning to enhance pixel-wise masks prediction.
By regarding a text region as a group of text components, part-level methods [30][31][32][33][34] usually first detected components or characters and then grouped them into words or text lines. SPC-Net [30] uses an instance segmentation framework and context information to detect text in arbitrary shapes while suppressing false positives. Lyu et al. [32] combined position-sensitive segmentation with corner detection to generate text instances. Long et al. [33] used ordered disks to represent curved text. Baek et al. [34] developed a framework, referred to as character region awareness for text detection (CRAFT), which explores each character and the affinity between characters. This work is inspired by the idea of LOMO [12], which progressively localizes text multiple times. However, the detection results of LOMO are multiple subsections of a word or text line instead of a single complete text instance, which leads to a certain degree of incompleteness of semantics. Similar to LOMO, by combining the backbone of ResNet50 [38] with FPN [39], our method also integrates the advantages of top-down methods and bottom-up methods. In order to overcome the limitation of regression-based detector to locate arbitrary shape text and to make the located text box to fully contain text while fitting its out-line, we first propose a straightforward and efficient text shape approximation method called PSO-TSA, which is based on particle swarm optimization method. Without prior training and heuristic parameters, PSO-TSA achieves very competitive performance on curved and non-curved datasets. Moreover, a lightweight ICMN is proposed to merge two adjacent or intersecting text subsections in a text line, which can further improve the completeness of a text instance. Compared with most deep learning algorithms, the advantages of our method are very simple and straightforward, PSO-TSA is general to any data set and ICMN can be readily plugged into any CNN-based detector.

Background
A challenging aspect of scene text detection is to handle curved text which is common in natural scenes. The regression-based methods have achieved promising results on standard benchmarks, and the efficiency is also very high. Using regression-based method for text proposal detection is a premise of our method. Here, we choose the classic EAST detector [9] to detect the text proposals appearing in the scene.
By inputting the text image to EAST detector [9], a dense prediction channel of text/non-text is outputted to indicate the pixel-wise confidence of being text or not. The dense proposals from the network are then filtered based on the text confidence score rp with the threshold value 0.9. For further processing in the next step, the pixel proposals with rp ≥ 0.9 are labeled as white pixels, and the rest are labeled as black pixels, which are also considered text pixels and non-text pixels, respectively. The image formed by these white/black pixels is called a text confidence map, marked as the symbol I m . Finally, locality-aware NMS [9] is performed on the filtered proposals to generate candidate quadrilateral text proposals, which are collected into a text proposal set R = r t t∈ [1,T] . Here, r t is the t − th text proposal, and T is the total number of text proposals.
Usually, regression-based text detectors fall quite short of detecting extremely long curved text. In order to overcome the limitation of regression-based detector to irregular text, we introduce an efficient and straightforward mechanism called PSO-TSA, which makes the regression-based detection result gradually change from loosely bounded rectangle box to compact polygon.
In the PSO-TSA method, we need to fine-tune the text proposal detected by EAST. Character extraction is another prerequisite of our method. The accuracy of character extraction in the text proposal largely determines the performance of our method. In order to avoid a large number of sample annotations and complicated network designs, we adopt a simple traditional character extraction method like MSER [2] instead of a deep framework like CRAFT. The reason for choosing MSER is that the MSER algorithm assumes that text components have a similar homogeneous background and foreground and thus are stable with respect to a range of thresholds. Here, an extracted connected component is also considered a candidate character. The candidate characters extracted from the t − th text proposal r t are then combined into a set C t , expressed as C t = c t m |m ∈ [1, M] , in which c t m is the m − th character in t − th text proposal and M is the total number of candidate characters. It should be noted here that a detected connected component does not refer to an individual character in a strict sense; it may be a single character or consecutive multiple characters in a text proposal.

LOMT
As illustrated in Figure 1, the pipeline of our approach consists of text proposal detection, connected components (CCs) extraction, shape approximation and component merging, the first two are prerequisites, and the last two are the main strategies of our algorithm.

PSO-TSA
Inspired by the sociological behavior associated with birds flocking, Kennedy and Eberhart proposed a PSO method [40] to solve a global optimization problem. Each individual in the swarm is called a particle, which represents a candidate position or solution to the problem. By constantly adjusting their positions and speeds with shared information or experience, particles fly through the search space influenced by two factors: one is the individual's best position ever found (pbest); the other is the group's best position (gbest). The PSO algorithm possesses the advantages of simplified computing, rather quick convergence speed, global optimization performance and fewer control parameters, so we apply an improved PSO in this paper to solve the optimal approximation problem of an arbitrary text shape. What needs to be pointed out here is that in addition to the position and velocity properties of the particles, we introduce isometric information into the particle in order to better use the improved PSO algorithm to approximate the shape of the text. To the best of our knowledge, this is the first time that the PSO algorithm has been used in curved text detection. A vivid example is shown in Figure 2 to explain the procedure of the proposed PSO-TSA algorithm, in which the text proposals and candidate characters are extracted from EAST and MSER algorithm, respectively. First, a curve is fitted with all the center points of the candidate characters, which is called the fitted character centerline. Because the accuracy of character detection by MSER algorithm is normally not high enough, this fitted curve is most likely not the real centerline of the curved text, which is also a problem that is usually faced by bottom-up algorithms. In response, points are evenly sampled on the fitted character centerline, which serve as the initial points of the particle, and also as the center point of the sampling neighborhood of the particle's corresponding dimension position. Here, the sampling neighborhood is also the region where the particle position changes, and particle positions in all dimensions are fitted to generate the approximated centerline of the text. Because the particle's initial position points are sampled on the fitted character centerline, the first approximated centerline is the same as the fitted character centerline. However, as the particle's position changes when the particle swarm flies in the sampling neighborhood, the approximated centerline will be farther away from the fitted character centerline and closer to the real centerline of the text such that it eventually coincides with the real text centerline. The polygon, connected in series by equidistant control points moving in the direction perpendicular to the approximated centerline, is the approximate polygon for curved text by PSO-TSA algorithm. By combining the shared foraging flight of the particle swarm and the continuous adjustment of the equidistant points of the corresponding particles according to their distance change, the global optimal particle position gradually approaches the true centerline of the text; in the same instant, the polygon gradually approaches the text shape, and finally, both optima are reached. The specific initialization of particles is described in detail in Algorithm 1. First, we define a particle swarm X = {X 1 , X 2 , · · · , X I }, in which X i represents the i − th particle, i ∈ [1, I], and I is the total number of particles in particle swarm X. Each particle is represented by its position, velocity and distance between its position and equidistant points, namely, Before initializing the particles, we first need to set the particles' range of activity. Assume that N points are uniformly sampled on the character centerline, which are expressed as P 1 , P 2 , · · · , P N . The spatial neighborhood N r1×r1 (P n ) with center point P n and radius r1 is the variation range of the n-th dimension position That is, the position of particle X i is expressed as x n i is initialized as a point P s n sampled in N r1×r1 (P n ), whose x and y coordinates are equal to P s n . x and P s n . y, respectively. The n−th dimensional velocity v n i of X i .v is initialized to (0, 0) T . The distance variable d n i between the n−th position point x n i and its two equidistant points is initialized with a value uniformly sampled in the interval [d 1 , d 2 ]. Algorithm 1. Particle Swarm Initialization.
3. Set the range of equidistant values of particle position points as [d 1 , Randomly sample a point P s n in each N r1×r1 (P n ), and all sampled points form a point set P s = P s 1 , · · · , P s n , · · · , P s N . 6.
Randomly sample N values in interval [d 1 , d 2 ] to form a equidistant set D S = d s 1 , · · · , d s n , · · · , d s N . 7.
x n i = (P s n · x, P s n · y) T , v n i = (0, 0) T and d n i = d s n . 9. end for 10. end for When the number of iterations k does not reach the maximum number of iterations K or the error criterion ∆ε does not reach the minimum ε min , the proposed PSO-TSA algorithm is implemented iteratively to approximate the text shape. Each cyclical process mainly includes the following three steps: Step 1: Two equidistant points y n,1 i and y n,2 i are calculated in the normal direction at X i s n − th position point x n i according to Equations (1) and (2), respectively.
Here, θ n is the angle between the tangent line at point x n i and the horizontal positive half-axis, that is, θ n = arctan(k n ). k n is the tangent slope at the n-th point x n i of particle X i . The isometric points of all position points are connected head to tail to form the corresponding polygon S i of particle X i .
Step 2: The valuating indicator f x i of particle X i is calculated as The fitting value f X i is expressed as the sum of two parts: one corresponds to the ratio of the text aggregation degree in S i and r t , and the other corresponds to the ratio of the number of characters in S i and r t . Each part has a weight of 0.5. Here, A S i and A r t are the text aggregation degree of S i and r t , respectively. N C S i and N r t is the number of characters in S i and r t . Here the variable N r t takes the value M. Given the text confidence map I m , which is obtained in Section 3.1, the text aggregation degree A S i is defined as the ratio of the number of white pixels in the polygon S i within I m to the area of S i , and A r t is the ratio of the number of white pixels in the text proposal r t to the area of r t .
The reason for considering the parameter ratio rather than the parameter itself of each part is that doing so can ensure the completeness of the text instance while ensuring the compactness.
Step 3: After k − th single loop ends, the individual particle optima Pbest i and the particle swarm global optimum Gbest k are found from the history, and each particle's velocity, position and distance are updated according to Equations (4)-(6), respectively, in which c 1 , c 2 represent the acceleration constants, µ 1 , µ 2 are uniformly distributed random numbers and ∆δ is white Gaussian noise.
When the optimization is completed, the optimal polygon will be obtained, which is composed of 14 equidistant points connected in series after the PSO-TSA optimal approximation.

ICMN
Even if the PSO-TSA algorithm is used to revise the EAST detection result from quadrilateral to polygon, the approximate polygon may not be a complete text instance but only a part of the text instance. To avoid incomplete scene text detection when processing long words or text lines, it is necessary to merge adjacent or intersecting text components that belong to the same text instance. Therefore, ICMN is proposed to solve the instance incomplete detection problem. The intuition behind the ICMN is that humans can easily judge whether two separate text components belong to a complete text instance and naturally perceive its shape without examining the entire instance; similarly, a specially designed neural network might also be able to infer the boundary points of a text instance.
The specific architecture of the ICMN is shown in Figure 3, which consists of an input layer, a pooling layer, two fully connected layers (FC) and an output layer. The input layer first accepts the boundary points of two adjacent polygons as individual features, both of which have a size of 1 × 28. In the pooling layer, these two features from the input are simply concatenated together to build a 1 × 56-sized snippet feature. There are two fully connected layers in the ICMN, the sizes of which are 56 × 48 and 48 × 30. Using fully connected layers in the ICMN reduces the dimensions of the pooled feature to the desired size of 1 × 30. The final output consists of two sibling parts: the first one (with 1 × 2 − sized) outputs two confidence scores indicating whether the two input components belong to the same text instance or not, and the second one (with 1 × 28 − sized) outputs 14 regression boundary points of the instance, half of which are upper boundary points and half of which are lower boundary points. Because each point has x and y coordinates, the ICMN outputs a total of 28 corresponding point coordinates. The salient aspects of the ICMN are that it can predict whether two components merge and can refine the text instance boundary points by spatial coordinate regression. For training the ICMN, we assign a binary class label (whether it belongs to an instance) to each text snippet. A positive label is assigned to a snippet if (i) two subsections in it belong to an instance or (ii) the Intersection over Union (IoU) is greater than or equal to 0.3. We design a multitask loss L to jointly train classification and coordinate regression.
Here, L cls is the loss for merging/non-merging classification, which is a standard softmax loss; L reg is the loss for spatial coordinate regression; and λ is a hyperparameter used to control over fitting. The regression loss is where L 2 distance is adopted; l * i is the label, which is 1 for positive samples and 0 for negative samples; N pos is the number of positive samples in a mini-batch; and N b is the number of all samples in a mini-batch. Here, P i j .x and P i j .y denote the x and y coordinates of the predicted j − th point P i j of the i − th sample, respectively, and P i * j .x and P i * j .y denote the x and y coordinates of the j − th ground truth point of the i − th sample, respectively. Before training, we obtain the real ground truth by the method proposed in [8]. The regression loss is calculated only for positive samples. To assist LOMT in dealing with long curved text, a small and light ICMN is used in conjunction with PSO-TSA. Two adjacent or intersecting polygons may be subparts of the same text instance, so they are input to the ICMN to decide whether to merge them into a more complete text instance. The input of the ICMN is 28 points formed by concatenating each pair of 14 points of adjacent polygons, and the outputs of the network are the merging flag of the two parts and the regressed 14 points of the more complete text polygon. After a series of merging steps, each complete text instance appearing in the scene can be detected.  [36]: IC15 was built for Challenge 4 of the ICDAR-2015 Robust Reading Competition and consists of 1000 training images and 500 test images. The images are acquired using Google Glass, and the text accidentally appears in the scene. The ground truth is annotated by a word-level quadrangle.

Experiment and Discussion
ICDAR2017 (IC17) [41]: IC17 contains 7200 training images, 1800 validation images, and 9000 test images with text in 9 languages for multilingual scene text detection. Similar to IC15, the text regions in IC17 are also annotated by the 4 vertices of quadrilaterals.
MSRA-TD500 (TD500) [37]: TD500 contains 500 natural images, which are split into 300 training images and 200 test images, collected both indoors and outdoors using a pocket camera. The images contain English and Chinese scripts. This dataset is mainly used for multidirectional text detection, in which text regions are annotated by rotated rectangles.
SCUT-CTW1500 (CTW1500) [8]: CTW1500 consists of 1000 training and 500 test images. Every image has curved text instances, which are annotated by polygons with 14 vertices. The annotation is given at the text-line level such that a complete sentence is annotated as a single polygon.
Total-Text (Total-Text) [10]: Total-Text contains 1555 scene images, which are divided into 1255 training images and 300 test images. This dataset mainly provides curved texts, which are annotated at the word level by polygon-shaped bounding boxes.

Implementation Details
In this section, we carry out experiments on curved and non-curved datasets to test our method. In actual experiments, while maintaining the aspect ratio, the longer sides of the images in IC15, IC17, and TD500 are resized to 768, 1024, and 768, respectively, the longer sides of the images within Total-Text and CTW1500 are resized to 1280 and 1024, respectively. All experiments are performed with a single image resolution. The results are evaluated using the standard PASCAL VOC protocol [42], which uses a 0.5 intersection-over union (IoU) threshold to determine true and false positives.
For training the text proposal detector, a union of the IC15 and IC17 dataset is used. The network is trained using an improved adaptive moments (ADAM) [43] optimizer. To speed up learning, we uniformly sample 512 × 512 crops from images to form a mini-batch of size 24. The learning rate of ADAM starts from 1 × 10 −3 , decays to one-tenth every 25,000 mini-batches, and stops at 1 × 10 −5 . The network is trained until the performance stops improving.
When executing the PSO-TSA model, the parameters are set as follows. The particle swarm population consists of I particles moving in the N-dimensional search space, in which I and N are set as 20 and 7, respectively. The radius of the sampling neighborhood, r1, is a certain proportion of the maximum of the average width and height of characters, which is set to 0.25 in the experiment. The distance range of the particle's position point and its two equidistant points is [d 1 , d 2 ], in which d 1 = 1.5 × r1 and d 2 = 2.5 × r2. The maximum number of iteration K and the minimum error ε min are set 20 and 1 × 10 −3 respectively. The particle's acceleration constants C 1 and C 2 are the same, both of which are set to 1.2, and both u 1 and u 2 are random numbers in (0, 1).
In the objective function of ICMN, the hyper-parameter λ was set as 1.5. During training of the ICMN network model, the negative-to-positive sample ratio was set to 10 in a minibatch, the learning rate was set to 0.005, the batch size was 128, and the number of iterations was set to 30,000. In text detection process, when the IoU of two polygons is greater than or equal to 0.3, the two polygons are input into ICMN network for further merging. Text (IC15, IC17 and TD500) We first conduct experiments on three multi-oriented text datasets, IC15, IC17, and TD500, to demonstrate that the proposed method performs well for multi-oriented scene text detection. For IC15, annotations are provided at the word level, so the test experiments are performed at the word level. For the IC17 and TD500 datasets, annotations are provided at the line level, without a post-processing step applied to generate word boxes, similar to the method in [34]. We evaluate the performance of the proposed method directly at the line level. To make a fair comparison with end-to-end methods [11,18], we include their detection-only results by referring to the original papers. We also compare the proposed method with other state-of-the-art methods, including five recent multi-oriented text detectors, EAST [9], TextBoxes++ [5], FTPN [7] and the methods in [25,32] and five arbitrarily shaped text detectors: TextSnake [33], PixelLink [23], PSENET-1s [17], CRAFT [34] and ICG (SegLink++) [13]. The comparison is given in Table 1. As depicted in Table 1, LOMT achieves 85.7%, 74.6%, and 82.7% in terms of the harmonic mean (H-mean) on IC15, IC17, and TD500, respectively, outperforming most competitors. Table 1. Results on quadrilateral-type datasets, such as IC15, IC17 and TD500. P, R and H refer to the precision, recall rate and H-mean, respectively. The best H score is highlighted in bold, and * denotes that the results based on multiscale tests.

P (%) R (%) H (%) P (%) R (%) H (%) P (%) R (%) H (%)
Zhang et al. [ IC15 is a dataset of complicated backgrounds, and the text size is small. As shown in the fourth to sixth columns of Table 1, the H-mean performance of LOMT is better than all other competitors, except that it is slightly lower than the CRAFT method. In addition, we demonstrate some test examples in Figure 4a, and LOMT can accurately locate text instances in different directions and different sizes.
IC17 is a large-scale multilingual text dataset. We next verify the ability of our method to detect multilingual text for IC17. As shown in columns 7 to 9 of Table 1, our method outperforms all other competing methods in terms of H-mean. Some qualitative detection results on the IC17 dataset are given in Figure 4b, and the results show that our LOMT method can accurately detect multilingual text such as English, Chinese and Korean in different directions and sizes.
On MSRA-TD500, we evaluate the performance of LOMT for detecting long and multilingual text lines. As shown in the results of columns 10 to 12 of Table 1, our method has certain advantages in the detection of long texts, and the H-mean is only 0.2% lower than CRAFT [34]. In addition, we demonstrate some test examples in Figure 4c, and our method can accurately locate long text instances with various orientations.

Experiments on Two Curve Text Datasets (Total-Text and CTW1500)
To test the ability of LOMT for the detection of curved text, we evaluate the detection results on both the Total-Text and CTW1500 datasets, which mainly contain curved text.
We first conduct text detection experiments on the Total-Text dataset. Total-Text was constructed in late 2017, and curved text was collected from various scenes, including text-like and low-contrast background scenes. Most images in this dataset contain at least one curved text, which is annotated in word-level granularity using a strictly 14-point polygon. To compare LOMT with other algorithms under the same conditions, the number of uniformly sampled points on the character centerline is set to 7 so that the polygon with the same 14 vertices as the ground truth. Columns 3 to 5 of Table 2 show the results of different methods on Total-Text. As shown in Table 2, the proposed method achieves 88.2%, 79.6%, and 83.7% in precision, recall and H-mean on Total-Text, respectively. The results show that our method outperformed recent state-of-the-art methods on Total-Text. Especially, compared with the text detector EAST, the H-mean is improved by 41.7%, and our method is slightly better than the excellent CRAFT algorithm. Examples of the detection results of the proposed method are illustrated in Figure 5. From the results, we can find that the proposed method can work well for text of any shape and length. In addition, we show some test examples of different methods on Total-Text in Figure 6, where the results in columns (a), (b) and (c) correspond to CTD + TLOC [8], CRAFT [34] and the proposed LOMT method, respectively. CTD + TLOC [8] detects a false text in the first row in Figure 6a, and all detection results are inaccurate. The detection results of CRAFT [34] shown in Figure 6b are relatively ideal, but the parts of a single character, such as the letter "C", are not completely contained in the detected text box. Compared to other methods, the proposed method has better completeness of the text instances and smoother boundaries of the polygons.   We continue to evaluate the results of LOMT on CTW1500, whose annotation is given at the text-line level such that a complete text instance is annotated as a single 14-point polygon. A comparison with the quantitative evaluations of other methods is depicted in Table 3, indicating that H-mean performance of the proposed LOMT method is 22.1% higher than the EAST [9] algorithm. Compared with CTD + TLOC [8], TextSnake [33], PSENet-1s [17] and Mask-TTD [29], H-mean improvement is 9.1%, 6.9%, 4.5% and 3.1%, respectively. The proposed LOMT is slightly better in H-mean performance than ICG (SegLink++) [13] and the method in [19] and is only 1.0% lower than CRAFT [34]. The results demonstrate that the proposed method can handle arbitrarily shaped line-level text detection. Some qualitative results are shown in Figure 7. The results show that LOMT can handle long text detection well.

Speed and Ablation Analysis
All the experiments are conducted on a PC using a single NIVIDA GeForce GTX 1080 graphics card with the Pascal architecture and an Intel(R) Core(TM) i7-6700K @ 4.00 GHZ CPU. The speed of PSO-TSA and ICMN are 20.0 FPS and 30.9 FPS, respectively, which can basically achieve real-time running. Although the H-mean performance of the CRAFT algorithm is currently the most competitive compared with other state of the arts, its average frame rate is only 3.3.
The proposed method consists of four modules: text proposal detection, candidate character extraction using MSER, text shape approximations by PSO-TSA, and text subsection merging via ICMN. The first two modules are the prerequisites for PSO-TSA. Without the extracted characters, the PSO-TSA module cannot be executed independently. That is, the contribution of the character extraction module is included in the performance improvement of PSO-TSA. Therefore, we need to verify only the effectiveness of the three modules, namely, text proposal detection, PSO-TSA and ICMN. Table 3 shows the ablation experiment results on two curved text benchmarks. From Table 3, for the CTW1500 and Total-Text datasets, compared to the text proposal detection module, the improvements in the H-mean performance of the PSO-TSA module are 10.8% and 39.7%, respectively. Furthermore, compared with the PSO-TSA module without ICMN, the gain of the H-mean performance of LOMT is 10.3% and 2.0%. In particular, it should be emphasized that compared to the general regression-based detection module, EAST, the overall performance of our algorithm can be improved by at least 21.1% and up to 41.7%. The experimental results demonstrate that each module in Table 3 has a certain contribution to the overall performance, thus verifying the necessity of each part.

Disadvantages and Advantages
The main disadvantage of the proposed method is not an end-to-end framework. The traditional MSER method is used in our character extraction module, and the accuracy rate is 68.3% in CTW1500 data set, while the accuracy rate of CRAFT algorithm is 75.7%. Due to unsatisfactory text detection performance of MSER, the accuracy of the proposed method is not good enough compared to the existing methods, especially CRAFT.
Although the performance of our method is slightly inferior to the CRAFT method, one of the biggest advantages of our method is that it does not depend too much on a specific data set. In order to test the generalization of the algorithm, Table 4 lists the comparison results of the generalization experiment of the two methods, in which CRAFT-IC15-20k.pth and CRAFT-MLT-25k.pth denote the CRAFT models trained on IC15 and IC13 + IC17 data sets, respectively. It can be seen from Table 4 that the model CRAFT-IC15-20k.pth performs well on the IC15 dataset, but its performances on the other four datasets have dropped a lot, and the performances of the model CRAFT-MLT-25k.pth on the IC15 and IC17-MLT data sets are also inferior to our method. In other words, the CRAFT model is strongly based on training samples, and its generalization ability is not particularly ideal. Fortunately, neither our character extraction module nor our PSO-TSA module depends on a specific dataset; consequently, they are highly applicable to any dataset. ICMN module can be readily plugged into to any regression-based text positioning module to complete the text instance detection.

Conclusions
In this paper, we have proposed a straightforward and efficient instance-aware curved text detector, which is composed of quadrilateral text proposal detection, character extraction, particle swarm optimization-based text shape approximation (PSO-TSA), and an instance-aware component merging network (ICMN). Without training or learning in advance, PSO-TSA can adjust the quadrilateral box obtained by a regression-based text detector to a polygon, thus approaches the shape of the text. ICMN can merge two adjacent or intersecting text components into one single text instance that is more complete. A large number of experiments have been carried out on five scene text detection benchmark datasets, H-mean performance achieve to 85.7%, 74.6%, 82.7%, 82.5% and 83.7% for IC15, IC17, MSRA-TD500, CTW1500 and Total-Text dataset, respectively. As the two core components of the text detector, the average running speeds of PSO-TSA and ICMN modules are 20.0 FPS and 30.9 FPS, respectively. Experimental results demonstrate that the proposed algorithm has competitive advantages in H-mean performance and execution speed.
In the future, we would like to replace the MSER extraction with more elegant character detection method to further improve the efficiency of the algorithm.
Funding: This work was funded by the National Natural Science Foundation of China (NSFC), grant number 61671376 and 61771386.