Scene Text Detection in Natural Images: A Review

Abstract: Scene text detection is attracting more and more attention and has become an important topic in machine vision research. With the development of mobile IoT (Internet of things) and deep learning technology, text detection research has made significant progress. This survey aims to summarize and analyze the main challenges and significant progress in scene text detection research. In this paper, we first introduce the history and progress of scene text detection and classify the traditional methods and deep learning-based methods in detail, pointing out the corresponding key issues and techniques. Then, we introduce commonly used benchmark datasets and evaluation protocols and identify state-of-the-art algorithms by comparison. Finally, we summarize and predict potential future research directions.


Introduction
Scene text detection (STD) is the process of detecting the presence and position of text in scene images. STD not only acts as a detection and positioning tool but also plays a key role in extracting important high-level semantic information from scene images. It has important applications in intelligent transportation systems [1], content-based image retrieval [2], industrial automation [3], portable vision systems [4,5], etc.
Evolution of STD. The concept of "scene text detection" first appeared in the 1990s [6][7][8]. With the rapid development of Internet technology and portable mobile devices, more and more scenarios have emerged in which text must be extracted from images. At present, scene text detection has become a significant aspect of computer vision and pattern recognition, as well as a research hotspot in the field of document analysis and recognition. Several top international conferences, such as ICDAR (International Conference on Document Analysis and Recognition), ICCV (International Conference on Computer Vision), ECCV (European Conference on Computer Vision), and AAAI (AAAI Conference on Artificial Intelligence), list scene text detection as a separate research topic.
Motivation for writing this review. In 1998, LeCun et al. designed the LeNet5 model [9], which achieved a 99.1% recognition rate on the MNIST dataset. In recent years, deep learning (DL) has attracted significant attention due to its success in various domains, and DL-based STD methods with minimal feature engineering have been flourishing. A considerable number of studies have applied deep learning to STD and successively advanced the state-of-the-art performance [10][11][12][13][14][15][16][17]. This trend motivates us to review the current status of STD research.
Contributions of this review. We thoroughly review the technological development of STD in order to inspire and guide researchers and practitioners in the field. Specifically, in Sections 3 and 4, we categorize STD techniques into traditional detection methods and deep learning-based detection methods and review representative techniques from both categories. Then, in Sections 5 and 6, we provide a comprehensive survey of STD benchmark datasets and evaluation methods. In addition, we summarize and analyze the most representative DL approaches for STD in Section 6.2. Finally, we introduce the challenges of STD and outline future directions for the field.

Background
In this section, we first introduce the definition of STD and then summarize the important features of scene text in natural images.

What Is STD?
Scene text detection locates text in complex scene images. Examples include text detection in books, ID cards, and tickets, as well as in intelligent traffic scenarios such as road signs and license plates. Figure 1 gives an overview of the text detection process. We summarize a unified three-stage framework that most existing models fit into. The three stages are as follows:
1. Transformation. In this stage, the input image is transformed into a new one using a spatial transformation network (STN) [18] framework, while any text contained in it is rectified. The rectification process facilitates the subsequent stages. It is powered by a flexible thin-plate spline (TPS) transformation, which can handle a variety of text irregularities [19] and diverse aspect ratios of text lines [20].
2. Feature Extraction. This stage maps the input image to a representation that focuses on the attributes relevant for character recognition while suppressing irrelevant features, such as font, color, size, and background. With convolutional networks entering a phase of rapid development after AlexNet's [21] success at the 2012 ImageNet competition, Visual Geometry Group Network (VGGNet), GoogLeNet [22], ResNet [23], and DetNet [24] are often used as feature extractors.
3. Prediction. This stage predicts the position of the text in the image, usually expressed as coordinate points.

Features of Scene Text
Text detection in natural scene images is much more difficult than text detection in scanned document images because of the diversity of forms in which the text may occur. The main features of the scene text are summarized below:

The most representative connected component analysis (CCA)-based methods are the stroke width transform (SWT) [48] and maximally stable extremal regions (MSER) [49]. The SWT was first proposed in [48] (Figure 2) in 2010; it obtains candidate text regions based on the assumption that strokes located in the same text area have approximately equal widths. The Canny algorithm was used to detect the edges of the input image, then the gradient direction of the edge pixels was calculated, and the algorithm searched for matching pixels along the path of the gradient direction. The length of the search path between matching pixels p and q was taken as the stroke width w between the two pixels. The method is simple, local, and data-dependent (see Figure 3), which makes it powerful enough to detect text in multiple fonts and languages. In [50], a method based on feature vectors of connected components generated through SWT was proposed. The feature vectors were built from properties such as the directionality of the text edge gradients, high contrast with the background, and the geometry of the text components, together with the attributes found by the stroke width transform. In [51], text and non-text regions were analyzed on three levels: pixel, component, and text line. The stroke feature transform (SFT), which extends the SWT, was used as a low-level filter to determine whether a pixel belongs to text or not. In [52], an algorithm was proposed which used multiple low-level visual features to learn a model, which eventually provided a text attention map indicating candidate regions of text in the image. During detection, the text detector using SWT focused only on these selected image regions to reduce computation time and improve detection performance. In [53], an algorithm based on SWT was used to extract arbitrary text in natural scene images. References [54,55] involved a modified SWT for detecting scene text.
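The equal-stroke-width assumption behind the SWT can be illustrated with a small sketch. This is not the algorithm of [48]: instead of Canny edges and gradient-direction rays, it assumes an already binarized image and approximates the stroke width of each pixel by the shorter of its horizontal and vertical run lengths. The function names `stroke_widths` and `has_uniform_stroke`, and the `max_ratio` threshold, are hypothetical.

```python
def stroke_widths(img):
    """Approximate per-pixel stroke width on a binary image (1 = stroke pixel).

    Assumption: each stroke pixel takes the shorter of its horizontal and
    vertical run lengths, a simplification of gradient-direction ray search.
    """
    h, w = len(img), len(img[0])
    widths = [[0] * w for _ in range(h)]

    def record(y, x, run):
        # Keep the smallest run length seen so far for this pixel.
        if widths[y][x] == 0 or run < widths[y][x]:
            widths[y][x] = run

    for y in range(h):  # horizontal runs
        x = 0
        while x < w:
            if not img[y][x]:
                x += 1
                continue
            start = x
            while x < w and img[y][x]:
                x += 1
            for k in range(start, x):
                record(y, k, x - start)

    for x in range(w):  # vertical runs
        y = 0
        while y < h:
            if not img[y][x]:
                y += 1
                continue
            start = y
            while y < h and img[y][x]:
                y += 1
            for k in range(start, y):
                record(k, x, y - start)
    return widths


def has_uniform_stroke(widths, max_ratio=2.0):
    """True if the non-zero stroke widths vary by at most max_ratio."""
    values = [v for row in widths for v in row if v > 0]
    return bool(values) and max(values) / min(values) <= max_ratio


# A vertical bar two pixels wide and five pixels tall.
bar = [[0, 0, 1, 1, 0, 0] for _ in range(5)]
bar_widths = stroke_widths(bar)
```

A component whose widths pass a check like `has_uniform_stroke` would be kept as a text candidate; the ratio used here is only an illustrative choice.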
In [49], the maximally stable extremal regions (MSER) [56] were leveraged to detect candidate text regions in scene images. This approach offers robustness to geometric changes, noise, and illumination variations. The MSER method has been employed in several studies [16,57,58] that achieved excellent text detection performance on complex scene images. In [50], a text detection method based on color-enhanced contrast extremal regions (CERs) and neural networks was presented. The method used CERs to extract candidate text regions using a six-component tree which was produced from color images, and classified them into text regions and non-text regions using a neural network. In [59], a text detection method that combines extremal regions (ER) and corner-HOG features was presented. In [52], a novel method was proposed based on a conditional random field (CRF) pipeline, which used a convolutional neural network (CNN) to estimate the confidence of the MSER being a candidate text region. In [60,61], methods combining MSER and SWT were proposed to achieve better text detection performance.
It is worth mentioning that scene text detection methods based on CCA mainly extract connected regions as candidate text regions, thus effectively reducing the search scope for natural scene text. However, this type of method relies heavily on the detection of connected text regions. In fact, in scene images with complex backgrounds, noise interference, low contrast, and color variation, it is difficult to detect connected regions of text accurately. At the same time, it is also very difficult to design a reasonable analyzer for connected regions. In summary, CCA-based detection methods are difficult to implement and not robust in detecting text from scene images.

Deep Learning Approaches for STD
Deep learning methods automatically extract text features by training a model and are particularly suitable for object detection, speech recognition, and other pattern recognition problems. Typical deep learning networks include deep belief networks (DBNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and capsule networks. Compared to manually designed algorithms for extracting and classifying candidate text regions, deep learning-based methods are more effective, simpler, and more robust. The boom of deep learning has also led to the development of successful techniques for scene text detection. In general, the main deep learning-based text detection methods can be classified into three categories: region proposal-based methods, image segmentation-based methods, and hybrid methods. Table 1 compares some of the current state-of-the-art techniques in this field.

Region Proposal-Based Methods
Region proposal-based text detection methods adopt a general object detection framework, often regressing text boxes to obtain regional text information. In [85], a method based on a region proposal mechanism for text detection was proposed, while in [86], the Faster R-CNN [87] was improved using the inception structure proposed in GoogLeNet. This resulted in the inception region proposal network (InceptionRPN), which obtains text candidate regions, uses a text detection network to remove background regions, and finally votes on the overlapping detected regions to obtain the best results. In [75], a new method called the rotational region CNN (R2CNN) was proposed for detecting arbitrarily-oriented text in scene images. In [74], a novel connectionist text proposal network (CTPN) was proposed for localizing text lines in scene images, while a vertically-regressed proposal network (VRPN) was proposed in [88], which could match text regions using multiple neighboring small anchors. In [76], the rotation region proposal network (RRPN) was proposed to detect arbitrarily-oriented text by rotating text region proposals.
Rectangular bounding boxes or quadrangles have been adopted to describe text. In [89], an end-to-end trainable one-stage algorithm similar to a single shot multibox detector (SSD) [90] was proposed. Reference [81] was also based on an SSD object detection framework, where the rectangular box representation of conventional object detectors was replaced with a quadrilateral or rotated rectangle representation. In [91], a quadrilateral window (not a rectangle) was used to detect text in arbitrary orientations. A quadrilateral region proposal network (QRPN) was proposed in [92] for generating quadrilateral proposals based on a novel quadrilateral regression algorithm. In [82], two separate network branches were used to extract different text characteristics for text detection and oriented bounding box regression. In [93], corners were employed to estimate the possible locations of text instances, while an embedded data augmentation module inside a region-wise subnetwork was employed.
To achieve high coverage of the target box, in [94], a learning mechanism was proposed that integrates a two-stage R-CNN framework into a single-stage detector and uses the learned anchors instead of the original anchors in the final prediction. Existing deep neural network-based text detection methods use multi-scale filters and feature layers to detect multi-scale text. Reference [95] proposed a text detector named the short path network (SPN) to use low-level semantic features to complement the propagated loss of deep features. In [96], a multi-scale shape regression network (MSR) was presented, which was capable of locating text lines of different lengths, shapes, and curvatures in scenes. A scale-insensitive adaptive region proposal network (Adaptive-RPN) was proposed in [97] to generate text proposals by focusing only on the intersection over union (IoU) values between predicted and ground-truth bounding boxes. A scale-transfer module and a scale-relationship module were proposed in [98] to handle the problem of scale variation. A novel text detector, namely Look More than Once (LOMO), was presented in [99] to detect long text and arbitrarily shaped text in scene images by considering the geometric properties of the text instance, including the area, text center line, and border offsets to identify irregular text.
When trained with rigid word-level bounding boxes, the abovementioned methods exhibit limitations in analyzing text regions of an arbitrary shape. In [100], a one-stage model named convolutional character network (CharNet) was proposed, which predicts the bounding boxes of words and characters directly. In [101], a two-stage method called omnidirectional pyramid mask proposal text detector (OPMP) was proposed, which uses an effective pyramid sequence modeling method to produce arbitrary-shaped proposals. In [64], a novel adaptive Bezier-curve network (ABCNet) was presented to detect arbitrarily-shaped text in scene images.
This group of methods usually includes two parts: classification and regression of text candidate regions. In one-stage detectors, these candidate regions are generated by sliding windows; in two-stage detectors, the candidate regions are proposals generated by an RPN, but the RPN itself still classifies and regresses proposals generated by sliding windows. To improve the accuracy of text detection, it is often necessary to manually design anchors of various scales, aspect ratios, and even orientations to better surround the text area, which makes region proposal-based methods complicated and inefficient. The anchor mechanism is not effective enough for scene text detection, which can be attributed to its IoU-based matching criterion between anchors and ground-truth boxes.
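The effect of the IoU-based matching criterion on elongated text can be illustrated with a short sketch. The anchor generator below is a simplified, hypothetical one (a single location, area fixed at scale squared, ratio defined as width over height), not the configuration of any cited detector; the box coordinates are made-up values.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def make_anchors(cx, cy, scales, ratios):
    """Anchors centred at (cx, cy); ratio = width / height, area = scale ** 2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def best_iou(gt, anchors):
    """Highest IoU between a ground-truth box and any anchor."""
    return max(iou(gt, a) for a in anchors)


# A long text line: 200 x 20 pixels, aspect ratio 10:1 (illustrative values).
text_line = (0.0, 40.0, 200.0, 60.0)
generic = make_anchors(100, 50, scales=[64], ratios=[0.5, 1, 2])
wide = make_anchors(100, 50, scales=[64], ratios=[0.5, 1, 2, 5, 10])

print(round(best_iou(text_line, generic), 3))  # stays below a 0.5 threshold
print(round(best_iou(text_line, wide), 3))     # a 10:1 anchor fits the line
```

With object-style aspect ratios, no anchor reaches an IoU of 0.5 with the text line, so the line would receive no positive anchor; adding text-specific ratios fixes this at the cost of more hand-designed anchors.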

Image Segmentation-Based Methods
Image segmentation is an important part of image processing and machine vision techniques for image analysis and is a hot research topic today. It classifies an image at the pixel level, determining the category of each pixel and dividing the image into regions. It is currently widely used in medical imaging, automated driving, UAV assistance, remote sensing, and other applications. Scene text detection can also be regarded as pixel-level text/non-text classification, so image segmentation algorithms, such as semantic segmentation and instance segmentation, can be used to handle this challenge. In [13,68], image segmentation-based methods were proposed that used a fully convolutional network (FCN) [102] for pixel-level multi-oriented text detection. In [93], an algorithm for word-level text detection consisting of two cascaded CNNs was presented. The first network was fully convolutional and responsible for detecting regions containing text, while the second network predicted oriented rectangles containing single-word regions. A novel progressive scale expansion network (PSENet) was presented in [103], which gradually expanded the detection region from small kernels to large ones, so that complete text instances were detected through multiple semantic segmentation maps. The approach is robust, being able to detect arbitrarily shaped text and to separate closely adjacent text instances. In [79], the segment linking (SegLink) method was introduced, which decomposed text into two components, namely segments and links. Instance-aware component grouping (ICG) for arbitrary-shape text detection was presented in [104], while in [71], an instance segmentation-based method was proposed, which separated text instances lying very close to each other by linking pixels within the same instance. In [84], a new scene text detection method was proposed to detect text areas effectively by exploring each character and the affinity between characters.
In [105], a Mask R-CNN-based text detector was used to suppress false detections caused by background noise more effectively, using the pyramid attention network (PAN) as a new backbone network. A novel framework with the local segmentation network (LSN) was presented in [106], followed by curve connection to detect text in horizontal, oriented, and curved forms. An efficient and accurate arbitrary-shaped text detector, named the pixel aggregation network (PAN), was proposed in [107], which was equipped with a segmentation head made up of a feature pyramid enhancement module (FPEM) and a feature fusion module (FFM). In [14], a method based on Mask R-CNN, named pyramid mask text detector (PMTD), was proposed, which used location-aware information to generate text masks instead of binary text masks. In [66], a module named differentiable binarization (DB) was proposed, which could perform the binarization process in a segmentation network. An FCN-based method named TextEdge was proposed in [108], where the text-region edge map was used as a segmentation mask. A segmentation-based detector named instance segmentation network (ISNet) was introduced in [12], which linearly combines a generation mask and mask coefficients for fast text localization. A segmentation-based method that used polygon offsetting combined with border augmentation to detect text in natural images was presented in [109], while in [110], a novel character candidate extraction method based on super-pixel segmentation and hierarchical clustering was introduced. A novel scene text detection technique making use of semantics-aware text borders and bootstrapping-based text segment augmentation was presented in [111]. In [112], an instance segmentation-based framework was presented, which extracted each text instance as a separate connected component and introduced a shape-aware loss for adaptive multi-scale text instances when training the detection model.
Image segmentation-based methods for text/non-text classification at the pixel level have become mainstream for detecting text with multiple orientations and arbitrary shapes. However, this group of methods often requires time-consuming and complex post-processing to deal with complicated cases such as touching or overlapping text.
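The simplest form of this post-processing is connected-component grouping of the predicted text pixels. The sketch below is a minimal illustration under the assumption of a binary text/non-text mask and 4-connectivity; `components_to_boxes` is a hypothetical helper, not the post-processing of any cited method. It also shows why touching text is hard: two words joined by a single text pixel collapse into one box.

```python
from collections import deque


def components_to_boxes(mask):
    """Group text pixels (value 1) into 4-connected components.

    Returns one bounding box (x1, y1, x2, y2) per component, with inclusive
    pixel coordinates, in raster-scan order of each component's first pixel.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited text pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            x1 = x2 = x
            y1 = y2 = y
            while queue:
                cy, cx = queue.popleft()
                x1, x2 = min(x1, cx), max(x2, cx)
                y1, y2 = min(y1, cy), max(y2, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x1, y1, x2, y2))
    return boxes


# Two words separated by a background column.
separate = [[1, 1, 0, 1, 1],
            [1, 1, 0, 1, 1]]
# The same words joined by one text pixel in the top row.
touching = [[1, 1, 1, 1, 1],
            [1, 1, 0, 1, 1]]

print(components_to_boxes(separate))
print(components_to_boxes(touching))
```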

Hybrid Methods
To detect scene text more efficiently in more complex situations, some researchers have combined the previous methods. In [113], a novel anchor-free region proposal network (AF-RPN) was proposed to replace the original anchor-based RPN and speed up text detection. In [114], a new framework for text detection named "Simple but Accurate" (SA-Text) was introduced, which utilizes heatmaps to detect text regions in scene images effectively. SA-Text detects text that occurs in various fonts, shapes, and orientations in scene images with complicated backgrounds. In [67], a new pipeline was proposed that directly predicts words or text lines of arbitrary orientation and quadrilateral shape from natural images through a single network, eliminating unnecessary post-processing. A pixel-wise method named TextCohesion for scene text detection was proposed in [115], which splits a text instance into five key components: a text skeleton and four directional pixel regions. A novel conditional spatial expansion (CSE) mechanism to improve the performance of text detection by using a region expansion algorithm was introduced in [116]. CSE starts with a seed arbitrarily initialized within a text region and progressively merges neighborhood regions based on local features extracted by a CNN and contextual information of merged regions.
The advantages and disadvantages of the three kinds of methods for text detection are summarized in Table 2.

STD Resources: Datasets
High-quality data and textual annotations are essential for both model learning and evaluation. Below, we summarize the most widely used benchmark datasets. A comprehensive list is provided in Tables 3 and 4.
ICDAR 2003 (IC03) [117]. This is the first benchmark released for scene text detection and recognition from the ICDAR Robust Reading Competition. There are 258 natural images for training and 251 natural images for testing. All the text instances in this dataset are in English and are horizontally placed.
Street View Text (SVT) [27]. This dataset consists of 350 images from Google Street View annotated with word-level axis-aligned bounding boxes. It contains smaller and lower-resolution text, and not all text instances within it are annotated.
KAIST [118]. This dataset comprises 3000 images captured in different environments, including outdoor and indoor scenes, under different lighting conditions (clear day, night, strong artificial lights, etc.). The images were captured using either a high-resolution digital camera or a low-resolution mobile phone camera. All images have been resized to 640 × 480 pixels.
ICDAR 2011 [119]. This dataset inherits from ICDAR 2003 and includes some modifications. There are 229 scene images for training and 255 scene images for testing.
MSRA-TD500 (M500) [120]. This dataset contains 500 natural scene images in total, with 300 images intended for training and 200 images for testing. It provides text-line-level annotation and polygon boxes for text region annotation. It contains both English and Chinese text instances.
ICDAR 2013 (IC13) [121]. This is also a modified version of ICDAR 2003. There are 229 natural images for training and 233 natural images for testing.

USTB-SV1k [122]. It contains 1000 street images from Google Street View with 2955 text instances in total. It provides word-level annotations, and it only considers English words.
ICDAR 2015 (IC15) [25]. It contains 1500 scene images, 1000 for training and 500 for validation/testing. The text instances (annotated using four quadrangle vertices) are usually skewed or blurred since they were acquired without users' prior preference or intention. Specifically, it contains 17,548 text instances. It provides word-level annotations. IC15 is the first incidental scene text dataset, and it is an English dataset.
COCO-Text [123]. It is the largest benchmark that can be used for text detection and recognition so far. The original images are from the Microsoft COCO dataset, and 173,589 text instances from 63,686 images are annotated. There are 43,686 images for training and 20,000 images for testing, including handwritten and printed, clear and blurry, English and non-English text.
SynthText [124]. It contains 858,750 synthetic images, in which text with random colors, fonts, scales, and orientations is carefully rendered onto scene images to give a realistic look. The text in this dataset is annotated at the character, word, and line level.
Chinese Text in the Wild (CTW) [125]. This dataset contains 32,285 high-resolution street view images of Chinese text, with 1,018,402 character instances in total. All images are annotated at the character level, including each character's underlying character type, bounding box, and six other attributes. These attributes indicate whether the background is complex, whether the character is raised, whether the text is hand-written or printed, whether it is occluded or distorted, and whether word art has been used. The dataset is the largest one to date and the only one that contains detailed annotations. However, it only provides annotations for Chinese text and ignores other scripts, e.g., English.
RCTW-17 [126]. This dataset contains various kinds of images, including street views, posters, menus, indoor scenes, and screenshots for competition on detecting and recognizing Chinese text in images. The dataset contains about 8000 training images and 4000 test images, with annotations similar to ICDAR2015.
Total-Text (ToT) [127]. This dataset contains a relatively large proportion of curved text, compared to the few instances in the previous datasets. These images were mainly obtained from street billboards and annotated as polygons with a variable number of vertices.
SCUT-CTW1500 [128]. This dataset contains 1500 images in total, 1000 for training and 500 for testing, with 10,751 cropped word images. Annotations in CTW-1500 are polygons with 14 vertices. The dataset mainly consists of Chinese and English words.
ICDAR 2017 MLT (MLT17) [129]. It is a large-scale multi-lingual text dataset, which contains 18,000 natural scene images in total, with 7200 training images, 1800 validation images, and 9000 testing images. It provides word-level annotation.
ICDAR 2019 Arbitrary-Shaped Text (ArT19) [130]. ArT consists of 10,166 images, 5603 for training, and 4563 for testing. They were collected with text shape diversity in mind, and all text shapes (i.e., horizontal, multi-oriented, and curved) have a high number of instances.
ICDAR 2019 MLT (MLT19) [131]. This dataset contains 18,000 images in total, with word-level annotation. Compared to MLT17, this dataset covers 10 languages. It is a more realistic and complex dataset for scene text detection.
ICDAR 2019 Large-scale Street View Text (LSVT19) [132]. This dataset consists of 20,000 testing images and 30,000 training images with full annotations, and 400,000 training images with weak annotations, which are referred to as partial labels.

Evaluation Metrics for STD
In this section, we summarize evaluation protocols for text detection algorithms. The task of text detection is commonly evaluated using the ICDAR protocols, the AP-based protocol, and the TIoU metric, which are analyzed in the following paragraphs. The evaluation methods mainly consider three performance parameters, namely, Precision (P), Recall (R), and the overall evaluation index (F-measure, F). Commonly, recall and precision are calculated before the F-measure, although the calculation methods differ between protocols. Determining whether two bounding boxes match is conceptually straightforward but not simple in practice. There are four ways in which two bounding boxes can match, as shown in Figure 4.

Figure 4. Four match types between ground truth and detected rectangles: (a) one-to-one match; (b) one-to-many matches with one ground truth rectangle; (c) one-to-many matches with one detection rectangle; (d) many-to-many matches.

ICDAR 2003 (IC03) Evaluation Metric.
Given a set of ground-truth targets G and a set of detection targets D, the IC03 metric calculates precision, recall, and the standard F-measure for one-to-one matches, as shown in Figure 4a. The standard F-measure combines precision and recall into a single measure of quality. The relative weights of these are controlled by α, which we set to 0.5 to give equal weight to precision and recall. In these definitions, BestMatchG and BestMatchD indicate the result of the closest match between the detection and ground truth rectangles.
ICDAR 2013 (IC13) Evaluation Metric. IC03 only considers one-to-one match types, which is simple but cannot handle all detection cases, so in IC13, a new evaluation method was used: DetEval. The new method takes into account one-to-one, one-to-many, and many-to-one cases but does not handle many-to-many cases. The criteria of these two evaluations are based on mutual overlap rates between detection ({Dj}j) and ground truth ({Gi}i), where tp and tr are the thresholds of precision and recall, respectively. A parameter function controls the amount of punishment, and it is often set to 0.8.
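The α-weighted F-measure described for the IC03 metric can be sketched as follows. The sketch assumes the standard form F = 1 / (α/P + (1 − α)/R), which reduces to the harmonic mean of precision and recall for α = 0.5; the function name `f_measure` and the example values are hypothetical.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall.

    alpha = 0.5 weights both equally; alpha closer to 1 emphasises precision.
    Returns 0.0 when either input is zero, to avoid division by zero.
    """
    if precision <= 0 or recall <= 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# With alpha = 0.5 the result equals 2PR / (P + R).
print(round(f_measure(0.8, 0.6), 4))
```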
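ICDAR 2015 (IC15) IoU Metric. The IC15 metric [8] follows the same metric as Pascal VOC. To be considered a correct detection, the Intersection-over-Union (IoU) between a detection and a ground-truth region, defined in Equation (12) as the ratio of the area of their intersection to the area of their union, must exceed 0.5.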
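A minimal sketch of IoU-thresholded evaluation is given below. It makes two simplifying assumptions that are not part of the IC15 protocol as described: boxes are axis-aligned rectangles rather than quadrilaterals, and detections are matched greedily, in input order, to the unmatched ground truth with the highest IoU. The name `evaluate_iou` and the example boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate_iou(gts, dets, threshold=0.5):
    """Return (precision, recall) with one-to-one matching at IoU > threshold."""
    matched = set()
    true_positives = 0
    for det in dets:
        best_value, best_index = 0.0, None
        for i, gt in enumerate(gts):
            if i in matched:
                continue  # each ground truth can be matched only once
            value = iou(gt, det)
            if value > best_value:
                best_value, best_index = value, i
        if best_index is not None and best_value > threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(dets) if dets else 0.0
    recall = true_positives / len(gts) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [(1, 0, 11, 10),    # IoU 90/110 with the first ground truth: correct
        (20, 0, 24, 10),   # IoU 0.4 with the second ground truth: rejected
        (50, 50, 60, 60)]  # overlaps nothing: false positive
precision, recall = evaluate_iou(gts, dets)
```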

AP-Based Evaluation Methods
To avoid fine-tuning the output detection confidence, datasets such as RCTW-17 [26] have adopted interpolated average precision as the main detection evaluation metric. For a given task and class, the precision-recall curve is computed from the method's ranked output. This metric relies on the IoU metric to determine precision and recall in advance.
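Interpolated average precision can be sketched as below. The exact interpolation scheme is not specified here, so the sketch assumes the Pascal VOC-style 11-point variant: precision is sampled at recall levels 0, 0.1, ..., 1.0, taking at each level the highest precision reached at that recall or beyond. The input format (a list of confidence and true-positive flags, already decided by IoU matching) and the name `interpolated_ap` are hypothetical.

```python
def interpolated_ap(scored_hits, num_gt, recall_levels=11):
    """Interpolated average precision from ranked detections.

    scored_hits: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth instances (must be positive).
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    curve = []  # (recall, precision) after each ranked detection
    true_positives = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        curve.append((true_positives / num_gt, true_positives / rank))
    total = 0.0
    for k in range(recall_levels):
        level = k / (recall_levels - 1)
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in curve if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / recall_levels


# Three ranked detections, two ground-truth instances.
ap = interpolated_ap([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

Because the whole ranked list is used, the score does not depend on choosing a single confidence cut-off.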

Tightness-Aware Intersection-Over-Union (TIoU) Evaluation Protocol
The existing metrics exhibit some drawbacks: (1) they are not goal-oriented; (2) they cannot recognize the tightness of detection methods; (3) existing one-to-many and many-to-one solutions involve inherent loopholes and deficiencies. Previous metrics rely heavily on an IoU threshold, which leads to unreasonable results in some cases, such as those shown in Figure 5. To address these shortcomings, the TIoU approach involves three annotation concepts to enhance the focus on detecting text content: (a) the annotation does not cut the text instance; (b) the annotation contains less background noise, especially outlier text instances; (c) even if annotations do not match the text instance perfectly, they should be as close to perfect as possible.
The text instances detected in Figure 5a,b have the same IoU value (0.66) against the ground truth (GT), although the former misses a few characters of the GT. To solve this issue, the cutting behavior can be penalized using the corresponding proportion of intersection in the GT, as shown in Equations (13) and (14). The proposed solution also aims to penalize detections that include outlier-GTs in the same detected region, which encourages compact detections. Nevertheless, as shown in Figure 5c, if the outlier-GTs are inside the target GT region, even a perfect detection bounding box cannot avoid containing these outliers. Therefore, only the outlier-GT region that is inside the detection bounding box but outside the target GT region is penalized. The area of the union of all eligible outlier-GTs is calculated using Equation (15).
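The cutting penalty can be illustrated with a short sketch. It assumes that the penalized score is the IoU scaled by the fraction of the GT area covered by the detection, which is one reading of "the corresponding proportion of intersection in GT"; the outlier-GT penalty is not covered. Boxes are axis-aligned rectangles, and the name `tiou_recall` and the example coordinates are hypothetical.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); 0 if degenerate."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def iou(gt, det):
    inter = intersection_area(gt, det)
    union = area(gt) + area(det) - inter
    return inter / union if union > 0 else 0.0


def tiou_recall(gt, det):
    """IoU scaled down by the fraction of the ground truth that was cut off."""
    gt_area = area(gt)
    if gt_area == 0:
        return 0.0
    cut_fraction = (gt_area - intersection_area(gt, det)) / gt_area
    return iou(gt, det) * (1.0 - cut_fraction)


gt = (0, 0, 30, 10)
det_cut = (0, 0, 20, 10)    # misses the last third of the text
det_loose = (0, 0, 30, 15)  # covers all of the text plus some background

print(round(iou(gt, det_cut), 3), round(iou(gt, det_loose), 3))
print(round(tiou_recall(gt, det_cut), 3), round(tiou_recall(gt, det_loose), 3))
```

Both detections have the same IoU of 2/3, mirroring the situation in Figure 5a,b, but only the one that cuts the text is penalized.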

Discussion
In this part, we briefly summarize the strengths and drawbacks of commonly used evaluation methods for scene text detection. Details are shown in Table 5.

Table 5. Strengths and weaknesses of commonly used evaluation protocols.

Evaluation Protocol | Match Type | Strength and Weakness
IC03 Evaluation Protocol | One-to-one | The IC03 metric calculates precision, recall, and the standard F-measure for one-to-one matches. However, it is unable to handle one-to-many and many-to-many matches between the ground truth and detections.
IC13 Evaluation Protocol | One-to-one, one-to-many, many-to-one | This method takes into account one-to-one, one-to-many, and many-to-one cases but cannot handle many-to-many cases.
IC15 Evaluation Protocol | One-to-one, one-to-many, many-to-one | This method uses the ICDAR15 intersection over union (IoU) metric and IoU ≥ 0.5 as a threshold for counting a correct detection. This method is the most commonly used evaluation method and is simple to calculate.
TIoU Evaluation Protocol | One-to-one, one-to-many, many-to-one, many-to-many | This method can quantify the completeness of ground truth, the compactness of detection, and the tightness of the matching degree. However, it is relatively complex to calculate.

Results on Benchmark Datasets
In this section, we present the results of representative text detection methods on some public datasets. The evaluation uses the Precision (P), Recall (R), and F-measure (F) metrics. Because different methods may be evaluated on different benchmark datasets, and even on the same dataset may adopt different training sets (for example, pre-training on a synthetic dataset or using special data augmentation schemes to enlarge the number of training samples), it is impossible to make an absolutely fair comparison. However, the analysis is still useful for evaluating the development of state-of-the-art methods in this field and establishing future directions.

Table 6 reports the text detection performance of different methods on a horizontal-text dataset. As shown in Table 6, on the IC13 dataset, the F-measure has increased drastically from 77% ([46]) to 92.1% ([133]). In [133], a supervised pyramid context network (SPCNET) (see Figure 6) is adopted, which achieves better detection results. It can be observed that multiple methods from general object detection and semantic segmentation have been extended to scene text localization, and the current trend is to apply a deep learning framework to train an end-to-end text detector.

Table 7 shows the text detection performance of different methods on irregular-text datasets. The methods of [66,97] achieve relatively high performance. In [97], ContourNet (Figure 7) was proposed to train an accurate arbitrarily-shaped text detection model. In [66], a segmentation-based network was proposed, which sets the thresholds for binarization adaptively using a module named differentiable binarization (Figure 8).

Table 8 shows the text detection performance of different methods on arbitrary quadrilateral text datasets. As shown in Table 8, [11] achieves relatively high performance on the IC15 dataset by applying the spatial binning positions in Position Sensitive ROI (PSROI) pooling (Figure 9). In addition, ContourNet [97] achieves state-of-the-art performance on the MSRA-TD500 (M500) dataset.

Speedup:
Current text detection methods place increasing emphasis on speed and efficiency, which are necessary for real-time scene text detection. As shown in Figures 10 and 11, DB [66] achieves state-of-the-art speed on the MSRA-TD500 and ICDAR 2015 datasets. Specifically, with a ResNet-18 backbone, the DB detector achieves an F-measure of 82.8 at 62 fps on the MSRA-TD500 dataset, and an F-measure of 81 at 55 fps on the ICDAR 2015 dataset.

Conclusions and Discussion
In this paper, we reviewed scene text detection methods proposed in recent years. We comprehensively classified these methods into three types and highlighted the key techniques. Furthermore, we analyzed three types of benchmarks and evaluation protocols. Finally, we reported the results of several representative methods on benchmarks and compared their performance.
As discussed in Sections 3 and 4, in moving from the manual design of text features to feature extraction using deep learning, DL-based STD has significantly improved both the speed and the accuracy of text detection. Deep learning-based methods for scene text detection have emerged with promising results. However, there are still some open challenges in the field.
Benchmark Dataset. Scene text detection frameworks, including deep learning-based STD methods, require large, annotated datasets for training. However, data annotation remains time-consuming and expensive. It is a big challenge to create a very large benchmark dataset, comparable to ImageNet, that covers plentiful scenarios such as multi-scale, multi-lingual, and multi-orientation text.
Real-time Scene Text Detection. Text information in scene images is extremely helpful to people's daily activities. Therefore, applying scene text detection technology to smart terminal devices (e.g., mobile phones, cameras, and assistive devices) is a future development direction. However, most current methods are limited by the performance of smart terminal devices and cannot run in real time while maintaining relatively good detection accuracy. Hence, real-time text detection is another future development direction.
Special Scene Text Detection Methods. Most of the proposed text detection methods demonstrate their performance mainly on a few public datasets, and some of them simply combine existing domain knowledge (e.g., Faster R-CNN, SSD, FCN, RNN, and other pattern recognition techniques) and adjust parameters repeatedly to obtain higher testing performance, which leads to a lack of innovation and deep thinking. As a result, many approaches come from researchers with no specialization in the field of document analysis.

Conflicts of Interest:
The authors declare no conflict of interest.