Scene Text Detection Based on Two-Branch Feature Extraction

Scene text detection refers to locating text regions in a scene image and marking them with text boxes. With the rapid development of the mobile Internet and the growing popularity of mobile devices such as smartphones, research on scene text detection has been highly valued and widely applied. In recent years, with the rise of deep learning represented by convolutional neural networks, research on scene text detection has made new progress. However, scene text detection remains a very challenging task for two reasons. First, images in natural scenes often have complex backgrounds, which can easily interfere with detection. Second, text in natural scenes is highly diverse: horizontal, skewed, straight, and curved text may all be present in the same scene. When convolutional neural networks extract features, convolutional layers with a limited receptive field cannot model global semantic information well. Therefore, this paper proposes a scene text detection algorithm based on two-branch feature extraction. It enlarges the receptive field by means of a residual correction branch (RCB) to obtain contextual information over a larger receptive field. At the same time, to use features more efficiently, a two-branch attention feature fusion (TB-AFF) module is proposed on top of FPN; it combines global and local attention to pinpoint text regions, enhances the sensitivity of the network to text regions, and accurately detects text locations in natural scenes. Several sets of comparative experiments were conducted against current mainstream text detection methods, in all of which the proposed method achieved better results, verifying its effectiveness.


Introduction
Text has become one of the indispensable means of transmitting information in the contemporary world, and all kinds of textual information appear in the scenes where we live. Natural scene text detection aims to locate the text regions in an image through a detection network and express each text region with a polygon bounding box. Accurate detection results benefit a wide range of practical applications, such as instant translation, image retrieval, scene analysis, geographic localization, and license plate recognition, and have attracted much attention in the fields of computer vision and document analysis. In recent years, with the rapid development of convolutional neural networks (CNNs), scene text detection has made great progress [1,2]. Existing CNN-based text detection algorithms can be roughly divided into two categories: regression-based methods and segmentation-based methods.
Regression-based scene text detection algorithms usually represent the text target as a rectangular or quadrilateral box with a specific orientation. Although detection is fast and the accumulation of multi-stage errors is avoided, most existing regression-based methods cannot accurately and effectively solve the text detection problem because of their limited text representations (axis-aligned rectangles, rotated rectangles, or quadrilaterals), especially when used to detect text of arbitrary shape. Figure 1 shows the overall framework of the method proposed in this paper. The backbone network ResNet is modified appropriately, and a residual correction branch is designed to improve the network's feature-extraction ability. Secondly, a more efficient feature fusion module, the Two-Branch Attention Feature Fusion (TB-AFF) module, is adopted.
To demonstrate the effectiveness of the proposed framework, we conducted extensive experiments on three competitive benchmark datasets: ICDAR 2015, Total-Text, and MSRA-TD500. Among these, Total-Text is specifically designed for curved text detection. The experimental results on the MSRA-TD500 and Total-Text datasets show that the method is highly flexible in complex situations (such as multilingual, curved, and arbitrary-shaped text). Specifically, on the Total-Text dataset with arbitrary-shaped text, our model surpasses most state-of-the-art methods by a clear margin and achieves considerable performance. In addition, the proposed framework also performs well on the multi-oriented text dataset ICDAR 2015.
In this paper, the residual correction branch (RCB) and the two-branch attention feature fusion (TB-AFF) module are used for lightweight feature extraction. To compensate for the limited feature-extraction ability of a lightweight network, the RCB is embedded into the backbone network. In addition, the TB-AFF module is proposed to enhance the feature expression of multi-scale scene text and improve detection accuracy.
To sum up, the main contributions of this paper are as follows: (1) A residual correction branch (RCB) is proposed. To compensate for the shortcomings of a lightweight network in feature-extraction ability and receptive field, the RCB is embedded in the backbone network to enhance its feature-extraction ability. It is more efficient than previous methods, has a generic structure, and is less computationally intensive. (2) The proposed two-branch attention feature fusion (TB-AFF) module enhances the feature expression of multi-scale scene text and improves detection accuracy. It can effectively handle scene text of arbitrary shape and improve the performance of scene text detection. (3) We achieve state-of-the-art performance on several benchmarks that include different forms of text instances (oriented, long, multilingual, and curved), demonstrating the superiority of the newly designed modules.

Related Work
In recent years, with the rise of deep learning, various scene text detection algorithms based on neural networks have been proposed one after another. Using convolutional neural networks to learn text features automatically frees researchers from the tedious process of manually designing features, and scene text detection technology has made remarkable progress. At present, deep-learning-based scene text detection methods can be divided into two categories: regression-based methods and segmentation-based methods.
Regression-based scene text detection algorithms are mainly inspired by Faster R-CNN, SSD [7], Mask R-CNN [8], FCIS [9], etc. Usually, text boxes are set in advance, and a convolutional neural network is used to determine whether they overlap with text regions and to adjust their size and position to locate the text accurately. Based on Faster R-CNN, Zhong et al. put forward the DeepText algorithm [10]: the Inception structure is introduced into the RPN network, and candidate text boxes with overlapping areas are removed by voting instead of non-maximum suppression to obtain the final detection result. Tian et al. proposed an algorithm based on Faster R-CNN that performs well on blurred and multilingual text [11]. Shi et al. proposed SegLink, a simple and effective multi-angle text detection algorithm based on SSD [12]. Unlike ordinary objects, text usually appears in irregular shapes with varying aspect ratios. To address this, Liao et al. proposed TextBoxes, an end-to-end natural scene text detection method, in 2017 [13]. Liao et al. then modified the convolution kernels and text boxes of TextBoxes and put forward TextBoxes++ [14]. The network architecture of TextBoxes++ is similar to that of TextBoxes, replacing the global average pooling in the last layer with a convolutional layer and introducing angle prediction to detect text in any direction. Ma et al. proposed a rotated region proposal network (RRPN) based on the Faster R-CNN detection algorithm to solve the problem that RPN can only detect horizontal text [15]. Also based on Faster R-CNN, Jiang et al. improved the RoI pooling layer and prediction head and proposed the Rotational Region CNN (R2CNN) [16]. From an instance-aware perspective, Dai et al. proposed FTSN based on the instance segmentation framework FCIS [9]. Most existing regression-based methods rely on preset text boxes.
However, in natural scenes, the text direction is changeable, and the size and aspect ratio vary dramatically. To make the preset text boxes overlap with the text regions, many methods add boxes with various directions, sizes, and aspect ratios, but this undoubtedly increases the complexity and computation of the method. Segmentation-based approaches draw mainly on semantic segmentation: they treat all pixels in a text bounding box as positive samples, describe text regions with different representations, and then reconstruct text instances through specific post-processing. Their greatest benefit is the ability to extract text of arbitrary shape. Zhou et al. proposed a new text detection algorithm, EAST, based on FCN, in 2017 [17]. Deng et al., inspired by SegLink, proposed to detect scene text by instance segmentation [18]. To address the difficulty of detecting irregularly shaped text, Long et al. proposed TextSnake, which can detect arbitrary-shaped scene text through the text centerline [19]. In [20], Wang et al. proposed the Progressive Scale Expansion Network (PSENet), which uses a breadth-first-traversal approach to progressively expand text kernels and reconstruct text instances. Wang et al. also put forward the Pixel Aggregation Network (PAN), based on [21]. He et al. proposed the Deep Direct Regression (DDR) network to detect multi-oriented scene text [22]. Tian et al. [23] regard each text instance as a cluster and cluster pixels by embedding mapping. TextField uses a deep direction field to link adjacent pixels and generate candidate text parts [24]. Based on pyramid instance segmentation, PMTD discards pixel-by-pixel binary segmentation and combines shape and location information to mitigate boundary discontinuities caused by inaccurate labeling [25].
Classical text detection algorithms are limited by the receptive field of the convolution kernel and thus cannot detect long text well. To address this, Baek et al. [26] proposed the Character Region Awareness for Text Detection (CRAFT) algorithm based on character-probability prediction. Huang [27] proposed the Mask-PAN algorithm, based on Mask R-CNN and the pyramid attention network; the pyramid attention network lets the model focus better on the contextual information of image text for localization and classification. Xie et al. [28] proposed the SPCNET network, which adds an attention mechanism to Mask R-CNN. It introduces a text-context module and an attention mask, allowing the algorithm to better integrate the intermediate features of semantic segmentation with the detection features and improving detection accuracy. DBNet adopts adaptive binarization for each pixel, derives the binarization threshold from network learning, and adds the binarization step to the network for training; a differentiable binarization method is proposed to solve the non-differentiability problem and reduce post-processing. The network runs fast and is largely insensitive to background interference. Based on DBNet, our method makes up for the shortcomings of a lightweight network in feature-extraction ability and receptive field and greatly improves text detection performance.


Residual Correction Branch (RCB)
In this paper, we do not design a complex network architecture to enhance the representation of text features; instead, we improve the backbone network without adjusting the model architecture, thereby improving the performance of the whole network. Therefore, this paper proposes a novel residual correction branch (RCB), shown in Figure 2, to help convolutional networks learn discriminative representations of text features. This module enhances the detection capability of the whole network mainly by increasing its effective receptive field. Specifically, the input X first passes through a conventional convolution layer, a BN layer, and a ReLU activation function, and is then sent to two different branches, x1 and x2, to obtain feature information in different spaces. On the first branch, x1, a simple convolution and BN operation are used to extract features, in order to retain the spatial information of the main branch of the original backbone network; that is, the input feature x1 is convolved to obtain the output feature y1. On the other branch, x2, the input is first downsampled by average pooling, which increases the receptive field of the CNN. The resulting feature map then undergoes convolution and BN, after which nearest-neighbor interpolation upsamples it back to the input size. Finally, the output feature y2 is obtained through a Sigmoid activation function. The x1 and x2 branches extract sufficient text features in parallel. Next, the output features y1 and y2 of the two branches are multiplied element by element and then added to the original input X, and the final output feature Y of the network is obtained through a ReLU activation function.
Thanks to this two-branch design, the receptive field of each spatial location is effectively expanded, so that every pixel carries information from its surrounding area and thus attends to more contextual information. The specific process is as follows. Given the original input X, the branches x1 and x2 are obtained as

x1 = x2 = ReLU(BN(Conv(X))).

Acquisition of output feature y1:

y1 = BN(Conv(x1)).

Given the input branch x2, average pooling with a filter size of 4 × 4 and a stride of 4 is applied:

d = AvgPool_4(x2),

where AvgPool_r denotes the average pooling function that downsamples by a factor of r. Through the average pooling operation, the image scale is reduced and the receptive field of the network is expanded. This helps the CNN generate more discriminative feature expressions and extract richer information.
Acquisition of output feature y2:

y2 = Sigmoid(NN(BN(Conv(d)))),

where NN(·) is nearest-neighbor interpolation upsampling; the Sigmoid activation increases the nonlinearity of the model and thus its ability to fit nonlinear relationships in the samples. Acquisition of output feature Y:

Y = ReLU(X + y1 ⊗ y2),

where ⊗ denotes element-wise multiplication. The module has two parallel branches that operate independently, and their outputs are combined as the final output of the network. Therefore, the whole network obtains a larger receptive field, fully captures the contextual information of text features in the image, and helps cover the whole text region well. On the other hand, the residual correction branch does not need to collect global context over the whole image; it considers only the context around each spatial position, which to some extent avoids pollution from irrelevant (non-text) areas and allows text regions to be located accurately. Moreover, as can be seen from Figure 2, RCB requires few additional learnable parameters, has a generic structure, can easily be embedded into modern classification networks, is suitable for various tasks, and is convenient to use.
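The two-branch computation above can be sketched in a few lines of NumPy. The Conv+BN blocks are replaced by identity placeholders (a hypothetical stand-in, since the learned filters are not specified here), so the sketch only illustrates the pooling/upsampling/gating structure of RCB:

```python
import numpy as np

def avg_pool(x, r=4):
    # AvgPool_r: average pooling with an r x r window and stride r.
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))

def nn_upsample(x, r=4):
    # NN(.): nearest-neighbor interpolation back to the original size.
    return x.repeat(r, axis=1).repeat(r, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcb(x, conv_main, conv_branch):
    # conv_main / conv_branch stand in for the Conv+BN blocks of the two branches.
    y1 = conv_main(x)                            # branch x1: keep spatial detail
    d = avg_pool(x, 4)                           # branch x2: enlarge receptive field
    y2 = sigmoid(nn_upsample(conv_branch(d), 4)) # restore size, gate in [0, 1]
    return np.maximum(x + y1 * y2, 0.0)          # element-wise gate + residual + ReLU

x = np.random.rand(8, 16, 16).astype(np.float32)
identity = lambda t: t                           # placeholder for Conv+BN
y = rcb(x, identity, identity)
print(y.shape)  # (8, 16, 16)
```

The output keeps the input shape, which is what allows RCB to be dropped into an existing backbone stage without changing the surrounding architecture.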

Two-Branch Attention Feature Fusion (TB-AFF) Module
As we all know, attention plays a vital role in computer vision. Since text has its own texture features, it is worth considering what kind of attention module should be designed to match them. The attention mechanism in deep learning originated from the human visual attention mechanism [29,30]. For example, SENet compresses global spatial information into channel descriptors to capture channel correlations [31]. Its attention is computed by averaging the pixel values of each channel and then, after a series of operations, normalizing them with the Sigmoid function. SENet can be specifically expressed as follows.
Given an intermediate feature X ∈ R^(C×H×W) with C channels and feature maps of size H × W, the channel attention weight w ∈ R^C in SENet can be calculated as

w = σ(B(W2 δ(B(W1 g(X))))),

where g(X) ∈ R^C is the global context feature obtained by global average pooling (GAP),

g(X) = (1/(H·W)) Σ_{i=1..H} Σ_{j=1..W} X[:, i, j],

δ denotes the rectified linear unit (ReLU), B denotes batch normalization (BN), and σ is the Sigmoid function. This is achieved with two fully connected (FC) layers: W1 ∈ R^((C/r)×C) is a dimension-reduction layer, W2 ∈ R^(C×(C/r)) is a dimension-increasing layer, and r is the channel reduction ratio. This channel attention squeezes each feature map of size H × W into a scalar. Such an extremely coarse description tends to emphasize the global distribution of large targets and is effective for large-scale targets; for small-scale targets, however, it does not work well, and small targets are often ignored. Compared with the whole image, text occupies a small proportion and thus falls under small-target detection, so global channel attention may not be the best choice. Therefore, this paper proposes a multi-scale attention fusion scheme that uses attention to fuse text features. In a feature pyramid network (FPN), the deeper the features, the more channels they have; during top-down fusion, the channels of deeper feature maps are reduced, which inevitably loses feature information, with higher-level features losing more. To retain more text feature information, we propose a unified and general scheme based on the FPN structure, namely, the Two-Branch Attention Feature Fusion (TB-AFF) module, a multi-scale feature extraction module. This paper mainly discusses how to fuse features of different semantics or scales more effectively. To the best of our knowledge, the two-branch attention feature fusion (TB-AFF) module has not been discussed before.
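As a concrete illustration of the SENet formula above, here is a NumPy sketch of channel attention; the batch-norm terms are omitted for brevity, and `w1`/`w2` are random stand-ins for the two FC layers:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(x, w1, w2):
    # x: (C, H, W); w1: (C//r, C) reduces channels, w2: (C, C//r) restores them.
    g = x.mean(axis=(1, 2))          # g(X): global average pooling -> (C,)
    w = sigmoid(w2 @ relu(w1 @ g))   # sigma(W2 . ReLU(W1 . g(X))) -> (C,)
    return x * w[:, None, None]      # reweight each channel by its scalar

C, r = 16, 4
x = np.random.rand(C, 8, 8)
w1 = np.random.randn(C // r, C) * 0.1
w2 = np.random.randn(C, C // r) * 0.1
out = se_channel_attention(x, w1, w2)
print(out.shape)  # (16, 8, 8)
```

Note that the spatial map of each channel is collapsed to a single scalar weight, which is exactly the coarseness the paragraph above criticizes for small text targets.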
By aggregating multi-scale text feature information along the channel dimension and combining local and global attention, this module emphasizes larger targets with a more global distribution and highlights smaller targets with a more local distribution, thus alleviating the problems caused by changes in text scale, improving the representation of text features, and further improving detection performance. The scheme of the proposed architecture is shown in Figure 3.
In this part, we describe the proposed two-branch attention feature fusion (TB-AFF) module in detail. The key idea of TB-AFF is to extract global and local attention weights by changing the dimension size and using two branches of different scales, thereby achieving local and global attention at multiple scales. The TB-AFF module structure is relatively simple. The global feature branch extracts global feature attention using global average pooling and pointwise convolution (ordinary convolution with a 1 × 1 kernel). The local feature branch also uses pointwise convolution to extract local feature attention, in order to preserve details. SENet uses only global channel attention, while the proposed TB-AFF module also aggregates local feature attention, which helps the whole network suppress background clutter and irrelevant areas and is more conducive to detecting small targets.
In this paper, the feature layers C4 and C5 are taken as examples. We first perform the initial feature fusion of the two input features C4 and C5, i.e., the original FPN operation. Given the two inputs C4 and C5, the higher-level feature C5 is upsampled by linear interpolation and then fused with the next-level feature C4, so that feature information from different semantic levels is fused in each layer of the feature map. In the above process, the initial fusion is just a simple element-wise addition; more detailed operations follow.
Given the feature X obtained after the initial fusion of inputs C4 and C5, the one-dimensional attention g(X) ∈ R^(C×1×1) on the global feature branch is extracted by pointwise convolution:

g(X) = PWConv(PWConv(Avg(X))),

where Avg stands for global average pooling and PWConv stands for pointwise convolution. The global feature branch computes global attention information, because global average pooling makes the obtained features contain global information. Pointwise convolution is used here as well: by gradually compressing the channels, it assigns more weight to text areas with high response. Similarly, the three-dimensional attention L(X) ∈ R^(C×H×W) on the local feature branch is also extracted by pointwise convolution:

L(X) = PWConv(PWConv(X)),

where PWConv stands for pointwise convolution. L(X) has the same shape as the input feature X and can retain and highlight details in low-level features.
The global one-dimensional attention g(X) and the local three-dimensional attention L(X) are then combined into the attention feature M ∈ R^(C×H×W):

M = g(X) ⊕ L(X),

where ⊕ represents the broadcast addition operation. Broadcasting is needed because the global one-dimensional attention uses global average pooling and its spatial shape is 1 × 1, while the local three-dimensional attention keeps the same spatial dimensions H × W as the feature X. The obtained feature M is activated by the Sigmoid function, and finally an element-wise multiplication is performed with the smaller feature layer of the original input (here, C4):

P4 = C4 ⊗ σ(M).

Using the Sigmoid activation keeps the value of each element in [0, 1], which enhances useful information and suppresses useless information.
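A minimal NumPy sketch of this fusion, with the 1 × 1 convolutions written as channel-wise matrix multiplications and randomly initialized stand-in weights (the real module learns these):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pwconv(x, w):
    # Pointwise (1x1) convolution as a matmul over the channel axis.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def tb_aff(c4, c5_up, wg1, wg2, wl1, wl2):
    # c4: lower-level feature; c5_up: higher-level feature already upsampled
    # to c4's size. wg*/wl* are stand-in 1x1-conv weights for the two branches.
    x = c4 + c5_up                                    # initial FPN-style fusion
    gap = x.mean(axis=(1, 2))[:, None, None]          # Avg(X) -> (C, 1, 1)
    g = pwconv(pwconv(gap, wg1), wg2)                 # global attention (C, 1, 1)
    l = pwconv(pwconv(x, wl1), wl2)                   # local attention (C, H, W)
    m = g + l                                         # broadcast addition
    return c4 * sigmoid(m)                            # gate the lower-level map

C, H, W, r = 16, 8, 8, 4
c4 = np.random.rand(C, H, W)
c5_up = np.random.rand(C, H, W)
wg1 = wl1 = np.random.randn(C // r, C) * 0.1
wg2 = wl2 = np.random.randn(C, C // r) * 0.1
out = tb_aff(c4, c5_up, wg1, wg2, wl1, wl2)
print(out.shape)  # (16, 8, 8)
```

The (C, 1, 1) global term and the (C, H, W) local term add via NumPy broadcasting, mirroring the ⊕ operation above; the output keeps C4's shape, consistent with P4 = C4.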
The original input of the network is 640 × 640 × 3, and after the residual correction branch (RCB), five feature maps (taking ResNet18 as an example) are obtained, namely Y1, Y2, Y3, Y4, and Y5. Because of the large size of the feature map Y1, the parameters would increase dramatically and make the network overly complex. Therefore, we discard Y1 and choose the last four feature maps Y2–Y5, whose sizes are shown in Table 1. To reduce the parameter count and complexity of the network model, we reduce the dimension of the feature maps Y2–Y5 and change the number of channels to 64, obtaining the inputs of the two-branch attention feature fusion (TB-AFF) module, namely C2–C5. The purpose of TB-AFF is to introduce richer details and global information to high-level and low-level features, so that the extracted features better highlight both the local and global characteristics of text instances, thus improving the information representation ability. Therefore, we keep the input and output dimensions of the TB-AFF module consistent, that is, P2 = C2, P3 = C3, P4 = C4, and P5 = C5. To sum up, the TB-AFF module combines local and global feature information and uses feature maps of different scales to extract attention weights and refine text positions. The main contributions are as follows: (1) The TB-AFF module adjusts the scope of attention by pointwise convolution instead of convolution kernels of different sizes; pointwise convolution also keeps TB-AFF as lightweight as possible.
(2) The TB-AFF module is not in the backbone network but is built on the feature pyramid network (FPN). It aggregates global and local feature information, strengthens the connection with contextual feature information, and refines the text regions.
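Assuming the standard ResNet18 strides of 4, 8, 16, and 32 for Y2–Y5 (the exact sizes come from Table 1, so these figures are an assumption), the shape bookkeeping for a 640 × 640 input can be sketched as:

```python
# Backbone output channels below are the standard ResNet18 values (assumed).
H = W = 640
strides = {'Y2': 4, 'Y3': 8, 'Y4': 16, 'Y5': 32}
backbone_channels = {'Y2': 64, 'Y3': 128, 'Y4': 256, 'Y5': 512}

# 1x1 convs reduce every map to 64 channels, giving the TB-AFF inputs C2..C5.
for name, s in strides.items():
    c_name = name.replace('Y', 'C')
    print(f"{name}: {backbone_channels[name]}x{H // s}x{W // s}"
          f" -> {c_name}: 64x{H // s}x{W // s}")
```

Keeping P2–P5 at the same 64-channel resolution as C2–C5 is what lets TB-AFF replace the plain FPN addition without touching the rest of the pipeline.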

Differentiable Binary Module
Given the probability map P ∈ R^(H×W) generated by the segmentation network, where H and W represent the height and width of the input image, respectively, a binarization function is essential to transform the probability map into a binary map. The standard binarization function is shown in Formula (11), and a pixel with a value of 1 is considered part of an effective text area:

B(i,j) = 1 if P(i,j) ≥ t, and B(i,j) = 0 otherwise, (11)
where t is the set threshold, and (i, j) represents the coordinate point. Equation (11) is a standard binarization function, which is non-differentiable, so it cannot be optimized jointly with the segmentation network. In order to solve the problem that the binarization function is not differentiable, this paper uses Formula (12) to perform differentiable binarization:

B̂(i,j) = 1 / (1 + e^(-k(P(i,j) - T(i,j)))), (12)

where B̂ is an approximate binary map, T is an adaptive threshold map learned by the network, and k is an amplification coefficient. In the training process, the role of k is to increase the gradient during back propagation, which benefits the correction of most prediction error areas and is conducive to generating more distinct predictions. In this paper, k is set to 50. Specifically, the probability map (P) and the threshold map (T) are predicted from the features, and the approximate binary map is obtained by combining the probability map and the threshold map through the differentiable binarization module, so that the threshold at each position of the image can be predicted adaptively. Finally, the text detection box is obtained from the approximate binary map. The structure of differentiable binarization is shown in Figure 4. The green path represents the standard binarization process, and the red path is the differentiable binarization used in this paper.
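The two binarization schemes can be compared numerically with a small sketch; the 2 × 2 example values are illustrative:

```python
import numpy as np

def standard_binarize(p, t=0.5):
    # Formula (11): hard threshold, non-differentiable.
    return (p >= t).astype(np.float32)

def differentiable_binarize(p, t, k=50.0):
    # Formula (12): B = 1 / (1 + exp(-k (P - T))), differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-k * (p - t)))

p = np.array([[0.1, 0.49], [0.51, 0.9]])  # toy probability map
t = np.full_like(p, 0.5)                   # constant threshold map
hard = standard_binarize(p)
soft = differentiable_binarize(p, t)
```

With k = 50, pixels far from the threshold saturate toward 0 or 1, while pixels near the threshold (0.49, 0.51) stay in between, which is what keeps the gradient useful during training.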

Loss Function
The loss function plays a vital role in a deep neural network. Here, we use the L1 loss function and the binary cross-entropy loss function to optimize our network. The loss function in the training process consists of three parts: the segmentation map loss Ls, the binary map loss Lb, and the threshold map loss Lt:

L = Ls + α × Lb + β × Lt,

where α and β are weight parameters; α is set to 1, and β is set to 10. Among them, the binary cross-entropy loss function is used for the probability map loss Ls and the binary map loss Lb, and its formula is as follows. Hard negative mining is also used to overcome the imbalance of positive and negative samples:

Ls = Lb = Σ i∈Sl [-yi log xi - (1 - yi) log(1 - xi)].
Among them, Sl represents the sampled set of pixels, with a ratio of positive to negative samples of 1:3. For the loss Lt of the adaptive threshold map, the L1 loss function is adopted, and its formula is:

Lt = Σ i∈Rd |yi* - xi*|,

where Rd is the set of pixel indexes inside this area, and y* is the label of the adaptive threshold map.
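A minimal NumPy sketch of the three loss terms, assuming simple flattened maps; `hard_negative_bce`, `threshold_l1`, and `total_loss` are hypothetical helper names, and the normalization details are assumptions rather than the authors' exact implementation:

```python
import numpy as np

def bce(pred, gt, eps=1e-6):
    # Per-pixel binary cross entropy.
    pred = np.clip(pred, eps, 1 - eps)
    return -(gt * np.log(pred) + (1 - gt) * np.log(1 - pred))

def hard_negative_bce(pred, gt, ratio=3):
    # Keep all positives plus the hardest negatives at a 1:3 ratio (Sl).
    loss = bce(pred, gt)
    pos = gt > 0.5
    neg_losses = np.sort(loss[~pos])[::-1]
    n_neg = min(neg_losses.size, int(pos.sum()) * ratio)
    return (loss[pos].sum() + neg_losses[:n_neg].sum()) / max(pos.sum() + n_neg, 1)

def threshold_l1(pred_t, gt_t, mask):
    # L1 loss over pixels inside the region Rd (given here as a boolean mask).
    return np.abs(pred_t - gt_t)[mask].sum() / max(mask.sum(), 1)

def total_loss(ls, lb, lt, alpha=1.0, beta=10.0):
    # L = Ls + alpha * Lb + beta * Lt
    return ls + alpha * lb + beta * lt

pred = np.array([0.9, 0.2, 0.8, 0.1])
gt = np.array([1.0, 0.0, 1.0, 0.0])
ls = hard_negative_bce(pred, gt)
lt = threshold_l1(np.array([0.4]), np.array([0.5]), np.array([True]))
loss = total_loss(ls, ls, lt)
```

With α = 1 and β = 10, the threshold-map term dominates unless the segmentation terms are already large, matching the weighting described above.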

Experimental Results and Analysis
In this paper, three challenging public datasets are tested, namely, the multidirectional text dataset ICDAR2015, curved text dataset Total-Text, and multilingual text dataset MSRA-TD500. The visualization results of this method on different types of text examples are shown in Figure 5, including curved texts (e) and (f), multi-directional texts (a) and (b), and multilingual texts (c) and (d). For each cell in Figure 5, the second, third, and fourth columns are probability graph, threshold graph, and binarization graph, respectively.
The ICDAR 2015 dataset contains 1000 training images and 500 test images. All the images are taken automatically by the camera, and the shooting angle is not adjusted, so it is very random, and there is tilt and blur. Therefore, the text may appear in any direction and any position. At the same time, the text appears randomly in a certain position in the image, and the dataset has not been adjusted to improve the image quality, in order to increase the difficulty of detection.
The Total-Text dataset is a public dataset used to detect bent texts. It contains bent texts of commercial signs in real scenes, and the language to be detected is English. There are 1555 pictures, 1255 training images, and 300 test images.
The MSRA-TD500 dataset belongs to a multi-language and multi-category dataset, including Chinese and English, and contains 500 pictures, 300 training images, and 200 test images. These images are mainly taken indoors (offices and shopping malls) and outdoors (streets) with cameras. Indoor images include signs, house numbers, and warning signs. Outdoor images include guide boards and billboards with complex backgrounds.

Training Configuration
In this paper, Python 3.7 is used as the programming language, and PyTorch 1.5 is used as the deep learning framework. All the experiments were carried out on a TITAN RTX. The initial learning rate is set to 0.007. The training process includes two steps: first, we use the SynthText dataset [32] to train the network for 100,000 iterations, and then we fine-tune the model on each benchmark real dataset for 1200 epochs. We only use the official training images of each dataset to train our model, with a weight decay of 10^-4 and a momentum of 0.9. The optimizer used for training is Adam [33]. The training batch size is set to 16. The training data are augmented by random rotation with an angle in the range of (−10°, 10°), random cropping, and random flipping, and all the images are resized to 640 × 640.
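The hyper-parameters above can be collected into a single configuration sketch; the dictionary keys are illustrative, not the authors' actual configuration file:

```python
# Hedged summary of the reported training setup; key names are assumptions.
config = {
    "framework": "PyTorch 1.5",
    "device": "TITAN RTX",
    "optimizer": "Adam",
    "initial_lr": 0.007,
    "weight_decay": 1e-4,
    "momentum": 0.9,
    "batch_size": 16,
    "pretrain_iters": 100_000,        # on SynthText
    "finetune_epochs": 1200,          # on each benchmark dataset
    "input_size": (640, 640),
    "rotation_range_deg": (-10, 10),  # plus random cropping and flipping
}
```

Keeping these values in one place makes it easy to reproduce the two-stage pretrain/fine-tune schedule.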
It is worth noting that blurred text marked "ignored" is ignored in the training process. In the preprocessing stage of the network, the labels of the probability map and the threshold map are created based on the labels of the training datasets. Since small text areas are not easy to detect, text areas that are too small are ignored in the process of creating labels (for example, when the minimum side length of the smallest enclosing rectangle of the text area is less than 3 or the polygon area is less than 1). Therefore, during the training process, a small part of the text marked "ignored" is discarded.
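The small-text filter described above can be sketched as follows; `should_ignore` is a hypothetical helper, and the axis-aligned bounding box is a simplification of the minimum enclosing rectangle, while the thresholds (side < 3 px, area < 1 px²) come from the description above:

```python
import numpy as np

def should_ignore(polygon, min_side=3, min_area=1):
    # polygon: (N, 2) array of (x, y) vertices.
    xs, ys = polygon[:, 0], polygon[:, 1]
    w, h = xs.max() - xs.min(), ys.max() - ys.min()
    # Shoelace formula for the polygon area.
    area = 0.5 * abs(np.dot(xs, np.roll(ys, 1)) - np.dot(ys, np.roll(xs, 1)))
    return min(w, h) < min_side or area < min_area

tiny = np.array([[0, 0], [2, 0], [2, 2], [0, 2]], float)      # 2 x 2 box
normal = np.array([[0, 0], [40, 0], [40, 10], [0, 10]], float)  # 40 x 10 box
```

Regions flagged by this check would simply be skipped when generating the probability-map and threshold-map labels.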
Since test images of different scales have a great influence on the detection results, the aspect ratio of the test images is kept in the inference stage, and the size of the input images is adjusted by setting an appropriate height for each dataset.
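The aspect-ratio-preserving resize can be sketched as below; rounding both sides to a multiple of 32 (a common stride requirement for FPN-style backbones) is an assumption, as is the `resize_keep_aspect` helper:

```python
def resize_keep_aspect(h, w, target_height, stride=32):
    # Scale so the height matches the per-dataset setting, then round both
    # sides to a multiple of the network stride (32 here is an assumption).
    scale = target_height / h
    new_h = max(stride, int(round(h * scale / stride)) * stride)
    new_w = max(stride, int(round(w * scale / stride)) * stride)
    return new_h, new_w

# e.g. a 720 x 1280 ICDAR2015-style frame resized to height 736:
size = resize_keep_aspect(720, 1280, 736)
```

The width follows the height scaling, so the text's aspect ratio is preserved up to the stride rounding.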

Experiment and Discussion
In order to better validate each module proposed in this paper, we carried out detailed ablation experiments on the multi-directional text dataset ICDAR2015, curved text dataset Total-Text, and multilingual text dataset MSRA-TD500. Three main performance parameters, precision (P), recall (R), and the comprehensive evaluation index (F), are considered to evaluate the detection performance of the model, which proves the effectiveness of the residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) modules proposed by us. All experiments were conducted in the same environment, and a "√" indicates that the corresponding module was used. The results are listed in Tables 2-4. As can be seen from Table 2, on the ICDAR2015 dataset, after adding the RCB module, the recall rate and F value exceed the original DB model results by about 4.68% and 1.56%, respectively. After adding the TB-AFF module, the recall rate and F value exceed the original DB model results by about 4.82% and 2.03%, respectively. By adding both modules, the method achieves a 79.48% recall rate, 87.26% precision, and 83.19% F value in natural scene text image detection, which ensures the integrity of text information in the process of text detection. Compared with the results of the original model, the recall rate and F value increased by 5.68% and 2.39%, respectively, under the same precision. Therefore, the detection performance of the network combining these two modules is better than that of the network using the RCB module or the TB-AFF module alone.
As can be seen from Table 3, on the Total-Text dataset, compared with our local reproduction of the original DB model, the introduction of the RCB module improves the recall rate and F value by about 4.56% and 2.12%, respectively. After the introduction of the TB-AFF module, the recall rate and F value increased by about 5.33% and 2.10%, respectively. By introducing both modules, this method achieves a 78.95% recall rate, 87.37% precision, and 82.95% F value in natural scene text image detection. Compared with the results of the original model, the recall rate and F value are improved by 5.15% and 2.15%, respectively, under the same precision. Therefore, the detection performance of the network combining these two modules is better than that of the network using the RCB module or TB-AFF module alone.
As can be seen from Table 4, on the MSRA-TD500 dataset, compared with our local reproduction of the original DB model, after the introduction of the RCB module, the recall rate and F value are increased by about 7.82% and 3.35%, respectively. After the introduction of the TB-AFF module, the recall rate and F value increased by about 6.78% and 2.95%, respectively. By introducing both modules, the method achieves an 83.33% recall rate, 88.02% precision, and 85.61% F value in natural scene text image detection. Compared with the results of the original model, the recall rate and F value are increased by 9.53% and 4.81%, respectively, under the same precision. Therefore, the detection performance of the network combining these two modules is better than that of the network using the RCB module or TB-AFF module alone.
It can be seen from the above observation that in the residual correction branch (RCB) module, we introduce the average pool down-sampling operation to establish the connection between positions in the whole pool window. The experimental results show that using the 18-layer backbone network and the proposed RCB can greatly improve the baseline performance, and the results are obviously improved. This phenomenon shows that the network with a residual correction branch can generate richer and more distinctive feature representations than the original ordinary convolution, which is helpful to find more complete target objects and can be better confined to semantic areas, even though their sizes are small. At the same time, in order to overcome the semantic and scale inconsistency between input features, our two-branch attention feature fusion (TB-AFF) module combines local feature information with global feature information, which can capture context information better. The experimental results show that the multi-scale attention fusion network (MSAFN) with the dual-branch attention feature fusion (TB-AFF) module can improve the performance of advanced networks with a small parameter budget, which indicates that people should pay attention to feature fusion in deep neural networks, and a proper attention mechanism of feature fusion may produce better results. It is further explained that instead of blindly increasing the depth of the network, it is better to pay more attention to the quality of feature fusion.
We carried out relevant experiments on the downsampling factor r, and the experimental results are shown in Table 5. We find that the smaller the value of r, the greater the FLOPs complexity. When r = 3, 4, 5, the network FLOPs are similar, and when r = 4, the F value reaches the relatively optimal result. In addition, we also verified this on Ours-ResNet-50 and found that when r = 4, the FLOPs and F value are well balanced. The experiments in Table 5 were carried out on the improved ResNet18, and the FLOPs were calculated with an input of 3 × 640 × 640. Table 5. Ablation experiment of the sampling factor r.
In natural scene text detection, text mostly appears as characters or text lines. The residual correction branch (RCB) designed by us increases the receptive field of the network by downsampling the feature map, thus modeling the context information around each spatial location and making the network detect the text information in the image more accurately and completely. The ablation experiment results also verify this point, which shows that the RCB proposed by us is effective. Figure 6 shows the visualization results of the baseline and the method in this paper. For each unit in the figure, the second column is the probability map, the third column is the threshold map, and the fourth column is the binary map. From the experimental results, the residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) modules play an important role in text feature extraction during model training, effectively enhancing the model's attention to text features, making effective use of the extracted text features, and improving the detection accuracy of scene text to some extent.
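A minimal sketch of the pooling-based receptive-field enlargement in the RCB, assuming non-overlapping average pooling with factor r followed by nearest-neighbor upsampling and a residual addition; the real branch also contains convolutions that are omitted here:

```python
import numpy as np

def avg_pool(x, r):
    # Non-overlapping r x r average pooling; assumes H and W divisible by r.
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))

def upsample_nearest(x, r):
    # Restore the original spatial size by repeating each pooled value.
    return x.repeat(r, axis=1).repeat(r, axis=2)

def residual_correction(x, r=4):
    # Hypothetical RCB sketch: pool to enlarge the receptive field, then
    # upsample and add back to the identity path as a correction term.
    return x + upsample_nearest(avg_pool(x, r), r)

x = np.random.randn(64, 40, 40)
y = residual_correction(x, r=4)
```

Each output position now mixes in the average of its r × r pooling window, which is why larger r trades spatial detail for context while reducing the cost of the pooled branch.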

We compare the proposed method with other advanced methods on different datasets, including the multi-directional text dataset ICDAR2015, curved text dataset Total-Text, and multilingual text dataset MSRA-TD500. The experimental results are shown in Tables 6-8. Table 6. Test results on the curved text dataset. The value in brackets refers to the height of the input image. "*" means multi-scale test. "MTS" and "PSE" are the abbreviations for Mask TextSpotter and PSENet.

The algorithm in this paper is compared with other algorithms on the Total-Text curved text dataset, and the results are shown in Table 6. Our model outperforms segmentation-based algorithms such as the TextSnake algorithm, PSENet algorithm, and TextField algorithm in all three evaluation indexes. Ours-ResNet-18 (800 × 800) achieved 78.95% recall, 87.37% precision, and an 82.95% F value, which surpassed the original model DB-ResNet-18 (800 × 800) by about 0.67%, 3.55%, and 2.25%, respectively. Ours-ResNet-50 (800 × 800) achieved an 82.19% recall rate, 88.06% precision, and an 85.03% F value, with the recall rate and F value about 3.76% and 3.79% higher than those of the original model DB-ResNet-50 (800 × 800), respectively. The above experimental results show that this model can adapt to the detection of curved text of any shape, and, in most cases, the method proposed in this paper is clearly superior to other methods.
We also tested the parameters and complexity of the other models in Table 6, and the test results are shown in Table 6. As can be seen from Table 6, Ours-ResNet-18 (800 × 800) adds only a small number of parameters and little complexity compared with the baseline while achieving a performance improvement of 2.25%. For Ours-ResNet-50 (800 × 800), although the parameters and complexity of the model have increased, the performance has improved by about 4%. Compared with PSE-1s, our model needs fewer parameters and less complexity to achieve better performance.
On the multidirectional text dataset ICDAR2015, the comparison results between our algorithm and other algorithms are shown in Table 7. In addition, the model in this paper is superior to regression-based algorithms such as the RRD (rotation-sensitive regression detector) algorithm in the evaluation indexes. Ours-ResNet-18 (1280 × 736) outperforms the EAST algorithm by about 3.66%, 5.98%, and 4.99% in precision, recall, and F value, respectively. The Corner algorithm may predict two adjacent texts as one text instance, resulting in inaccurate detection [34]. The SPN (Short Path Network) algorithm has poor robustness for curved text examples. When the candidate region predicted in the first stage only contains a part of the text instance, the SRPN (Scale-based Region Proposal Network) algorithm cannot correctly predict the boundary of the whole text instance in the second stage [35]. Compared with the EAST, Corner, SPN, and SRPN algorithms, our model makes full use of semantic information to improve the accuracy of text pixel prediction and classification, reduces the interference of background pixels on small-scale texts, and makes use of rich feature information to improve the localization ability for text examples.
On the long-text dataset MSRA-TD500, the comparison results between our algorithm and other algorithms are shown in Table 8. Ours-ResNet-18 (736 × 736) outperforms the segmentation-based TextSnake algorithm and CRAFT algorithm in all three evaluation indexes. Figure 7 shows the visualization results of our method and the original DBNet on different types of text examples. It is worth noting that the images here are randomly selected from the three datasets, which better demonstrates the robustness of our model. In Figure 7b,c, the baseline mistakenly detects non-text areas as text areas; compared with the baseline, our method avoids these false detections. As for Figure 7d, comparing the baseline and ours, the baseline missed a part of the text (i.e., "1") in the figure, while our method can detect it. For Figure 7e, the baseline missed the middle English text, but our method can accurately detect it. For Figure 7f, the baseline detects "COFFEE" as two separate pieces of text, but "COFFEE" actually carries a single piece of semantic information and should be detected as a whole text area, and our method detects it as such.
The above results show that the proposed algorithm improves the detection ability on the multi-directional text dataset ICDAR2015, curved text dataset Total-Text, and multilingual text dataset MSRA-TD500. Our network performs very well on natural scene text detection datasets, with good precision, recall rate, and F value, yielding a more efficient network. Experiments show that the residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) modules are very important for text feature extraction and location information enhancement, and they can improve the detection accuracy of the original algorithm without losing detection efficiency. At the same time, in various challenging scenes, such as uneven lighting, low resolution, and complex backgrounds, this model can effectively deal with drastic scale changes of text, effectively improve the effect of text detection in natural scenes, and accurately detect scene text, which is inseparable from our proposed network model.

Discussion and Conclusions
In this paper, based on the ResNet and FPN networks, a scene text detection algorithm based on two-branch feature extraction is proposed. The residual correction branch (RCB) and two-branch attention feature fusion (TB-AFF) methods are used for lightweight feature extraction. In order to make up for the deficiencies in the feature-extraction ability and receptive field of a lightweight network, embedding the residual correction branch (RCB) into the backbone network to enhance its feature-extraction ability helps to locate the text area more accurately, without including too much of the background, even at a low network depth. In addition, this paper also proposes a two-branch attention feature fusion (TB-AFF) module, which is used to enhance the feature expression of multiscale scene texts. It enables the network model to extract features more efficiently, pay more attention to label-related targets, and improve detection accuracy; it can effectively improve existing models and demonstrates good universality.
However, it is worth noting that exploring the best architecture settings is beyond the scope of this paper. This paper only makes a preliminary study on how to improve convolutional neural networks. We encourage readers to further study and design more effective structures and to provide a different perspective on network-architecture design for computer vision. In the future, we will further optimize the structure of the segmentation network and study a better network model, to reduce the complexity of the model, shorten the training time, improve the overall performance of the algorithm, and improve the accuracy of the deep learning model. Data Availability Statement: Total-Text dataset: https://github.com/cs-chan/Total-Text-Dataset (accessed on 10 January 2022). MSRA-TD500 dataset: MSRA Text Detection 500 Database (MSRA-TD500)-TC11 (iapr-tc11.org) (accessed on 10 January 2022). ICDAR2015 dataset: Tasks-Incidental Scene Text-Robust Reading Competition (uab.es) (accessed on 10 January 2022).