CSFF-Net: Scene Text Detection Based on Cross-Scale Feature Fusion

Abstract: In recent years, methods for detecting text in real scenes have made significant progress with the rise of neural networks. However, owing to the limited receptive field of convolutional neural networks (CNNs) and the simple representation of text by rectangular bounding boxes, previous methods may be insufficient for more challenging text instances. To solve this problem, this paper proposes a scene text detection network based on cross-scale feature fusion (CSFF-Net). The framework is built on the lightweight backbone network ResNet, and feature learning is enhanced by embedding a depth weighted convolution module (DWCM) while retaining the original feature information extracted by the CNN. A 3D-Attention module is also introduced to merge the context information of adjacent areas and thereby refine the features at each spatial size. In addition, because the Feature Pyramid Network (FPN) processes cross-layer information flow by simple element-wise addition and therefore cannot fully resolve the interdependence problem, this paper introduces a Cross-Level Feature Fusion Module (CLFFM) on top of FPN; the resulting network is called the Cross-Level Feature Pyramid Network (Cross-Level FPN). The proposed CLFFM handles cross-layer information flow better and outputs detailed feature information, thus improving the accuracy of text region detection. Compared with the original network framework, the proposed framework performs better on text images of complex scenes, and extensive experiments on three challenging datasets validate the effectiveness of our approach.


Introduction
Text has become one of the essential means of conveying information in the contemporary world, and a wide variety of textual information appears in the scenes we live in. Detecting text in the natural environment is the process of locating text regions in an image with a detection network and representing them with polygonal bounding boxes. Accurate detection results benefit a wide range of practical applications, such as instant translation, image retrieval, scene analysis, geographic location and license plate recognition, and have aroused strong interest in the computer vision and document analysis communities. Existing CNN-based text detection algorithms [1,2] can be divided into roughly two categories: regression-based methods and segmentation-based methods.
For regression-based scene text detection algorithms [3][4][5][6][7][8][9][10][11][12], text objects are usually represented as rectangles or quadrilaterals with a certain orientation. Although these methods are fast and avoid the errors that accumulate over multiple stages, most existing regression-based methods cannot handle text detection accurately and efficiently because of the limited form of the text representation (axis-aligned rectangles, rotated rectangles or quadrilaterals). In particular, they do not perform well on curved text in datasets such as Total-Text [13], which is very unfavorable to the subsequent text recognition step in a complete optical character recognition engine.
On the other hand, segmentation-based scene text detection algorithms [14][15][16][17][18][19][20] locate text instances by classifying pixels. Recent approaches have made significant improvements in scene text detection, and the focus has shifted from horizontal text to multi-oriented text and more complex forms such as curved text. Two challenges remain when detecting arbitrarily shaped scene text. They arise from specific properties of scene text, such as significant variations in color, scale, orientation, aspect ratio and shape, which clearly distinguish it from common target objects, and from properties of natural images, such as the degree of blur and the intensity of illumination.
The first challenge is to refine features. First, a network formed by stacking standard convolutional layers [21,22] lacks sufficient capacity to extract and store high-level semantic information: the entire network learns through similar patterns, so feature extraction is incomplete. Second, under complex background conditions, the limited receptive field of the CNN prevents the text information in the image from being used effectively, so arbitrarily shaped text cannot be localized accurately. To address these problems, we introduce two modules: the depth weighted convolution module and the 3D-Attention module. They improve scene text detection performance by increasing the depth of the CNN model, which gives the network more parameters and more layers with which to learn more complex feature information.
The second challenge is the large scale variation in scene text. First, the scale variation in scene text is much larger than that of general target objects, which makes it hard for a CNN to learn from the samples. Second, feature layers at different scales have different distribution characteristics: deep feature layers carry rich semantic information but lack accurate location information, while shallow feature layers carry rich detail but also a large amount of redundant information, which can cause some regions (e.g., non-text regions) to be classified incorrectly [23]. To address this, this paper proposes a new cross-level feature pyramid network that obtains feature maps of the text representation at various scales. Aggregating these multi-scale feature maps effectively addresses the problem of large scale differences in scene text and locates text areas more finely, and the network can be easily integrated into existing methods.
This article proposes a new text detector that addresses these two problems and detects scene text of any shape more accurately. As shown in Figure 1, given an input image, the feature pyramid backbone ResNeDt generates feature layers at different scales through downsampling. Compared with the original residual network, ResNeDt increases the depth of the network, further enlarges the receptive field and adapts more effectively to arbitrarily shaped text (horizontal, multi-oriented, curved and wavy), thus localizing text regions more finely. To collect feature information from the shallow and deep layers at the same time, we propose the Cross-Level Feature Pyramid Network, which models the feature information extracted on two adjacent feature layers to further enhance feature extraction. The module produces multi-scale feature representations, effectively addressing multi-scale variation in scene text detection with a minimal increase in computation. Finally, the binarization map is obtained through adaptive learning in the differentiable binarization module, which produces higher quality prediction boxes and further improves the robustness of text detection for various shapes.
To demonstrate the validity of the proposed framework, experiments were carried out on three different types of datasets: ICDAR 2015 [24], Total-Text [13] and MSRA-TD500 [25]. Among these, Total-Text is specifically designed for curved text detection. Experimental results on MSRA-TD500 and Total-Text show that the method is highly flexible in complex situations (such as multilingual text, curved text and arbitrarily shaped text). Specifically, on Total-Text with arbitrarily shaped text, our model exceeds the results of most state-of-the-art methods and achieves a comparable performance (82.6%). In addition, the proposed framework also performs well on ICDAR 2015.
Figure 1. The general architecture of the proposed CSFF-Net, with ResNeDt as the backbone network. The method proceeds in three steps: first, the backbone network ResNeDt maps the input image to feature layers at different scales; second, the Cross-Level FPN extracts detailed feature information from the outputs of the backbone network; finally, the DB module produces the final result from the output of the Cross-Level FPN.
To sum up, the primary contributions of this article are as follows: (1) A depth weighted convolution module is proposed to produce more expressive features; it has a universal structure and requires less computation than previous methods. (2) The proposed 3D-Attention module models the contextual relevance of the feature map, generating more representative features and improving text detection performance. (3) The Cross-Level Feature Pyramid Network with its Cross-Level Feature Fusion Module not only handles feedforward information flow efficiently but also enriches features through upper (lower) feature layers and skip connections; it effectively addresses the detection of arbitrarily shaped scene text and improves text detection performance. (4) The method achieves state-of-the-art performance on several benchmarks that include different forms of text instances (oriented, long, multilingual and curved), which demonstrates the advantage of the newly designed modules.

Related Work
The detection of scene text has been a popular research topic, and many methods have been proposed. Before the advent of deep learning, early text detectors [26][27][28] mainly used hand-crafted features as basic components, such as Stroke Width Transform (SWT) [26], Maximally Stable Extremal Regions (MSER) [27] and symmetry features [28]. In recent years, scene text detection methods based on deep learning have achieved remarkable results. Modern text detectors are mainly based on CNNs and can be divided into two categories: regression-based methods and segmentation-based methods.
Regression-based detection methods typically follow an object detection framework driven by convolutional neural networks (CNNs) [1,2], such as Faster R-CNN and SSD [29]. Unlike ordinary objects, text usually appears in irregular shapes with varying aspect ratios. To deal with this problem, TextBoxes [3] used SSD as the base detector and modified the shapes of the convolutional kernels and anchor boxes to accommodate variations in the aspect ratio of text instances. As variants of Faster R-CNN, the Rotation Region Proposal Network (RRPN) [4] and Rotational Region CNN (R2CNN) [1] were designed to detect text in arbitrary orientations in a two-stage manner. To handle long text, Baoguang et al. [5] and Zhi et al. [6] proposed SegLink and CTPN, which predict text segments and connect them into text boxes. RRD [7] extracted feature maps from two separate branches for text classification and regression to better detect long text. Reference [8] detected text corner points and grouped them into boxes. Unlike these methods, which regress anchor boxes, segments or corners, Xinyu et al. [9] and Wenhao et al. [10] performed box regression and predicted pixel offsets in the text area without using anchors or proposals. Building on [9], Chuhui et al. [11] segmented the boundary regions of text to distinguish text instances. However, this approach has structural limitations in capturing all possible shapes. The more recent LOMO [12] uses an iterative refinement module to iteratively refine bounding box proposals for extremely long text, and then predicts the text region, centerline and border offsets to reconstruct the text instance.
Although regression-based approaches have achieved advanced performance, they still require tedious multi-stage pipelines, which may need extensive tuning and lead to sub-optimal performance. In addition, because of the large variation in the aspect ratio of text (especially non-Latin text) and the limited receptive field of CNNs, these methods cannot effectively handle text under complex background conditions.
Segmentation-based approaches [30,31] mainly draw on semantic segmentation methods: all pixels in a text bounding box are treated as positive sample regions, the text areas are described with different representations, and the text instances are then reconstructed through specific post-processing. Yongmin et al. [14] put forward a method based on character probability prediction. Its main idea is to use Gaussian heat maps to generate a heat map for each single character, and then use the distance between characters to generate an affinity heat map for weakly supervised training. This method is effective for languages with constant character spacing. Reference [15] formulated text as multiple attributes, such as text region and orientation, and predicted the corresponding heatmaps with an FCN to extract the text region. Liu et al. [16] proposed a Transverse and Longitudinal Offset Connection (TLOC) based on [32] and RNNs to directly regress the polygon shape of text boxes. Reference [17] treats the detection of curved text as an instance segmentation problem and uses Mask R-CNN to generate the boundaries of text instances. Component segmentation methods divide the text area into several components, which are grouped into different instances by grouping data, communicating between nodes, or post-processing. For example, PixelLink [18] predicts connections between pixels to find text areas and separates the links belonging to different text instances. Tian et al. [19] designed a two-stage method to separate dense text instances. PSENet [20] gradually expands kernels at a certain scale to split nearby text instances. Our method combines the advantages of object detection and segmentation methods, adopts a three-step structure, and uses contour points to represent text areas.
This model effectively solves the problem of large-scale differences and enhances the text-related regions by reducing the background interference of the feature layer and the use of the attention mechanism. Compared with the previous methods, this method gives a more accurate description of the text regions, so it can produce finer text boundaries.

Methods
Deep convolutional neural networks [21,22] have made a series of breakthroughs in image classification [21,23,33] and can effectively learn high-level semantic information directly from images because of their powerful feature representation capabilities. To improve network performance while keeping the network lightweight, easy to deploy and fast enough for real-time use in practical applications, we chose DBNet [34] as our baseline and improved the original neural network. Without changing the overall model structure, and while preserving the original feature information of the network, we improve the feature expression of the backbone network by introducing a depth weighted convolution module (DWCM) and a 3D-Attention module that models the contextual relevance of effective features, which further optimizes the feature extraction network. As shown in Figure 2, this increases the depth of the network compared with the original network and also improves the detection accuracy of the model. The overall network structure is shown in Figure 1, and its specific steps and roles are listed in Table 1. As Table 1 shows, the network is composed of three main parts: Backbone, Neck and Head. The Backbone, which is ResNeDt18 or ResNeDt50 in this paper, extracts the feature information in the image for later use by the network. The Neck sits between the Backbone and the Head; here it is our Cross-Level FPN, which makes better use of the features extracted by the Backbone to generate more representative features. The Head is the detection head, the part of the network that produces the final output; here it is Differentiable Binarization, which predicts the text boxes from the previously extracted features.

Depth Weighted Convolution Module (DWCM)
For any residual network in the series (e.g., ResNet-18, 34, 50, 101, 152), the front part of the structure is the same: a 7 × 7 standard convolutional layer and a 3 × 3 max pooling layer, followed by a series of residual structures formed by stacking several standard convolution layers. For standard convolution, the output feature map of the i-th channel is given by Equation (1), where * denotes the convolution operator and $k_i$ is the convolution kernel.
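For reference, a standard convolution that sums over all input channels can be written as below. The notation is an assumption and may differ from the paper's Equation (1): $x_j$ is the $j$-th input channel, $y_i$ the $i$-th output channel, $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ the channel counts, and $k_{i,j}$ the slice of kernel $k_i$ that acts on input channel $j$.

```latex
y_i = \sum_{j=1}^{C_{\mathrm{in}}} x_j * k_{i,j}, \qquad i = 1, \dots, C_{\mathrm{out}}
```

The sum over $j$ is the channel mixing that the text identifies as common to every standard convolutional layer.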
However, standard convolution has the following shortcomings. Each output feature map of a standard convolution must sum over all input channels, and all feature layers of the original network are generated repeatedly by Equation (1), so the entire network learns through similar patterns. In addition, a network built by stacking standard convolution layers lacks sufficient capacity to store high-level semantic information and cannot capture it. Furthermore, the pooling layers make the network somewhat insensitive to image details, and under complex background conditions the image contains little target information, which makes it difficult to locate the target object accurately. Together, these drawbacks may weaken the representation of the feature map. To alleviate these problems and improve the detection performance of convolutional networks, we designed a novel convolutional structure, shown in Figure 3, which is inserted between the standard convolution layers to learn more image features. As an enhanced version of standard convolution, the depth weighted convolution module consists of two main steps: first, a 3 × 3 convolution is performed independently on each channel of the input; then the output features are summed element by element with the input features of the module (the ⊕ operation). This design reuses the feature information of the original network, reduces the loss of low-dimensional feature information and ensures that the network learns richer features at each spatial dimension. The process can be expressed by Formula (2):
$$x^{1}_{out} = f_{DWCM}(x^{1}_{in}) \oplus x^{1}_{in} \tag{2}$$
where $f_{DWCM}$ represents the 3 × 3 convolutional layer, and $x^{1}_{in}$ and $x^{1}_{out}$ represent the input and output, respectively.
Here, the single-channel convolution is performed on a two-dimensional plane, and a single convolution kernel is applied to each channel. For example, let the sample input size be H (image height) × W (image width) × K (number of image channels). The 3 × 3 convolution is chosen because it achieves a higher computational density on the GPU than the 1 × 1 convolution or even the 5 × 5 convolution, so it runs faster on the GPU. Each channel of the input feature is convolved with its own single-channel convolution kernel, so the number of feature maps remains unchanged: after the K-channel convolution, K feature maps (H × W) are still obtained. The input features are thereby filtered, which provides more effective input features for subsequent operations. The expression is as follows:
$$G_{i,j,m} = \sum_{w,h} K_{w,h,m} \cdot X_{i+w,\,j+h,\,m} \tag{3}$$
In Equation (3), G is the output, K is the convolution kernel of width (W) and height (H), X is the input, m denotes the m-th channel of the feature map, i and j denote the coordinates of the output on the m-th channel, and w and h are the coordinates of the convolution kernel weights of that channel.
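The two steps of the DWCM (a per-channel 3 × 3 convolution followed by element-wise addition of the input) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the channel-first (K, H, W) layout and the zero padding are assumptions made for illustration, and batch dimensions are omitted.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution with zero padding (cross-correlation form).

    x:       array of shape (K, H, W)
    kernels: array of shape (K, 3, 3), one kernel per channel (assumed layout)
    """
    K, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps the H x W size
    out = np.zeros_like(x, dtype=float)
    for h in range(3):
        for w in range(3):
            # Channel m only ever sees its own weight kernels[m, h, w].
            out += kernels[:, h, w][:, None, None] * padded[:, h:h + H, w:w + W]
    return out

def dwcm(x, kernels):
    """Depthwise 3x3 convolution, then element-wise sum with the module input."""
    return depthwise_conv3x3(x, kernels) + x
```

Because no sum is taken across channels, K input maps always yield K output maps, and the residual addition keeps the original features available to later layers.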
Compared with the standard convolution operation, this module has two advantages. Firstly, it enhances the information of the channels, making the feature representation generated by our depth-weighted convolutional module more expressive and making the network model achieve better results. As a result, the residual network with the depth weighted convolutional module can locate text regions more completely and more accurately. Secondly, the depth weighted convolution module is universal and can be easily applied to standard convolutional layers without introducing any parameters or changing hyperparameters, thus achieving portability.

3D-Attention Module
It is well known that human visual processing ability is limited and cannot handle all information at the same time. Attention is mainly focused on regions with salient features, and machine vision can use the same mechanism to improve efficiency. Complex scenes contain a large amount of information, and the most important information in an image is generally concentrated in a relatively small area, so using visual attention mechanisms to acquire effective information from an image quickly and accurately is particularly important when building visual models. Inspired by this, we designed a simple and universal 3D-Attention module that is applied to the features in each BasicBlock [35] and trained jointly with it. The module aims to extract effective features, suppress ineffective ones and select high-level semantic feature information. It strengthens the network's ability to refine features and makes the network focus on informative features, such as text regions in images, which improves the network's feature extraction ability and increases the model's expressiveness.
The module significantly expands the receptive field of each feature layer by improving the feature transformation of the convolutional network. This helps the CNN produce more representative information, strengthens the learned representation, enriches the output features of the backbone network and improves the accuracy of feature extraction, thus further optimizing the network. Compared with other attention mechanisms, few additional parameters are introduced and only a small amount of computation is added, so model performance improves at a small overhead. We first briefly introduce the channel attention mechanism [36] and the spatial attention mechanism [32]. Channel attention [36] produces a one-dimensional feature vector of size (C × 1 × 1), while spatial attention produces a two-dimensional feature map of size (1 × H × W), where C denotes the number of channels, and H and W are the height and width of the feature map, respectively. The 3D-Attention module in this paper is similar to those of Zhu et al. [32] and Hu et al. [36], with one difference: our attention produces a three-dimensional matrix (C × H × W) as the attention feature map, rather than a one-dimensional feature vector or a two-dimensional feature map. As shown in Figure 4, the 3D-Attention module uses only a standard ConvBnRelu block (with a 1 × 1 convolution) and a sigmoid activation function to obtain the attention feature map. It then multiplies the attention feature map by the input of the module and adds the result to the input features. This yields a high-level semantic feature map while introducing few additional parameters, enhances the sensitivity of the network to text and produces better detection results. The input of the module is the feature map output by the previous convolution block, and the attention module provides a weight for every position along each of its dimensions.
We denote the input feature map as $x^{2}_{in}$ and the output feature map as $x^{2}_{out}$, with $x^{2}_{out}$ passed on to the next stage as the module output. Thus, the 3D-Attention module can be described as follows:
$$x^{2}_{out} = \left(x^{2}_{in} \otimes f_{3D}(x^{2}_{in})\right) \oplus x^{2}_{in}$$
in which $f_{3D}$ represents the 1 × 1 convolution layer, the batch normalization (BN) layer and the nonlinear Relu layer, followed by a sigmoid; ⊗ denotes element-wise multiplication and ⊕ element-wise summation. BN counteracts the shift in data distribution that follows matrix multiplication and nonlinear operations and that would otherwise slow the convergence of the network. Passing through the BN layer also helps avoid the vanishing and exploding gradient problems of deep networks and reduces the reliance on parameter initialization methods. The sigmoid function outputs a probability map that determines the weight of each feature. The Relu activation function and the sigmoid function together build a non-linear relationship between the channels, which enhances the non-linear capacity of the model and improves the learned representation of the network. Note that we place the 3D-Attention module after the depth weighted convolution module (DWCM), because this ordering maximizes the usefulness of each module. The proposed 3D-Attention module not only calibrates the features between channels, but also improves the local feature representation of spatial information. During calibration, spatial features and channel information are effectively combined to further enrich the contextual semantic information of small targets (text) in shallow features. The advantages are mainly reflected in the following three aspects. Firstly, compared with standard convolution, each spatial location not only embeds its surrounding information as a scalar of the original spatial response, but also models long-distance inter-channel dependencies to capture rich contextual relationships. Screening each input channel helps the network selectively enhance features that contain useful information and suppress redundant features, which effectively improves the transferability of target features between high and low layers and enhances the semantic information in the feature layer.
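The attention computation described above (1 × 1 convolution, BN, Relu, sigmoid, element-wise multiplication with the input, then a residual sum) can be sketched in NumPy as follows. This is an illustrative sketch only: the function name, the channel-first (C, H, W) layout and the treatment of batch normalization as a folded per-channel scale and shift at inference time are assumptions, not details from the paper.

```python
import numpy as np

def attention_3d(x, weight, bn_scale, bn_shift):
    """x: (C, H, W) input feature map.

    weight:   (C, C) weights of the 1x1 convolution (assumed square)
    bn_scale: (C,) folded batch-norm scale (assumed inference-time form)
    bn_shift: (C,) folded batch-norm shift (assumed inference-time form)
    """
    y = np.einsum("oc,chw->ohw", weight, x)                    # 1x1 conv: mixes channels at each pixel
    y = y * bn_scale[:, None, None] + bn_shift[:, None, None]  # batch norm as a per-channel affine map
    y = np.maximum(y, 0.0)                                     # Relu
    attn = 1.0 / (1.0 + np.exp(-y))                            # sigmoid: one weight per (c, h, w) position
    return x * attn + x                                        # reweight the input, then add it back
```

The attention map has the same C × H × W shape as the input, which is what distinguishes it from a C × 1 × 1 channel vector or a 1 × H × W spatial map. In this sketch the sigmoid is applied to a non-negative Relu output, so every attention weight lies in [0.5, 1).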
Secondly, instead of collecting global contextual information, the 3D-Attention module considers only the contextual information around each spatial location, which to some extent avoids contamination by information from irrelevant (non-text) regions. It also uses a residual connection in the deeper part of the network to further enhance the information transfer between non-adjacent feature layers, improve feature utilization, avoid the vanishing gradient problem and allow the network to be made deeper.
Finally, the 3D-Attention module can be easily embedded into modern classification networks for a wide range of tasks due to its generic nature. Although it increases the number of parameters in the network, the structure is simple. The introduction into existing networks will only add a small amount of computation and model complexity, with good generalization to different datasets, which is extremely attractive.

Cross-Level Feature Pyramid Networks
At present, many networks use only a single high-level feature to classify objects. This has an obvious defect: small objects (such as text) have little pixel information and are easily lost during the up-sampling process. Because the sizes of such objects differ from those in general object detection, the classic approach is to handle multi-scale variation with image pyramids, but this incurs a great deal of computation. Therefore, this paper proposes a Cross-Level Feature Pyramid Network (Cross-Level FPN) based on the Feature Pyramid Network (FPN), as shown in Figure 5. Cross-Level FPN is a top-down network structure with lateral connections, used to construct feature maps of different sizes with high semantic information. Specifically, the inputs {$C_2$, $C_3$, $C_4$, $C_5$} are the outputs of the backbone (ResNeDt); their sizes are 1/4, 1/8, 1/16 and 1/32 of the original image size, corresponding to the outputs of levels 2, 3, 4 and 5, respectively. A level refers to a stage of the network: layers that produce output feature maps of the same size are considered to be at the same level, each level is defined as a stage, and the output of the last layer of each stage serves as an input to Cross-Level FPN, which enables us to create a pyramid network that contains more semantic information. Level 1 is not included in the feature pyramid network because its feature map is too large and takes up a lot of memory.
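As a quick worked example of the level sizes, the helper below computes the spatial size of each backbone output for a given input. The function name is hypothetical, and integer division assumes the input height and width are multiples of 32.

```python
def pyramid_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of C2..C5 for an input of the given size."""
    return {f"C{level}": (height // s, width // s)
            for level, s in zip(range(2, 6), strides)}

# For a 640 x 640 input:
# C2 -> (160, 160), C3 -> (80, 80), C4 -> (40, 40), C5 -> (20, 20)
sizes = pyramid_sizes(640, 640)
```

Each level halves the resolution of the previous one, so a feature map at level 1 (stride 2) would hold four times as many positions as $C_2$, which is consistent with the memory argument for leaving it out.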
A multi-scale feature representation is generated by extracting features at each scale of the image. Information from both the high-resolution lower-level features and the semantically strong higher-level features is used to predict the feature maps at each level, which effectively addresses multi-scale variation in scene text detection with minimal computation. Shallow feature layers (such as $C_2$) contain more texture (detail) information, while deeper feature maps (such as $C_5$) contain more semantic information. To combine feature maps with these different characteristics, Cross-Level FPN uses top-down and lateral connection strategies. The top-down path produces higher resolution features by upsampling feature maps from higher pyramid levels, which are smaller in spatial size but semantically stronger. The features are then further enhanced through lateral connections; note that the features joined by a lateral connection have the same size. As shown in Figure 5, the red arrows represent the output branches, the blue arrows represent the correction branches, and the yellow circles indicate the Cross-Level Feature Fusion Module (CLFFM). The correction branch of feature layer $C_5$ is combined with feature layer $C_4$ by CLFFM to obtain the output branch and correction branch of $C_4$; the correction branch of $C_4$ is combined with feature layer $C_3$ by CLFFM to obtain the output branch and correction branch of $C_3$; and the correction branch of $C_3$ is combined with feature layer $C_2$ by CLFFM to obtain the output branch of $C_2$. The output branches of all feature layers are fed into the next stage of the task as the output of the Cross-Level FPN.
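The top-down wiring described above can be sketched as a short loop. The fusion step is passed in as a callable, since only the data flow is shown here. The function name, the dictionary keys and the treatment of the top level (the $C_5$ features are passed through unchanged as the first correction input and as P5) are assumptions for illustration.

```python
def cross_level_fpn(features, clffm):
    """Wire C2..C5 through a fusion callable from the top level down.

    features: dict with keys "C2".."C5" (assumed already reduced to a common channel count)
    clffm:    callable (lower_level, correction) -> (output_branch, correction_branch)
    """
    correction = features["C5"]          # the top level starts the correction chain
    outputs = {"P5": features["C5"]}     # assumption: top level passed through unchanged
    for level in (4, 3, 2):
        out, correction = clffm(features[f"C{level}"], correction)
        outputs[f"P{level}"] = out       # the output branch feeds the detection head
    return outputs                       # the last correction branch is not used
```

Each level therefore sees its own backbone features plus a correction signal that has already absorbed every deeper level.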
The working mechanism of CLFFM is introduced by taking the two input feature layers $C_4$ and $C_5$ as an example, as shown in Figure 6. First, bilinear interpolation is used to raise the resolution of the smaller feature map ($C_5$) to the same scale as $C_4$, and the cross-layer feature maps $C_4'$ and $C_5'$ on their respective branches are generated by a convolution operation:
$$C_4' = \mathrm{Relu}(\mathrm{Bn}(\mathrm{Conv}(C_4))), \qquad C_5' = \mathrm{Relu}(\mathrm{Bn}(\mathrm{Conv}(C_5^{\uparrow})))$$
where $C_5^{\uparrow}$ denotes the upsampling operation, Conv is the 3 × 3 convolution, Bn denotes batch normalization and Relu denotes the activation function.
Secondly, we apply a convolution operation to each of the generated cross-layer feature maps $C_4'$ and $C_5'$ and multiply the results element by element, which yields two branches, the output branch $F_4$ and the correction branch $F_4'$:
$$F_4 = \mathrm{Relu}(\mathrm{Bn}(\mathrm{Conv}(C_4'))) \otimes \mathrm{Relu}(\mathrm{Bn}(\mathrm{Conv}(C_5')))$$
$$F_4' = \mathrm{Relu}(\mathrm{Bn}(\mathrm{Conv}(C_4'))) \otimes \mathrm{Relu}(\mathrm{Bn}(\mathrm{Conv}(C_5'))) \tag{7}$$
where ⊗ represents element-wise multiplication. The processed features of the two levels are multiplied point by point because lower-level features provide more accurate location information, while the up-sampling operation introduces errors into the positioning information of the deep network. Combining them forms a deeper feature pyramid network, which integrates feature information from multiple layers and outputs it at different scales. Then, both $F_4$ on the output branch and $F_4'$ on the correction branch are passed through a convolution with 64 channels and a kernel size of 3 × 3. The resulting feature maps are summed element by element with the cross-layer feature maps $C_4'$ and $C_5'$, respectively, for feature fusion, and finally a 3 × 3 convolution is appended to generate the final feature maps $P_4$ and $P_4'$:
$$P_4 = \mathrm{Conv}(\mathrm{Conv}(F_4) \oplus C_4'), \qquad P_4' = \mathrm{Conv}(\mathrm{Conv}(F_4') \oplus C_5')$$
where ⊕ denotes the element-by-element summation operation. The purpose of the final convolution is to reduce the aliasing effect caused by upsampling and to preserve the integrity of the pyramid network structure. Two branches are output for two reasons. On the one hand, $P_4'$ on the correction branch can be used as an input to repeat the above process with the feature map $C_3$ of the previous stage. In this way, the high semantic information of the deep feature map is retained and fused with the low-level feature map to further enhance feature extraction. On the other hand, the output $P_4$ combines the high semantic information of the high-level feature with the rich details of the low-level feature, giving a highly accurate final feature map. The outputs across all levels are denoted {$P_2$, $P_3$, $P_4$, $P_5$}, which correspond to the input feature maps {$C_2$, $C_3$, $C_4$, $C_5$} of the same sizes. This process is summarized in Equation (11).
At the beginning of the iteration, a 1 × 1 convolution is applied to each input feature layer in $\{C_2, C_3, C_4, C_5\}$ to reduce its dimension, so that all layers have the same number of channels.
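To make the channel-reduction step concrete, the following is a minimal pure-Python sketch of a bias-free 1 × 1 convolution, which is simply a per-pixel linear map across channels. The toy feature map `c4`, the weight matrix `w`, and the function name `conv1x1` are hypothetical and chosen only for illustration; the actual network would use learned weights and far more channels.

```python
def conv1x1(feature, weight):
    """Apply a bias-free 1x1 convolution.

    feature: nested list shaped [C_in][H][W]
    weight:  nested list shaped [C_out][C_in]
    Returns a nested list shaped [C_out][H][W].
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            [sum(row[c] * feature[c][y][x] for c in range(c_in)) for x in range(width)]
            for y in range(height)
        ]
        for row in weight
    ]


# Hypothetical toy input: 3 channels on a 2x2 grid, reduced to 2 channels.
c4 = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w = [
    [1.0, 0.0, 0.5],   # output channel 0 = ch0 + 0.5 * ch2
    [0.0, 1.0, -1.0],  # output channel 1 = ch1 - ch2
]
reduced = conv1x1(c4, w)
```

The spatial size is unchanged and only the channel count drops, which is what allows feature layers from different stages to be multiplied and summed element by element afterwards.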

Differentiable Binarization
The structure of differentiable binarization is shown in Figure 7. The input is an image containing text; the network produces a segmentation probability map $P$ together with an adaptive threshold map $T$, in which every pixel has its own threshold. The final result is obtained by applying a differentiable binarization operation to $P$ and $T$. Specifically, after the enhanced feature extraction network Cross-Level FPN outputs the four feature layers $\{P_2, P_3, P_4, P_5\}$, the three layers $\{P_3, P_4, P_5\}$ are upsampled to the size of the largest layer, $P_2$. The four layers are then concatenated to obtain a feature map $F$ with the same size as $P_2$, which is used to predict $P$ and $T$. Finally, $P$ and $T$ are combined to obtain the binarized map $B$.
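As an illustration of how $P$ and $T$ can be combined, the sketch below uses the approximate step function from the original DBNet formulation, $B_{i,j} = 1 / (1 + e^{-k(P_{i,j} - T_{i,j})})$ with amplification factor $k = 50$. It is an assumption that CSFF-Net keeps this formulation and this value of $k$ unchanged from its baseline; the function names and the 2 × 2 toy maps are hypothetical.

```python
import math

K = 50.0  # amplification factor; 50 in the DBNet paper, assumed unchanged here


def differentiable_binarization(p, t, k=K):
    """Approximate binary value for one pixel: 1 / (1 + exp(-k * (p - t)))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def binarize_map(prob_map, thresh_map, k=K):
    """Apply the approximate step function pixel by pixel to two 2-D maps."""
    return [
        [differentiable_binarization(p, t, k) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 probability and threshold maps.
prob = [[0.9, 0.2], [0.5, 0.35]]
thresh = [[0.3, 0.3], [0.5, 0.3]]
approx = binarize_map(prob, thresh)
```

Pixels whose probability is well above their threshold map to values close to 1, pixels well below map to values close to 0, and the function stays differentiable near the threshold, so the binarization can be trained together with the rest of the network.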

Datasets
In this paper, experiments are carried out on three challenging public datasets: Total-Text [13], MSRA-TD500 [25], and ICDAR2015 [24]. The visualization results are shown in Figure 8.

1. Total-Text [13] is a dataset for detecting curved text. It contains curved text from commercial signs and entrance signs in real-life scenes, with a total of 1555 images: 1255 for training and 300 for testing.

2. MSRA-TD500 [25] is a multi-language, multi-category dataset with 500 photos: 300 for training and 200 for testing. The photos show signs, house numbers and warning signs in indoor scenes, and guide plates and billboards against complex backgrounds in outdoor scenes.

3. ICDAR2015 [24] is a linear text detection and recognition dataset of English text, with 1500 images: 1000 for training and 500 for testing. The images are street and shopping mall scenes captured randomly with Google Glass without focusing; the goal is to improve the generalization performance of the detection model.

Loss Functions
The loss function plays a crucial role in deep neural networks. The $L_1$ loss and the binary cross-entropy (BCE) loss are used to optimize our network. During training, the loss consists of three components, the probability map loss $L_s$, the binarization map loss $L_b$, and the adaptive threshold map loss $L_t$:

$$L = L_s + \alpha \times L_b + \beta \times L_t \quad (12)$$

where $\alpha$ and $\beta$ are the weight parameters, set to 1 and 10, respectively. The BCE loss is used for both the probability map loss $L_s$ and the binarization map loss $L_b$, and hard negative mining [34] is used to overcome the imbalance between positive and negative samples:

$$L_s = L_b = -\sum_{i \in S_l} \left[ y_i \log x_i + (1 - y_i) \log (1 - x_i) \right] \quad (13)$$

where $x_i$ and $y_i$ are the prediction and the label at pixel $i$, and $S_l$ is the sampled set in which the ratio of positive to negative samples is 1:3. The $L_1$ loss is adopted for the loss $L_t$ of the adaptive threshold map:

$$L_t = \sum_{i \in R_d} \left| y_i^* - x_i^* \right| \quad (14)$$

where $R_d$ is the index set of the pixels in the region, $x^*$ is the predicted threshold map, and $y^*$ is the label of the adaptive threshold map.
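The three loss terms can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the training code of the paper: the maps are flattened to short lists, hard negative mining is implemented by keeping the highest-loss negatives at a 1:3 ratio, the losses are summed rather than averaged, and all function names and sample values are hypothetical.

```python
import math


def bce(pred, label, eps=1e-6):
    """Binary cross-entropy for one pixel; pred is clamped away from 0 and 1."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def balanced_bce(preds, labels, neg_ratio=3):
    """Sum BCE over all positives and the hardest negatives (1:neg_ratio)."""
    losses = [bce(p, y) for p, y in zip(preds, labels)]
    positives = [loss for loss, y in zip(losses, labels) if y == 1]
    negatives = sorted((loss for loss, y in zip(losses, labels) if y == 0), reverse=True)
    return sum(positives) + sum(negatives[: neg_ratio * len(positives)])


def l1_loss(preds, labels, region):
    """Sum of |label - prediction| over pixels whose region flag is set."""
    return sum(abs(y - x) for x, y, inside in zip(preds, labels, region) if inside)


def total_loss(l_s, l_b, l_t, alpha=1.0, beta=10.0):
    """Weighted sum L = L_s + alpha * L_b + beta * L_t."""
    return l_s + alpha * l_b + beta * l_t


# Hypothetical flattened maps with six pixels (one text pixel, five background).
text_label = [1, 0, 0, 0, 0, 0]
prob_pred = [0.8, 0.1, 0.6, 0.2, 0.05, 0.3]
binary_pred = [0.9, 0.05, 0.7, 0.1, 0.01, 0.2]
thresh_pred = [0.6, 0.3, 0.5, 0.2, 0.0, 0.0]
thresh_label = [0.7, 0.3, 0.3, 0.2, 0.0, 0.0]
region = [1, 1, 1, 0, 0, 0]

l_s = balanced_bce(prob_pred, text_label)
l_b = balanced_bce(binary_pred, text_label)
l_t = l1_loss(thresh_pred, thresh_label, region)
loss = total_loss(l_s, l_b, l_t)
```

With one positive pixel, only the three background pixels with the largest loss contribute to $L_s$ and $L_b$, so easy background pixels do not dominate the objective.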

Implementation Details
The experiments in this paper were implemented in Python 3.7 with the deep learning framework PyTorch 1.5, and all experiments were conducted on a TITAN RTX GPU. The initial learning rate was set to 0.007. Training involved two steps: first, the network was pre-trained on the SynthText dataset [37] for 100k iterations; then, the models were fine-tuned on the real-world benchmark datasets for 1200 epochs. Our model was trained on the official training images of each dataset, with a weight decay of 0.0001 and a momentum of 0.9. The optimizer was Adam [38], and the batch size was set to 16. In the pre-processing stage, the labels of the probability map and the threshold map were created from the annotations of the training set. Since smaller text regions are not easy to detect, text regions that were too small (e.g., those whose minimum bounding rectangle has a shortest side of less than 3, or whose polygon area is less than 1) were ignored when creating the labels. These small regions were marked as "NEGLECT", and all text marked as "NEGLECT" was discarded during training. In the testing stage, single-scale images were input, and the results were evaluated with the official evaluation protocol.
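The small-region rule described above can be sketched as follows. This is an illustrative approximation rather than the preprocessing code of the paper: the polygon area uses the shoelace formula, and the smallest rectangle is searched only among rectangles aligned with a polygon edge, which is exact for convex polygons. The function names and sample polygons are hypothetical; the thresholds 3 and 1 are those stated in the text.

```python
import math


def polygon_area(points):
    """Shoelace formula; points is a list of (x, y) vertices in order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def min_rect_short_side(points):
    """Shorter side of the smallest enclosing rectangle aligned with a polygon edge."""
    best_area, best_short = None, None
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        length = math.hypot(x2 - x1, y2 - y1)
        if length == 0:
            continue  # skip duplicated vertices
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        along = [px * ux + py * uy for px, py in points]    # extent along the edge
        across = [-px * uy + py * ux for px, py in points]  # extent across the edge
        w, h = max(along) - min(along), max(across) - min(across)
        if best_area is None or w * h < best_area:
            best_area, best_short = w * h, min(w, h)
    return best_short if best_short is not None else 0.0


def is_ignored(points, min_side=3.0, min_area=1.0):
    """True if the region is too small to be used as a training label."""
    return polygon_area(points) < min_area or min_rect_short_side(points) < min_side


thin_box = [(0, 0), (10, 0), (10, 2), (0, 2)]     # shortest side 2, ignored
normal_box = [(0, 0), (40, 0), (40, 12), (0, 12)]  # kept
```

A production pipeline would more likely call a library routine for the minimum-area rectangle; the edge-aligned search is used here only to keep the example dependency-free.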
Because the scale of the test images has a great influence on the detection results [6,8], the aspect ratio of the test images was kept in the inference stage, and the input image was resized by setting a suitable height for each dataset.
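A minimal sketch of this resizing rule is given below. Rounding both sides to a multiple of 32 is an assumption motivated by the down-sampling factor of typical ResNet and FPN backbones and is not stated in the text; the function name and the example image sizes are hypothetical.

```python
def inference_size(height, width, target_height, multiple=32):
    """Return (new_height, new_width) that approximately keep the aspect ratio.

    Both sides are rounded to a multiple of `multiple` (an assumption, see above).
    """
    new_height = max(multiple, round(target_height / multiple) * multiple)
    scale = new_height / height
    new_width = max(multiple, round(width * scale / multiple) * multiple)
    return new_height, new_width


# Hypothetical examples: a 720x1280 image tested at height 736,
# and a 600x800 image tested at height 800.
size_a = inference_size(720, 1280, 736)
size_b = inference_size(600, 800, 800)
```

Because of the rounding, the aspect ratio is preserved only up to a small error, which is usually negligible compared with the image size.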
We augmented the training data in the same way as in [34], mainly in the following three ways: (1) random rotation; (2) random cropping; (3) random flipping. To improve training efficiency, all processed images were resized to 640 × 640.

Ablation Study
To verify the effectiveness of each module proposed in this paper, namely the depth weighted convolution module (DWCM), the 3D-Attention module, and the Cross-Level Feature Pyramid Network (Cross-Level FPN), we carried out an ablation study with DBNet as the baseline. We tested DBNet, DBNet+DWCM, DBNet+3D-Attention, DBNet+DWCM+3D-Attention, DBNet+Cross-Level FPN, and the method proposed in this paper (DBNet+DWCM+3D-Attention+Cross-Level FPN). The detailed experimental results are shown in Table 2. On the ICDAR2015 test set, the baseline DBNet achieves a precision, recall and F-measure of 89.3%, 73.8% and 80.8%, respectively, while our method DBNet+DWCM+3D-Attention+Cross-Level FPN achieves 86.4%, 79.2% and 82.7%. The F-measure of our method on ICDAR2015 is therefore 1.9% higher than the baseline; the gain comes from recall (5.4% higher), while precision is 2.9% lower. Table 2 shows the impact of the different modules on the performance of the network, with the final network DBNet+DWCM+3D-Attention+Cross-Level FPN achieving the highest F-measure, which supports the validity of the proposed modules.

Table 2. Test results under different settings. "P", "R" and "F" represent precision, recall and F-measure, respectively.

Figure 9 shows the visualization results of the ground truth (GT), the baseline and our method, respectively. The images in the figure (from top to bottom) are randomly selected from the test sets of ICDAR2015, MSRA-TD500 and Total-Text, which better demonstrates the robustness of our model.
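The F-measure used throughout the ablation study is the harmonic mean of precision and recall, $F = 2PR / (P + R)$. The short sketch below recomputes it from the precision and recall values quoted in the text for ICDAR2015; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall (in percent) reported for ICDAR2015 in the text.
baseline_f = f_measure(89.3, 73.8)  # DBNet baseline
ours_f = f_measure(86.4, 79.2)      # DBNet+DWCM+3D-Attention+Cross-Level FPN
```

Values recomputed from the rounded precision and recall can differ from the reported F-measure by about 0.1, since the reported figures are presumably computed before rounding.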

3D-Attention
DBNet+3D-Attention can effectively remove some irrelevant information and bring the predicted box closer to the ground-truth box. As shown in Table 1, the 3D-Attention module improves the performance of ResNet-18. Specifically, with the ResNet-18 backbone, the 3D-Attention module improves the F-measure on the ICDAR2015 dataset by 0.7% and the recall by 2.4%.

DWCM
Compared to DBNet, DBNet+DWCM yields richer features. As shown in Table 1, the depth weighted convolution module brings an F-measure gain of 0.3% by enlarging the receptive field of the backbone network, at the cost of only a little extra time. For ResNet-18, depth weighted convolution improves the recall by 2.2% on the ICDAR2015 dataset.

3D-Attention+DWCM
To cope with different types of complex datasets, we combine the two modules, the depth weighted convolution module and the 3D-Attention module, and propose DBNet+DWCM+3D-Attention. As shown in Table 1, on the ICDAR2015 dataset (with ResNet-18), the combination of 3D-Attention and DWCM improves the F-measure by 1.5%, reaching a precision of 87.6%, a recall of 77.7% and an F-measure of 82.3%. Compared to DBNet+DWCM and DBNet+3D-Attention, DBNet+DWCM+3D-Attention achieves F-measure gains of 1.2% and 0.8%, respectively. Thus, a network combining the two modules detects text better than a network applying either the depth weighted convolution module or the 3D-Attention module alone.

Cross-Level FPN
DBNet+Cross-Level FPN is able to fully capture text areas through constant supplementation and fusion when dealing with irregularly shaped text against complex backgrounds, and it gives better detection results than DBNet. As can be seen from Table 2, with the increased representation capability provided by the Cross-Level Feature Pyramid Network, the method achieves 76.3%, 88.6% and 82.0% in recall, precision and F-measure, respectively, which is an F-measure gain of 1.2% over the baseline. The experimental results illustrate that the Cross-Level FPN extracts feature information more comprehensively and improves detection accuracy. It has a stronger feature capture capability than the original feature pyramid network (FPN).

Comparison with Previous Methods
We compare our proposed method with previous methods on different datasets, including a benchmark for curved text, a benchmark for multi-directional text, and a benchmark for long text and multiple languages. Based on the evaluation criteria proposed in Herbert et al. [39] and Mark et al. [40], the experimental results are reported in Tables 3-5. Compared with the base network DBNet, our approach shows improvements on all three datasets. In particular, on the Total-Text dataset, our method improves every metric over the base network. Similarly, on the MSRA-TD500 dataset, the method outperforms its competitors in terms of P, R and F. The P, R and F results on the three datasets indicate that the proposed modules robustly improve text detection performance.

Table 3. Test results on curve datasets. The values in brackets refer to the height of the input image. "*" refers to multi-scale test. "MTS" and "PSE" are short for Mask TextSpotter and PSENet.

Curved Text Detection
In this paper, the model is also evaluated on the Total-Text dataset to demonstrate its ability to detect curved text. Following [3,4], the height of the input image is set to 800. As shown in Table 3, the performance of our method is improved compared to that of the original network. Specifically, "CSFF-ResNeDt-18 (ResNeDt-18+Cross-Level FPN+DB)" outperforms the previous baseline method by 1.4%. Compared to the previous best method TextSnake, "CSFF-ResNeDt-50 (ResNeDt-50+Cross-Level FPN+DB)" shows advantages in accuracy and F-measure, with improvements of 3.9% and 4.2%, respectively. The visualization results are shown in Figure 8. The experiments show that this method can effectively deal with irregular shapes and curved text in any direction, and is robust in detecting arbitrarily bent text instances. Compared with the baseline, the results of our method have higher accuracy and yield more accurate bounding boxes. It is worth noting that CSFF-ResNeDt denotes the different backbones used in our network.
Both CRAFT and the CSFF-Net proposed in this paper are segmentation-based text detectors. The difference is that CRAFT mainly detects single characters and the links among characters, and then determines the final text box based on these links, while CSFF-Net generates the text box by directly detecting the text. Character-level segmentation is more time-consuming than text-line segmentation, but it introduces less background information and may therefore achieve better performance. LOMO detects the text region by regression, and then obtains the final text box by using the text box center line and the text box boundary offset, whereas CSFF-Net obtains text regions directly by segmentation. The LOMO model is more complex; it may learn more feature information, but it requires a longer training time and is less hardware friendly.

Multi-Oriented Text Detection
ICDAR2015 is a multi-oriented English text dataset that contains a large number of small and low-resolution text instances. We evaluated our model on ICDAR2015 with an image height of 736 or 1152 to test its performance on multi-oriented text. As shown in Table 4, "CSFF-ResNeDt-50 (2048 × 1152)" achieves 89.6%, 81.1% and 85.1% in precision, recall and F-measure, respectively. Overall, the model exceeds the baseline by 1.9% in terms of F-measure. Although its F-measure on ICDAR2015 does not surpass that of the other advanced methods, it remains comparable to them. Compared with EAST, "CSFF-ResNeDt-50 (1280 × 736)" improves P, R and F by 7%, 7.6% and 6.9%, respectively. With the ResNet-18 backbone, "CSFF-ResNeDt-18 (1280 × 736)" reaches an F-measure of 82.7%.

Multi-Language Text Detection
The algorithm is robust for multilingual text detection. The text in the MSRA-TD500 dataset is long and large, so a larger input does not improve performance. Therefore, we simply resized the test images to a height of 512 or 736 to fit our model. Table 5 compares this method with other advanced methods. The algorithm achieves high precision, recall, and F-measure, which gives it an advantage over most existing algorithms on the MSRA-TD500 dataset. "CSFF-ResNeDt-50 (ResNeDt-50+Cross-Level FPN+DB)" is superior to previous methods in terms of accuracy, exceeding the previous advanced method by 2.8%. With a lightweight backbone, "CSFF-ResNeDt-18 (512 × 512)" achieves accuracy comparable to that of the most classic algorithm (Liu et al., 2018) (82.9 vs. 83.0). This indicates that our model is robust for multilingual detection and can be used in complex natural scenes. In summary, the framework detects scene text effectively and accurately and compares favorably with existing methods.
In the multilingual dataset MSRA-TD500, the feature information of the different scripts differs considerably. For example, English text shapes are relatively small, and the white space between words is large, whereas Chinese text shapes are complex and relatively large overall. One of the advantages of CSFF-Net is that it is designed for multi-scale changes of text, so it is sensitive to the feature information of multilingual text and detects such text well.

Discussion
Aiming at the problem of insufficient feature extraction in images with complex backgrounds, this paper proposes a structure for detecting arbitrarily shaped text in complex background environments and successfully detects arbitrarily shaped text instances. The proposed Cross-Level Feature Pyramid Network (Cross-Level FPN) plays a crucial role in the training process and is used for effective feature reuse and fusion of multi-scale contextual information. Detection accuracy is improved by adding a depth weighted convolution module and a 3D-Attention module to the backbone feature extraction network to highlight the representation of important information. The effectiveness and generality of our approach have been demonstrated on publicly available scene text datasets, including long, curved, oriented, and multilingual text cases. The experiments show that the method improves on its baseline and performs comparably to more advanced methods. Because we deepen the network to some extent and add multi-scale computation, the training time increases slightly; this has little impact on detection efficiency, and real-time detection can still be achieved. The network should be further optimized in subsequent work.
In future work, we hope to train this model within an end-to-end recognition framework and to examine whether its performance, robustness, and generalization ability carry over to a better scene text recognition system, so that it can be applied to a wider range of natural environments.