1. Introduction
Text detection in natural scenes is currently a popular research area in the field of image processing, and it has been widely used for street sign recognition, translation of scene images, and text recognition of license plates and billboards [1]. The detection and recognition of street sign text in natural scenes have received widespread attention from scholars in various countries [2]. Current work [3] mainly focuses on the text detection of road signs in fields and on highways, whose backgrounds are mostly large areas of sky and road. Such uniform backgrounds make it easy to detect text information. However, on crowded city streets, street sign text usually appears against a complex background. Tall buildings, pedestrians, vehicles, and many other objects with appearances similar to street sign text are often mistaken for it by the detection network, which reduces the accuracy of text detection. Eliminating the complex background beyond the target to be detected is also a concern of scholars in various countries [4]. In addition, street sign text in natural scenes is significantly affected by environmental factors such as lighting, occlusion, and shooting angle, making text detection in street sign images very challenging.
With the advancement of deep learning text detection algorithms, the focus of text detection has shifted from horizontal scene text detection [5] to the more challenging tasks of slanted text detection and arbitrarily shaped text detection [6]. Network frameworks such as Mask R-CNN [7] have achieved good results in scene text detection, but most Mask R-CNN-based methods stack simple single-scale convolutional layers, which does not make full use of high-level semantic information for the detection of multi-scale and arbitrarily shaped text. In addition, some methods incorporate an attention mechanism into the feature pyramid network (FPN) and replace ordinary convolutions with dilated convolutions [8], but they still do not make full use of the low-level information of the network. This approach ignores the importance of low-level information for detecting smaller text and text edges. In summary, current methods for detecting street sign text in natural scenes still have the following problems: (1) they cannot remove the influence of complex backgrounds on street sign text detection, and (2) they do not fully utilize the high-level semantic information of the feature extraction network or the contextual information in the network. Therefore, this study aimed to develop a natural scene text detection method that can effectively reduce the influence of complex backgrounds and make full use of the contextual information of the feature pyramid network to achieve effective detection of street sign text.
Through observation, we found that although the shape of text changes due to inclination and bending, the relationship between the pixel values within the same text region does not change. We therefore used the maximally stable extremal region (MSER) [9] method to preprocess the image, classifying it into text and non-text regions and removing a large number of non-text regions, thereby reducing the impact of the complex background on text detection. On this basis, a differentiable binarization network (DBNet) [10] was used as the base network; because DBNet can set the binarization threshold adaptively, its post-processing is simple. Its feature pyramid component was improved to make full use of the high-level semantic information of the feature extraction network and the contextual information in the network, enhancing DBNet's capability for image feature extraction. Finally, the text regions classified by the MSER method were passed to the improved DBNet network for further detection.
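As a concrete illustration of the adaptive thresholding that keeps DBNet's post-processing simple, the differentiable binarization of [10] replaces the hard threshold step with a steep sigmoid. The sketch below uses the amplification factor k = 50 from the original DBNet paper; the small probability and threshold maps are purely illustrative values, not outputs of a trained model:

```python
import numpy as np

def differentiable_binarization(P, T, k=50.0):
    """Approximate, differentiable step function from the DBNet paper:
    B = 1 / (1 + exp(-k * (P - T))).
    P: probability map, T: learned threshold map, k: amplification factor."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

# Illustrative 2x2 maps: pixels well above the threshold saturate
# toward 1, pixels below it toward 0, while gradients still flow.
P = np.array([[0.90, 0.20], [0.55, 0.40]])
T = np.full((2, 2), 0.50)
B = differentiable_binarization(P, T)
```

Because the mapping is differentiable, the threshold map T can be learned jointly with the probability map P instead of being fixed by hand.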
The main contributions of this paper are as follows: (1) a multi-channel MSER method is proposed that uses the R, G, B, and S channels to extract natural scene street sign text regions, effectively reducing the impact of strong light and complex backgrounds on street sign text extraction; (2) a feature pyramid route enhancement (FPRE) module is proposed to improve the feature extraction network of the DBNet model and enhance the transmission of low-level semantic information through the network; (3) a high-level feature enhancement (HLFE) module is proposed to make full use of the high-level semantic information of the network; and (4) a natural scene street sign (NSSS) dataset is constructed for natural scene street sign text detection and used to evaluate the effectiveness of the proposed method.
2. Related Work
In recent years, researchers in related fields have conducted a considerable amount of work, and the existing deep learning-based methods can be broadly classified into three categories: regression-based methods, segmentation-based methods, and methods based on a mixture of regression and segmentation.
Regression-based methods predict text regions through strategies such as convolution and linear regression. Naiemi et al. [11] introduced a pipeline based on a convolutional neural network to obtain more advanced visual features, and they proposed a new pixel-value encoding algorithm that highlights the texture of characters. Liu et al. [12] proposed a grouped channel combination block to implement data-driven anchor design and adaptive anchor assignment, along with a uniform loss weighting model to mitigate the inconsistency between classification scores and localization accuracy. Lu et al. [13] improved the bounding-box shrinkage algorithm, making the model more accurate in predicting the short edges of text regions, and added a feature enhancement module (FEM) to enlarge the receptive field of the model and strengthen its detection of long text regions. Wan et al. [14] used self-attention-based text knowledge mining (STKM) during training to induce the convolutional neural network (CNN) backbone to expose feature information ignored by the original pre-trained model, thus improving the detection performance of the backbone network. Although regression-based approaches have achieved good results in the detection of horizontal text, they are less effective at detecting curved and slanted text.
Segmentation-based approaches treat text detection as a semantic segmentation problem and detect text by segmenting irregularly shaped text regions. PixelLink [15] was the first to propose this idea and conduct related research: pixels in the same text instance were first connected without regressing the text position, and text boxes were then extracted directly from the segmentation results. Zhu et al. [16] proposed a text component extraction network for arbitrary-shape scene text detection that detects different text components through two parallel branches, a feature redistribution module (FRM) and an improved Transformer decoder, which generate accurate text components for detecting text instances. Zhu et al. [17] proposed a Fourier contour embedding method, which predicts the Fourier vector of a text instance in the Fourier domain and then reconstructs the text contour point sequence through the inverse Fourier transform in the image space domain; this approach can accurately approximate any closed shape. Hu et al. [18] proposed a text contour attention detector that can accurately locate text of any shape in any direction. Qiao et al. [19] proposed a recursive segmentation framework that expands the recursive path and refines previous feature maps into internal states to improve segmentation quality. Cai et al. [20] proposed a text detector that dynamically generates independent instance-aware convolution parameters for each text instance from multiple features, overcoming some otherwise insurmountable limitations of arbitrary text detection and effectively formulating arbitrary-shape scene text detection in terms of dynamic convolution.
Methods based on a mixture of regression and segmentation combine the features of regression and segmentation models to improve text detection performance. The EAST model proposed by Zhou et al. [21] can generate word- or line-level predictions directly from the complete image using a single neural network, simplifying the intermediate steps and leading to a substantial improvement in the accuracy and precision of the model. Li et al. [22] proposed an origin-independent coordinate regression loss and a text instance accuracy loss for a pixel-based text detector, which alleviate the impact of target vertex ordering and predict the location of text instances more accurately. Liu et al. [23] proposed a semi-supervised scene text detection framework (SemiText) that uses a pre-trained supervised model and an unlabeled dataset to train a scene text detector that is both robust and accurate.
Although deep learning methods that acquire high-level features through convolutional operations have achieved good results in recent years, they rely heavily on the tuning of network parameters and lack flexibility. Some researchers therefore still rely on traditional methods, such as MSER, for preliminary text region extraction from images. He et al. [24] developed a contrast-enhanced maximally stable extremal region algorithm (CE-MSER) and combined it with CNNs to increase the robustness of the detection network. Mittal et al. [25] used the characteristics of the DCT to find important information in the image by selecting multiple channels and studied the texture distribution with statistical measures to extract features; a deep learning model was then proposed to eliminate false positives and improve text detection performance. Hua et al. [26] combined MSER with a cloud of line distribution (COLD) approach to extract candidate text regions, and the extracted features were then sent to a CNN; this method exhibited better detection under low light.
Although the above methods have made some progress in the field of text detection, they still have the following problems in the detection of severely slanted and different-sized texts in complex scenes.
(1) Constrained by predefined candidate boxes, regression-based networks are less effective at detecting text with large tilt angles.
(2) Segmentation-based methods do not work well for detecting small text instances with low contrast and text instances with complex layouts in images.
(3) Most of the methods are not able to effectively remove the interference of complex backgrounds and strong light on text detection, resulting in false detection.
For street sign text detection in natural scenes, compared with existing methods, our proposed method was designed to eliminate the impact of complex backgrounds and illumination on detection, as well as the effects of the different sizes and shapes of text regions in the images. The detection process is shown in Figure 1. Although text in natural scenes is greatly affected by lighting and shooting angles, the pixel values and pixel relationships within text remain constant as the characters change, so a multi-channel MSER method is proposed to preprocess the image. Compared with the traditional MSER method, our method uses multiple channels to extract the maximally stable extremal regions, reducing the impact of complex scenes and strong light on text region detection. In addition, an improved DBNet network is proposed to further detect the text regions preliminarily extracted by the multi-channel MSER method. Compared with previous works, the newly added FPRE and HLFE modules make full use of the information in the feature extraction network and improve the network's detection of text of different shapes, sizes, and orientations.
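A minimal sketch of the channel preparation for the multi-channel MSER step is shown below. The channel set (R, G, B, and HSV saturation S) follows the paper; the exact saturation formula and the per-channel detection and merging noted in the comments are assumptions about implementation details, not the paper's confirmed code:

```python
import numpy as np

def mser_input_channels(img_rgb):
    """Split an RGB image into the four single-channel images on which
    MSER is run independently (R, G, B, and HSV saturation S).
    Running MSER per channel and merging the candidate regions is more
    robust to strong light than grayscale MSER alone (a sketch of the
    multi-channel idea; the fusion rule is an assumption)."""
    img = img_rgb.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    cmax = img.max(axis=-1)
    cmin = img.min(axis=-1)
    # HSV saturation: 0 for pure gray pixels; guard against division by zero.
    s = np.where(cmax > 0, (cmax - cmin) / np.maximum(cmax, 1e-6), 0.0)
    return r, g, b, s * 255.0  # scale S to the same 0-255 range

# Each returned channel would then be fed to an MSER detector, e.g.
# cv2.MSER_create().detectRegions(channel.astype(np.uint8)), and the
# candidate regions from all four channels merged before DBNet.
```

The saturation channel is what distinguishes this from plain per-color-plane MSER: strongly lit sign paint keeps high saturation even when its R/G/B intensities clip, so text regions washed out in the intensity channels can still form stable extremal regions in S.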
5. Conclusions and Future Work
This paper proposed an improved natural scene street sign text detection method based on a differentiable binarization network and created an NSSS dataset to better support this work. First, to address the interference of complex backgrounds and strong light with street sign text in natural scenes, the Canny operator was used to enhance the text boundaries, and the multi-channel MSER method was used to remove a large number of non-text regions, effectively reducing the interference of non-text regions and strong light with text detection. In addition, to address the fact that the original DBNet model did not make full use of high-level and low-level semantic information in the feature extraction network, this paper improved the feature pyramid network of the DBNet model so that the low-level and high-level semantic information of the network could be utilized more fully, enhancing the network's ability to detect the text information of street signs. The experimental results showed that the proposed method achieved significant improvements in natural scene text detection and is quite competitive with existing methods, which proved its effectiveness.
However, the detection effectiveness of our model decreased when the character spacing within a text line was large or the image was severely blurred, and the model also could not effectively detect other identifiers on street signs. Future work will therefore address the detection of text with large character spacing within a line and of other informative, non-textual identifiers on street signs, yielding a street sign text detection model with greater application value.