1. Introduction
Scene text spotting, i.e., the detection and recognition of scene text in a unified network, is one of the most valuable tasks in the field of computer vision. Typical applications include industrial automatic inspection, smart vehicles, text retrieval, and advanced human-computer interfaces. In the past few years, convolutional neural network (CNN)-based scene text spotting has made remarkable progress. Jaderberg et al. [1] first adopted a CNN to detect and recognize scene text. Borisyuk et al. [2] employed Faster-RCNN [3] to detect scene text and a CTC loss [4] to recognize it. Liu et al. [5] proposed an adaptive Bézier-curve network to detect and recognize curved scene text in real time.
In recent years, segmentation-based spotting approaches have achieved great performance on arbitrary-shaped scene text. Zhou et al. [6] adopted fully convolutional networks (FCN) [7] to segment the text regions and predict their contours. Wang et al. [8] modified feature pyramid networks (FPN) [9] to predict segmentation results for each scene text region and then used the PSE algorithm to compute the final detection result. To accelerate the PSE algorithm, Wang et al. [10] proposed a text kernel representation to spot scene text. However, when these methods are applied to dense scene text containing various instances of bending, occlusion, lighting, and other difficult situations, they struggle to achieve satisfactory results.
Recently, Transformer [11]-based network architectures have been proposed to establish self-attention mechanisms. Compared with CNN-based approaches, the Transformer can model the global contexts of sequences in parallel and has achieved great progress in machine translation and natural language processing. More recently, the Transformer has been adapted for computer vision and has achieved state-of-the-art performance in image classification [12,13], medical image segmentation [14], and object detection [13]. However, if the Transformer is directly used to encode tokenized image patches for segmentation and detection tasks, the results are usually unsatisfactory [14]. This is because the Transformer cannot extract local low-level visual cues well, a weakness that can be compensated by CNN architectures (e.g., U-Net [15] and FPN [9]).
Inspired by the success of the Swin Transformer [13] on object detection tasks, we propose a method called Rwin-FPN++, which incorporates the long-range dependency merit of the Rwin Transformer into the feature pyramid network (FPN) to effectively enhance the functionality and generalization of the FPN. Specifically, we first propose the rotated windows-based Transformer (Rwin) to enhance the rotation invariance of self-attention. Second, we attach the Rwin Transformer to each level of our feature pyramid to extract global self-attention contexts for each feature map produced by the FPN. Third, we fuse these feature pyramids by upsampling to predict the score matrix and keypoints matrix of the text regions. Fourth, a simple post-processing step is adopted to precisely merge the pixels in the score matrix and keypoints matrix and obtain the final segmentation results. Finally, we use a recurrent neural network to recognize each segmented region and thus obtain the final spotting results. To evaluate the performance of our Rwin-FPN++ network, we construct a dense scene text dataset with various shapes and occlusions from the substation secondary circuit cabinet wiring site. We train our Rwin-FPN++ network on public datasets and then evaluate its performance on our dense scene text dataset (Figure 1). The experiments demonstrate that our Rwin-FPN++ network achieves better spotting performance for dense scene text than state-of-the-art approaches.
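The overall data flow of the detection branch can be sketched in PyTorch-style pseudocode as follows. This is a simplified illustration rather than our exact implementation: the `fpn` module, the `rwin_block` factory, and the channel sizes are placeholders, and the recognition branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RwinFPNpp(nn.Module):
    """Simplified sketch of the Rwin-FPN++ detection branch.

    A Rwin Transformer block is attached to every FPN level; the
    enhanced maps are upsampled to a common resolution, fused, and
    decoded into a text score matrix and a keypoints matrix.
    """

    def __init__(self, fpn, rwin_block, channels=256, num_levels=4, num_keypoints=4):
        super().__init__()
        self.fpn = fpn  # backbone + FPN producing `num_levels` multi-scale maps
        # one Rwin Transformer attached to each pyramid level
        self.rwins = nn.ModuleList(rwin_block() for _ in range(num_levels))
        fused_channels = channels * num_levels  # assumes all levels share `channels`
        self.score_head = nn.Conv2d(fused_channels, 1, kernel_size=1)
        self.keypoint_head = nn.Conv2d(fused_channels, num_keypoints, kernel_size=1)

    def forward(self, images):
        feats = self.fpn(images)  # list of (B, C, H_i, W_i) feature maps
        # global self-attention context for each level
        feats = [rwin(f) for rwin, f in zip(self.rwins, feats)]
        # upsample every level to the finest resolution and fuse
        size = feats[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
             for f in feats],
            dim=1,
        )
        score = torch.sigmoid(self.score_head(fused))         # text score matrix
        keypoints = torch.sigmoid(self.keypoint_head(fused))  # keypoints matrix
        return score, keypoints
```

Upsampling all levels to the finest resolution before fusion preserves the fine spatial detail needed for dense, small text instances.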
Our main contributions are as follows:
- (1)
We improve the Swin Transformer network [13] into the Rwin Transformer network. Compared with the shifted windows-based Transformer (Swin Transformer), the rotated windows-based Transformer (Rwin Transformer) achieves better rotational invariance of the self-attention mechanism. Because scene text detection involves a large number of rotated and distorted texts, we modify the Swin Transformer by adding a rotated-window self-attention mechanism, so that our network pays stronger attention to rotated and distorted scene text (see the conceptual sketch after this list).
- (2)
We combine the Rwin Transformer with the feature pyramid network to detect and recognize dense scene text. The Rwin Transformer is used to enhance the rotational invariance of the self-attention mechanism. The feature pyramid network is adopted to extract local low-level visual cues of scene text.
- (3)
A dense scene text dataset was constructed to evaluate the performance of our Rwin-FPN++ network. Its 620 pictures were taken from the wiring of terminal blocks in substation panel cabinets. The text instances in these pictures are very dense, with horizontal, multi-oriented, and curved shapes. The dataset can be downloaded from https://github.com/cbzeng110/-DenseTextDetection (accessed on 10 February 2022).
- (4)
The experiments show that our Rwin-FPN++ network achieves an F-measure of 79% on our dense scene text dataset. It outperforms all previous methods by at least 2.8% in F-measure and achieves state-of-the-art spotting performance.
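The rotated-window idea in contribution (1) can be illustrated with the following conceptual sketch. It is not our exact formulation: here the rotation is approximated by resampling the feature map with `grid_sample`, the rotation angle is treated as a free hyperparameter, and `window_attn` stands in for any standard windowed self-attention module.

```python
import math
import torch
import torch.nn.functional as F

def rotate_feature_map(x, angle_deg):
    """Rotate a (B, C, H, W) feature map around its centre by resampling."""
    theta = math.radians(angle_deg)
    cos, sin = math.cos(theta), math.sin(theta)
    # 2x3 affine rotation matrix in normalized coordinates, one per batch item
    mat = x.new_tensor([[cos, -sin, 0.0],
                        [sin,  cos, 0.0]]).expand(x.size(0), 2, 3)
    grid = F.affine_grid(mat, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def rwin_attention(x, window_attn, angle_deg=45.0):
    """Conceptual rotated-window self-attention.

    Rotate the map, apply regular windowed self-attention on the
    rotated grid, then apply the inverse rotation to restore the
    original orientation.
    """
    rotated = rotate_feature_map(x, angle_deg)
    attended = window_attn(rotated)  # any standard window self-attention
    return rotate_feature_map(attended, -angle_deg)
```

Rotating the features before the regular window partition lets the same windowed attention attend along rotated text directions, which is the intuition behind the improved rotational invariance.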
2. Related Work
We describe the related work from three aspects: scene text detection, scene text recognition, and scene text spotting.
Scene Text Detection. In the past few years, great progress has been achieved in deep learning-based scene text detection. Zhou et al. [6] adopted fully convolutional networks (FCN) [7] to segment the text regions and predict their contours. Wang et al. [8] modified feature pyramid networks (FPN) [9] to predict segmentation results for each scene text region and then used the PSE algorithm to compute the final detection result. Liao et al. [16] modified the SSD [17] algorithm to detect arbitrary-shaped scene text. Zhu et al. [18] detected text regions in the Fourier domain and proposed a Fourier Contour Embedding (FCE) method to represent curved text contours. Dai et al. [19] proposed a progressive contour regression approach to detect scene texts with various aspect ratios. Recently, the Transformer has been adapted for computer vision and has achieved state-of-the-art performance in image classification [12,13] and object detection [13]. Tang et al. [20] proposed a simple and effective Transformer-based scene text detection network, mainly composed of a feature sampling module and a feature combination module. As noted in Section 1, however, directly using the Transformer to encode tokenized image patches for segmentation and detection tasks usually yields unsatisfactory results [14], because the Transformer cannot extract local low-level visual cues well, a weakness that CNN architectures (e.g., U-Net [15] and FPN [9]) can compensate for.
Scene Text Recognition. The task of scene text recognition is to identify the text content of a segmented scene text area. Shi et al. [21] adopted an RNN to model the visual feature sequences produced by a CNN and achieved highly competitive performance on scene text recognition. Qiao et al. [22] robustly recognized low-quality scene texts using an enhanced encoder-decoder network. Aberdam et al. [23] recognized scene text by extending contrastive learning methods. Fang et al. [24] proposed the bidirectional and autonomous ABINet to recognize scene text.
Scene Text Spotting. The purpose of scene text spotting is to detect and recognize scene text in a unified network. Jaderberg et al. [1] first adopted a CNN to detect and recognize scene text. Liao et al. [16] modified the SSD [17] algorithm to detect arbitrary-shaped scene text. Borisyuk et al. [2] employed Faster-RCNN [3] to detect scene text and a CTC loss [4] to recognize it. Wang et al. [25] proposed a fully convolutional point-gathering network (PGNet) to recognize multi-oriented text instances in real time. Wang et al. [10] treated the text line as a text kernel and proposed an end-to-end network for curved text spotting. Liu et al. [5] proposed an adaptive Bézier-curve network to detect and recognize arbitrary-shaped scene text in real time. However, these methods mainly focus on spotting arbitrary-shaped scene text and struggle to achieve satisfactory results on dense scene text containing various instances of bending, occlusion, and lighting.
7. Future Work
Although our Rwin-FPN++ network achieves state-of-the-art spotting performance, it still has the following limitations: (1) Four Rwin Transformers are used in our network structure, which increases the number of parameters and the computational cost. Thus, the final spotting speed of our system is only 8 FPS, which does not meet real-time requirements. In the future, we will continue to optimize the network structure to achieve real-time computing capability. (2) The detection branch and the recognition branch of our network are computed separately without shared feature extraction, resulting in unnecessary repeated computation. In the future, we will improve our network structure so that detection and recognition can share features, further improving the spotting speed. (3) We intend to port the Rwin-FPN++ approach to mobile devices to serve the automatic inspection of substations.