CSFF-Net: Scene Text Detection Based on Cross-Scale Feature Fusion
Abstract
1. Introduction
2. Related Work
3. Methods
3.1. Depth Weighted Convolution Module (DWCM)
3.2. 3D-Attention Module
3.3. Cross-Level Feature Pyramid Networks
3.4. Differentiable Binarization
4. Results
4.1. Datasets
- Total-Text [13] is a dataset for curved text detection. It contains curved text from commercial signs and shop entrances in real-life scenes, with 1555 images in total: 1255 for training and 300 for testing.
- MSRA-TD500 [25] is a multi-language, multi-category dataset of 500 images, 300 for training and 200 for testing. The photos capture signs, house numbers, and warning signs in indoor scenes, as well as guide plates and billboards against complex outdoor backgrounds.
- ICDAR2015 [24] is an English-language, line-level detection and recognition dataset of 1500 images, 1000 for training and 500 for testing. The images were captured incidentally by Google Glass without focusing on the text, so the dataset tests the generalization ability of detection models.
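The splits above can be summarized in a short sketch (the dictionary layout and names are illustrative, not taken from any released code):

```python
# Train/test splits of the three benchmark datasets described above.
DATASETS = {
    "Total-Text": {"train": 1255, "test": 300},   # curved text
    "MSRA-TD500": {"train": 300, "test": 200},    # multi-language, multi-oriented
    "ICDAR2015":  {"train": 1000, "test": 500},   # incidental English text
}

for name, split in DATASETS.items():
    total = split["train"] + split["test"]
    print(f"{name}: {total} images ({split['train']} train / {split['test']} test)")
```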
4.2. Loss Functions
4.3. Implementation Details
4.4. Ablation Study
4.4.1. 3D-Attention
4.4.2. DWCM
4.4.3. 3D-Attention+DWCM
4.4.4. Cross-Level FPN
4.5. Comparison with Previous Methods
4.5.1. Curved Text Detection
4.5.2. Multi-Oriented Text Detection
4.5.3. Multi-Language Text Detection
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Luo, Z. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv 2017, arXiv:1706.09579.
- He, T.; Tian, Z.; Huang, W.; Shen, C.; Qiao, Y.; Sun, C. An end-to-end TextSpotter with explicit alignment and attention. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5020–5029.
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4161–4167.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2550–2558.
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 56–72.
- Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918.
- Lyu, P.; Yao, C.; Wu, W.; Yan, S.; Bai, X. Multi-oriented scene text detection via corner localization and region segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7553–7563.
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560.
- He, W.; Zhang, X.Y.; Yin, F.; Liu, C.L. Deep direct regression for multi-oriented scene text detection. In Proceedings of the 2017 IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 745–753.
- Xue, C.; Lu, S.J.; Zhan, F.N. Accurate scene text detection through border semantics awareness and bootstrapping. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 370–387.
- Zhang, C.; Liang, B.; Huang, Z.; En, M.; Han, J.; Ding, E.; Ding, X. Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10552–10561.
- Ch’ng, C.K.; Chan, C.S. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 935–942.
- Baek, Y.M.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9365–9374.
- Yao, C.; Bai, X.; Sang, N.; Zhou, X.; Zhou, S.; Cao, Z. Scene text detection via holistic, multi-channel prediction. arXiv 2016, arXiv:1606.09002.
- Yuliang, L.; Lianwen, J.; Shuaitao, Z.; Sheng, Z. Detecting curve text in the wild: New dataset and new solution. arXiv 2017, arXiv:1712.02170.
- Lyu, P.; Liao, M.; Yao, C.; Wu, W.; Bai, X. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 67–83.
- Deng, D.; Liu, H.; Li, X.; Cai, D. PixelLink: Detecting scene text via instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6773–6780.
- Tian, Z.; Shu, M.; Lyu, P.; Li, R.; Zhou, C.; Shen, X.; Jia, J. Learning shape-aware embedding for scene text detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4234–4243.
- Wang, W.H.; Xie, E.; Song, X.G.; Zang, Y.H.; Wang, W.J.; Lu, T.; Yu, G.; Shen, C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 8439–8448.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551.
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on Robust Reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1156–1160.
- Yao, C.; Bai, X.; Liu, W.Y.; Ma, Y.; Tu, Z.W. Detecting texts of arbitrary orientations in natural images. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1083–1090.
- Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 15–17 June 2010; pp. 2963–2970.
- Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 2004, 22, 761–767.
- Zhang, Z.; Shen, W.; Yao, C.; Bai, X. Symmetry-based text line detection in natural scenes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 2558–2567.
- Wang, X.B.; Jiang, Y.Y.; Luo, Z.B.; Liu, C.L.; Choi, H.; Kim, J. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6449–6458.
- Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 19–35.
- Wang, Q.H.; Xie, E.; Li, X.; Hou, W.B.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9336–9345.
- Zhu, X.A.; Cheng, D.Z.; Zhang, Z.; Lin, S.; Dai, J.F. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6687–6696.
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2014, arXiv:1312.6229.
- Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11474–11481.
- He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 2315–2324.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Freeman, H.; Shapira, R. Determining the minimum-area encasing rectangle for an arbitrary closed curve. Commun. ACM 1975, 18, 409–413.
- Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
Overall structure of CSFF-Net:

Module | Component | Role
---|---|---
Backbone | ResNeDt (18/50) | Generating multi-level features
Neck (Cross-Level FPN) | CLFFM | Generating correction and output values
 | CLFFM | Generating correction and output values
 | CLFFM | Generating output values
Head | Differentiable Binarization (probability and threshold maps → binary map) | Generating prediction boxes
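The head follows the differentiable binarization of Liao et al. (cited in the references), which replaces the hard binarization step with the sigmoid-shaped surrogate B = 1/(1 + e^(-k(P - T))), where P is the probability map, T the learned threshold map, and k an amplification factor (50 in the original paper). A minimal NumPy sketch of this formula:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map B = 1 / (1 + exp(-k * (P - T))).

    Unlike a hard step function, this surrogate is differentiable in both
    P and T, so the threshold map can be learned jointly with the network.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Pixels well above the threshold saturate toward 1, those below toward 0.
P = np.array([[0.9, 0.6], [0.3, 0.1]])
T = np.full_like(P, 0.5)
B = differentiable_binarization(P, T)
```

The large k sharpens the transition around P = T, so the approximate binary map stays close to a true 0/1 mask while remaining trainable end to end.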
Ablation study on ICDAR 2015 (√ indicates the module is enabled):

Backbone | DWCM | 3D-Attention | Cross-Level FPN | P | R | F
---|---|---|---|---|---|---
ResNet-18 | | | | 89.3 | 73.8 | 80.8
ResNet-18 | √ | | | 86.9 | 76.0 | 81.1
ResNet-18 | | √ | | 87.5 | 76.2 | 81.5
ResNet-18 | √ | √ | | 87.6 | 77.7 | 82.3
ResNet-18 | | | √ | 88.6 | 76.3 | 82.0
ResNet-18 | √ | √ | √ | 86.4 | 79.2 | 82.7
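Throughout the tables, P, R, and F denote precision, recall, and F-measure, where F is the harmonic mean F = 2PR/(P + R). For instance, the baseline row reproduces as:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (all values in percent)."""
    return 2 * precision * recall / (precision + recall)

# Baseline ResNet-18 row: P = 89.3, R = 73.8 gives F ≈ 80.8.
print(round(f_measure(89.3, 73.8), 1))
```

Recomputing F from the rounded P and R values can differ from the tabulated F by about 0.1, since the published figures are rounded independently.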
Detection results on the Total-Text dataset:

Method | P (%) | R (%) | F (%)
---|---|---|---
TextSnake (Long et al., 2018) | 82.7 | 74.5 | 78.4 |
ATRR (Wang et al., 2019b) | 80.9 | 76.2 | 78.5 |
MTS (Lyu et al., 2018a) | 82.5 | 75.6 | 78.6 |
TextField (Xu et al., 2019) | 81.2 | 79.9 | 80.6 |
LOMO (Zhang et al., 2019) * | 87.6 | 79.3 | 83.3 |
CRAFT (Baek et al., 2019) | 87.6 | 79.9 | 83.6 |
CSE (Liu et al., 2019b) | 81.4 | 79.1 | 80.2 |
PSE-1s (Wang et al., 2019a) | 84.0 | 78.0 | 80.9 |
DB-ResNet-18 (800 × 800) | 86.7 | 75.4 | 80.7 |
CSFF-ResNeDt-18 (800 × 800) | 87.4 | 77.3 | 82.1 |
DB-ResNet-50 (800 × 800) | 84.3 | 78.4 | 81.3 |
CSFF-ResNeDt-50 (800 × 800) | 86.6 | 78.9 | 82.6 |
Detection results on the ICDAR 2015 dataset:

Method | P (%) | R (%) | F (%)
---|---|---|---
EAST (ZHOU et al., 2017) | 83.6 | 73.5 | 78.2 |
Corner (Lyu et al., 2018b) | 94.1 | 70.7 | 80.7 |
RRD (Liao et al., 2018) | 85.6 | 79.0 | 82.2 |
PSE-1s (Wang et al., 2019a) | 86.9 | 84.5 | 85.7 |
SPCNet (Xie et al., 2019a) | 88.7 | 85.8 | 87.2 |
LOMO (Zhang et al., 2019) | 91.3 | 83.5 | 87.2 |
CRAFT (Baek et al., 2019) | 89.8 | 84.3 | 86.9 |
SAE (Tian et al., 2019) | 88.3 | 85.0 | 86.6 |
DB-ResNet-18 (1280 × 736) | 89.3 | 73.8 | 80.8 |
CSFF-ResNeDt-18 (1280 × 736) | 86.4 | 79.2 | 82.7 |
DB-ResNet-50 (1280 × 736) | 88.6 | 77.8 | 82.9 |
CSFF-ResNeDt-50 (1280 × 736) | 90.6 | 77.3 | 83.4 |
DB-ResNet-50 (2048 × 1152) | 89.8 | 79.3 | 84.2 |
CSFF-ResNeDt-50 (2048 × 1152) | 89.6 | 81.1 | 85.1 |
Detection results on the MSRA-TD500 dataset:

Method | P (%) | R (%) | F (%)
---|---|---|---
(He et al., 2016b) | 71.0 | 61.0 | 69.0 |
DeepReg (He et al., 2017b) | 77.0 | 70.0 | 74.0 |
RRPN (Ma et al., 2018) | 82.0 | 68.0 | 74.0 |
RRD (Liao et al., 2018) | 87.0 | 73.0 | 79.0 |
MCN (Liu et al., 2018) | 88.0 | 79.0 | 83.0 |
PixelLink (Deng et al., 2018) | 83.0 | 73.2 | 77.8 |
Corner (Lyu et al., 2018b) | 87.6 | 76.2 | 81.5 |
TextSnake (Long et al., 2018) | 83.2 | 73.9 | 78.3 |
(Xue, Lu, and Zhan 2018) | 83.0 | 77.4 | 80.1 |
(Xue, Lu, and Zhang 2019) | 87.4 | 76.7 | 81.7 |
CRAFT (Baek et al., 2019) | 88.2 | 78.2 | 82.9 |
SAE (Tian et al., 2019) | 84.2 | 81.7 | 82.9 |
DB-ResNet-18 (512 × 512) | 85.7 | 73.2 | 79.0 |
CSFF-ResNeDt-18 (512 × 512) | 88.8 | 77.7 | 82.9 |
DB-ResNet-18 (736 × 736) | 90.4 | 76.3 | 82.8 |
CSFF-ResNeDt-18 (736 × 736) | 87.8 | 81.8 | 84.7 |
DB-ResNet-50 (736 × 736) | 91.5 | 79.2 | 84.9 |
CSFF-ResNeDt-50 (736 × 736) | 89.4 | 82.3 | 85.7 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, Y.; Ibrayim, M.; Hamdulla, A. CSFF-Net: Scene Text Detection Based on Cross-Scale Feature Fusion. Information 2021, 12, 524. https://doi.org/10.3390/info12120524