AT-Text: Assembling Text Components for Efficient Dense Scene Text Detection
Abstract
1. Introduction
- (1) We propose a novel bottom-up dense text detection framework that, without character-level annotation, generates parsimonious text components and assembles them into words/text lines. This gives the framework the flexibility to exploit local information and to detect multi-language text (a minimal sketch of this pipeline follows the list).
- (2) We employ a segmentation model that encodes multi-scale text features, considerably improving the classification accuracy of candidate text components. This allows AT-Text to efficiently filter out false-positive components, making it effective on densely arranged text instances.
- (3) Experiments demonstrate that AT-Text achieves highly competitive performance on popular datasets and adapts well to dense text instances and other challenging scenarios.
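A minimal sketch of the bottom-up idea behind contributions (1) and (2) is given below: candidate components are kept only when the segmentation mask marks a sufficient fraction of their pixels as text, and the surviving components are greedily linked into text lines. The function names, box format, and thresholds here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: filter candidate components with a text/background
# segmentation mask, then greedily assemble the survivors into text lines.
import numpy as np

def filter_components(components, text_mask, min_text_ratio=0.5):
    """Keep a candidate box (x, y, w, h) only if enough of its pixels are
    predicted as text (mask values in {0, 1})."""
    kept = []
    for x, y, w, h in components:
        region = text_mask[y:y + h, x:x + w]
        if region.size and region.mean() >= min_text_ratio:
            kept.append((x, y, w, h))
    return kept

def assemble_text_lines(components, max_gap=1.5):
    """Crude greedy linking: sort boxes left to right and chain a box to the
    previous one when the horizontal gap is small relative to its height and
    the two boxes are roughly vertically aligned."""
    lines, current = [], []
    for x, y, w, h in sorted(components):
        if current:
            px, py, pw, ph = current[-1]
            if x - (px + pw) <= max_gap * ph and abs(y - py) <= 0.5 * ph:
                current.append((x, y, w, h))
                continue
            lines.append(current)
        current = [(x, y, w, h)]
    if current:
        lines.append(current)
    return lines

if __name__ == "__main__":
    mask = np.zeros((100, 300), dtype=np.uint8)
    mask[40:60, 10:290] = 1                      # a synthetic horizontal text band
    boxes = [(10, 40, 30, 20), (45, 41, 30, 20), (200, 10, 30, 20)]
    kept = filter_components(boxes, mask)        # the third box falls on background
    print(assemble_text_lines(kept))             # the two text boxes form one line
```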
2. Related Works
3. The Proposed Method
3.1. Pipeline
3.2. Segmentation Model
3.3. Text Components Generation
3.4. Assembling Text Components
4. Experiments
4.1. Datasets and Evaluation
4.2. The Network Training
4.3. The Ablation Study
- Single: uses only the feature map from the last stage of the VGG16 backbone to produce the segmentation mask.
- Fuse: hierarchically fuses feature maps, upsampled to the same size as the input image, as described in Section 3.3, to encode a robust text representation.
- Δ (delta): the number in parentheses is the intensity-contrast variation (delta) used in the MSER algorithm. Multiple numbers mean that we pool the text components generated by all of these variations (an illustrative sketch follows this list).
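As a rough illustration of the Δ settings (not the authors' code), the sketch below pools MSER candidate components detected at several delta values with OpenCV, mirroring the Δ(1,5,10) grouping; the input image path is hypothetical.

```python
# Illustrative only: pool MSER components detected at several intensity-contrast
# steps (delta), as in the Δ(1,5,10) ablation setting.
import cv2

def generate_candidates(gray, deltas=(1, 5, 10)):
    """Run MSER once per delta value and pool the detected bounding boxes."""
    candidates = []
    for delta in deltas:
        mser = cv2.MSER_create(delta)            # first argument is the Δ step
        regions, boxes = mser.detectRegions(gray)
        candidates.extend(boxes.tolist())        # boxes are (x, y, w, h)
    return candidates

if __name__ == "__main__":
    image = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
    if image is not None:
        print(len(generate_candidates(image)), "candidate components")
```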
4.4. Comparison Results
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
| Delta (Δ) | Total Number | Text-Background | Ratio of Text-Background | Only-Text | Ratio of Only-Text |
|---|---|---|---|---|---|
| Δ = 1 | 31,494 | 6469 | 20.54% | 458 | 1.45% |
| Δ = 5 | 23,702 | 4912 | 20.72% | 311 | 1.32% |
| Δ = 10 | 13,299 | 3138 | 23.60% | 233 | 1.75% |
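As a quick sanity check on the table above (assuming each ratio column is simply the corresponding count divided by the total number of components), the following snippet reproduces the percentages up to rounding:

```python
# Worked check of the ratio columns: count / total number of components.
rows = {1: (31494, 6469, 458), 5: (23702, 4912, 311), 10: (13299, 3138, 233)}
for delta, (total, text_bg, only_text) in rows.items():
    print(f"Δ={delta}: {text_bg / total:.2%} text-background, {only_text / total:.2%} only-text")
# Δ=1 prints 20.54% and 1.45%, as in the table; the other rows agree up to rounding.
```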
| Strategies | Precision (%) | Recall (%) | F-Measure (%) |
|---|---|---|---|
| Single + Δ(10) | 64.03 | 63.36 | 63.70 |
| Single + Δ(1) | 74.89 | 78.04 | 76.44 |
| Single + Δ(1,5,10) | 75.06 | 78.32 | 76.66 |
| Fuse + Δ(5) | 82.24 | 75.48 | 78.72 |
| Fuse + Δ(1) | 86.19 | 80.24 | 83.10 |
| Fuse + Δ(1,5,10) | 87.93 | 80.56 | 84.07 |
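The F-Measure column is the harmonic mean of Precision and Recall, F = 2PR / (P + R); for example, for the Fuse + Δ(5) row:

```python
# Quick check of the F-Measure for the Fuse + Δ(5) setting.
p, r = 82.24, 75.48
print(f"{2 * p * r / (p + r):.2f}")  # 78.72, matching the table
```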