Scene Text Detection with Polygon Offsetting and Border Augmentation
Abstract
1. Introduction
- In addition to text pixel masks, we employ offset masks and text instance borders to represent text instances, which makes adjacent text instances easier to distinguish.
- We propose a post-processing pipeline for predicting text instance locations, which yields higher accuracy with only a slight increase in inference time.
- Experimental results show that the proposed method achieves competitive accuracy on standard benchmarks.
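The following is a minimal sketch, not the authors' implementation, of how a text pixel mask and a border band might be rasterized from one polygon annotation. It assumes the polygon offset distance d = A(1 − r²)/L used by PSENet [22], where A is the polygon area, L its perimeter and r a shrink ratio (r = 0.4 is a hypothetical default). Border pixels are text pixels within d of the outline, computed per pixel instead of with a polygon-clipping library. The paper's offset mask is not modeled, and all function names are illustrative.

```python
import math

def polygon_area(poly):
    """Absolute area of a simple polygon (shoelace formula)."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
            for i in range(n))
    return abs(s) / 2.0

def polygon_perimeter(poly):
    n = len(poly)
    return sum(math.dist(poly[i], poly[(i + 1) % n]) for i in range(n))

def offset_distance(poly, ratio=0.4):
    """PSENet-style offset d = A * (1 - r^2) / L; ratio=0.4 is an assumed default."""
    return polygon_area(poly) * (1.0 - ratio ** 2) / polygon_perimeter(poly)

def inside(x, y, poly):
    """Even-odd ray-casting point-in-polygon test."""
    result = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                result = not result
    return result

def segment_distance(px, py, a, b):
    """Euclidean distance from point (px, py) to segment ab."""
    (ax, ay), (bx, by) = a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    t = 0.0 if length_sq == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - ax - t * dx, py - ay - t * dy)

def build_masks(poly, height, width, ratio=0.4):
    """Rasterize one polygon into a text mask and a border mask (lists of 0/1 rows).

    A pixel (sampled at its centre) is text if it lies inside the polygon, and
    border if it is text and within d of the outline; text minus border is the
    polygon shrunk inward by d.
    """
    d = offset_distance(poly, ratio)
    n = len(poly)
    text = [[0] * width for _ in range(height)]
    border = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            cx, cy = x + 0.5, y + 0.5
            if not inside(cx, cy, poly):
                continue
            text[y][x] = 1
            if min(segment_distance(cx, cy, poly[i], poly[(i + 1) % n]) for i in range(n)) <= d:
                border[y][x] = 1
    return text, border

# Usage: a 16 x 8 axis-aligned box inside a 20 x 12 image.
box = [(2, 2), (18, 2), (18, 10), (2, 10)]
text, border = build_masks(box, height=12, width=20)
print(round(offset_distance(box), 2), sum(map(sum, text)), sum(map(sum, border)))
# prints: 2.24 128 80
```

In this sketch, the pixels that are text but not border (48 in the example) form a shrunk core. Cores of two touching words stay disconnected even when their full masks merge, which is one plausible way a border band helps separate contiguous instances.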
2. Related Works
3. Proposed Method
3.1. Text Representation
3.2. Network Structure
3.3. Loss Function
3.4. Text Instance Inference
4. Experiments
4.1. Datasets
4.2. Implementation Details
- Photometric distortion, as described in [32].
- Image rotation in the range , plus horizontal and vertical flips with a probability of 0.5.
- Image rescaling by a random factor in the range [0.5, 3].
- Random cropping to 512 × 512 pixels.
- Mean and standard deviation normalization.
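The augmentation list above can be sketched as a parameter sampler plus the matching geometric transform for polygon annotations. This is an illustrative sketch only. The rotation range is not stated in the text, so `max_rotation_deg` is a hypothetical parameter. Photometric distortion [32] is omitted, rotation is applied about the image centre without enlarging the canvas, and the function names are invented for illustration.

```python
import math
import random

CROP_SIZE = 512  # crop side length stated in the text

def sample_params(height, width, rng, max_rotation_deg=10.0):
    """Sample one set of augmentation parameters.

    max_rotation_deg is a hypothetical default; the paper's range is not given here.
    """
    scale = rng.uniform(0.5, 3.0)
    new_h, new_w = round(height * scale), round(width * scale)
    return {
        "angle": rng.uniform(-max_rotation_deg, max_rotation_deg),
        "hflip": rng.random() < 0.5,
        "vflip": rng.random() < 0.5,
        "scale": scale,
        # Crop origin keeps the 512 x 512 window inside the rescaled image when it
        # fits; smaller images would need padding (not shown).
        "top": rng.randint(0, max(0, new_h - CROP_SIZE)),
        "left": rng.randint(0, max(0, new_w - CROP_SIZE)),
    }

def transform_polygon(poly, params, height, width):
    """Apply rotation, flips, rescaling and cropping to polygon vertices."""
    cx, cy = width / 2.0, height / 2.0
    theta = math.radians(params["angle"])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = []
    for x, y in poly:
        # Rotate about the image centre (canvas size kept unchanged).
        dx, dy = x - cx, y - cy
        x, y = cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t
        if params["hflip"]:
            x = width - x
        if params["vflip"]:
            y = height - y
        x, y = x * params["scale"], y * params["scale"]
        # Shift into the coordinate frame of the crop window.
        out.append((x - params["left"], y - params["top"]))
    return out

def normalize(pixels, mean, std):
    """Per-channel mean/std normalization of a list of (r, g, b) tuples."""
    return [tuple((v - m) / s for v, m, s in zip(px, mean, std)) for px in pixels]

rng = random.Random(0)
params = sample_params(720, 1280, rng)
print(transform_polygon([(100, 100), (300, 100), (300, 150), (100, 150)], params, 720, 1280))
```

The same sampled parameters have to be applied to the image pixels as well, so that the image and its polygon annotations stay aligned; only the annotation side is shown here.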
4.3. Results
4.3.1. Multi-Oriented English Text
4.3.2. Multi-Oriented and Multi-Language Text
4.3.3. Multi-Oriented and Curved English Text
4.4. Speed Analysis
4.5. Border Augmentation
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting Text in Natural Image with Connectionist Text Proposal Network. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 56–72.
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
- Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection. arXiv 2017, arXiv:1706.09579.
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An Efficient and Accurate Scene Text Detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-Sensitive Regression for Oriented Scene Text Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Liao, M.; Shi, B.; Bai, X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Trans. Image Process. 2018, 27, 3676–3690.
- Shi, B.; Bai, X.; Belongie, S. Detecting Oriented Text in Natural Images by Linking Segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Deng, D.; Liu, H.; Li, X.; Cai, D. PixelLink: Detecting scene text via instance segmentation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 630–645.
- Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on Robust Reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160.
- Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. ICDAR 2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification—RRC-MLT. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454–1459.
- Nayef, N.; Patel, Y.; Busta, M.; Chowdhury, P.N.; Karatzas, D.; Khlif, W.; Matas, J.; Pal, U.; Burie, J.C.; Liu, C.; et al. ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition—RRC-MLT-2019. arXiv 2019, arXiv:1907.00945.
- Ch’ng, C.K.; Chan, C.S. Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 935–942.
- Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2963–2970.
- Neumann, L.; Matas, J. Real-time scene text localization and recognition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3538–3545.
- Cho, H.; Sung, M.; Jun, B. Canny Text Detector: Fast and Robust Scene Text Localization Algorithm. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3566–3573.
- Zhang, C.; Liang, B.; Huang, Z.; En, M.; Han, J.; Ding, E.; Ding, X. Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Xie, E.; Zang, Y.; Shao, S.; Yu, G.; Yao, C.; Li, G. Scene Text Detection with Supervised Pyramid Context Network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9038–9045.
- Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape Robust Text Detection With Progressive Scale Expansion Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’15), Montreal, QC, Canada, 7–12 December 2015; Volume 1, pp. 91–99.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
- Salehi, S.S.M.; Erdogmus, D.; Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. In Machine Learning in Medical Imaging; Wang, Q., Shi, Y., Suk, H.I., Suzuki, K., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 379–387.
- Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 1.
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic Data for Text Localisation in Natural Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-based Object Detectors with Online Hard Example Mining. arXiv 2016, arXiv:1604.03540.
- Howard, A.G. Some Improvements on Deep Convolutional Neural Network Based Image Classification. arXiv 2013, arXiv:1312.5402.
- Li, Y.; Yu, Y.; Li, Z.; Lin, Y.; Xu, M.; Li, J.; Zhou, X. Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks. arXiv 2018, arXiv:1811.07432.
- Liu, J.; Liu, X.; Sheng, J.; Liang, D.; Li, X.; Liu, Q. Pyramid Mask Text Detector. arXiv 2019, arXiv:1903.11800.
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9365–9374.
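
P: precision; R: recall; F: F-measure (all values in %).

Method | ICDAR 2015 P | ICDAR 2015 R | ICDAR 2015 F | ICDAR 2017 P | ICDAR 2017 R | ICDAR 2017 F | ICDAR 2019 P | ICDAR 2019 R | ICDAR 2019 F | Total-Text P | Total-Text R | Total-Text F |
---|---|---|---|---|---|---|---|---|---|---|---|---|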
CTPN [1] | 51.6 | 74.2 | 60.9 | - | - | - | - | - | - | - | - | - |
EAST [4] | 80.5 | 72.8 | 76.4 | - | - | - | - | - | - | - | - | - |
SegLink [8] | 73.1 | 76.8 | 75.0 | - | - | - | - | - | - | - | - | - |
TextBoxes++ [7] | 87.2 | 76.7 | 81.7 | - | - | - | - | - | - | - | - | - |
R2CNN [3] | 85.6 | 79.7 | 82.5 | - | - | - | - | - | - | - | - | - |
PixelLink [9] | 85.5 | 82.5 | 83.7 | - | - | - | - | - | - | - | - | - |
TextSnake [21] | 84.9 | 80.4 | 82.6 | - | - | - | - | - | - | 82.7 | 74.5 | 78.4 |
PSENet [22] | 88.7 | 85.5 | 87.1 | 75.4 | 69.2 | 72.1 | - | - | - | 84.0 | 78.0 | 80.9 |
SPCNET [20] | 88.7 | 85.8 | 87.2 | 73.4 | 66.9 | 70.0 | - | - | - | 83.0 | 82.8 | 82.9 |
Pixel-Anchor [33] | 88.3 | 87.1 | 87.7 | 79.5 | 59.5 | 68.1 | - | - | - | - | - | - |
PMTD [34] | 91.3 | 87.4 | 89.3 | 85.2 | 72.7 | 78.5 | 87.5 | 78.1 | 82.5 | - | - | - |
CRAFT [35] | 89.8 | 84.3 | 86.9 | 80.6 | 68.2 | 73.9 | 81.4 | 62.7 | 70.9 | 87.6 | 79.9 | 83.6 |
LOMO [19] | 91.2 | 83.5 | 87.2 | 78.8 | 60.6 | 68.5 | 87.7 | 79.8 | 83.6 | 87.6 | 79.3 | 83.3 |
Our Method (ResNet-50 without BA) | 87.2 | 84.9 | 86.0 | 76.8 | 67.4 | 72.1 | 83.3 | 72.4 | 77.9 | 85.2 | 78.2 | 81.5 |
Our Method (ResNet-50 with BA) | 89.8 | 86.8 | 88.1 | 78.7 | 69.8 | 73.4 | 86.1 | 75.7 | 80.9 | 88.2 | 79.9 | 83.5 |
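In each dataset group of the table above, F is the harmonic mean of precision and recall, F = 2PR/(P + R). A quick check against a few ICDAR 2015 rows, with all values in percent:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F) on ICDAR 2015, copied from the table above.
rows = {
    "CTPN": (51.6, 74.2, 60.9),
    "PSENet": (88.7, 85.5, 87.1),
    "PMTD": (91.3, 87.4, 89.3),
}
for name, (p, r, reported) in rows.items():
    # Each computed value matches the reported F to one decimal place.
    print(f"{name}: computed {f_measure(p, r):.1f}, reported {reported}")
```

Because the harmonic mean is dominated by the smaller of P and R, a method with very unbalanced precision and recall (CTPN here) scores well below the arithmetic mean of the two.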

Method | ICDAR 2015 F (%) | ICDAR 2017 F (%) | ICDAR 2019 F (%) | Total-Text F (%) | FPS |
---|---|---|---|---|---|
CTPN [1] | 60.9 | - | - | - | 7.5 |
EAST [4] | 76.4 | - | - | - | 17.1 |
SegLink [8] | 75.0 | - | - | - | 12.2 |
TextBoxes++ [7] | 81.7 | - | - | - | 13.2 |
R2CNN [3] | 82.5 | - | - | - | - |
PixelLink [9] | 83.7 | - | - | - | - |
TextSnake [21] | 82.6 | - | - | 78.4 | 12.7 |
PSENet [22] | 87.1 | 72.1 | - | 80.9 | 9.6 |
SPCNET [20] | 87.2 | 70.0 | - | 82.9 | - |
Pixel-Anchor [33] | 87.7 | 68.1 | - | - | - |
PMTD [34] | 89.3 | 78.5 | 82.5 | - | - |
CRAFT [35] | 86.9 | 73.9 | 70.9 | 83.6 | 11.2 |
LOMO [19] | 87.2 | 68.5 | 83.6 | 83.3 | - |
Our method (ResNet-34 without BA) | 83.2 | 67.6 | 72.5 | 78.9 | 26.1 |
Our method (ResNet-34 with BA) | 84.5 | 68.9 | 75.4 | 80.1 | 25.2 |
Our method (ResNet-50 without BA) | 86.0 | 72.1 | 77.9 | 81.5 | 18.7 |
Our method (ResNet-50 with BA) | 88.1 | 73.4 | 80.9 | 83.5 | 17.5 |
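Frame rates are easier to compare as average per-image latency, t = 1000 / FPS milliseconds. Using the FPS values reported in the table above for the proposed method, the cost of border augmentation (BA) can be worked out as follows (the hardware and input size behind these FPS figures are not stated here, so only the relative difference is meaningful):

```python
def latency_ms(fps):
    """Average per-image processing time in milliseconds."""
    return 1000.0 / fps

# (FPS without BA, FPS with BA), copied from the table above.
variants = {
    "ResNet-34": (26.1, 25.2),
    "ResNet-50": (18.7, 17.5),
}
for backbone, (fps_plain, fps_ba) in variants.items():
    extra = latency_ms(fps_ba) - latency_ms(fps_plain)
    print(f"{backbone}: {latency_ms(fps_plain):.1f} ms -> {latency_ms(fps_ba):.1f} ms (+{extra:.1f} ms)")
# ResNet-34: 38.3 ms -> 39.7 ms (+1.4 ms)
# ResNet-50: 53.5 ms -> 57.1 ms (+3.7 ms)
```

On these numbers, BA adds roughly 1.4 ms per image with ResNet-34 and 3.7 ms with ResNet-50, in exchange for the F-measure gains listed in the table.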
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Kobchaisawat, T.; Chalidabhongse, T.H.; Satoh, S. Scene Text Detection with Polygon Offsetting and Border Augmentation. Electronics 2020, 9, 117. https://doi.org/10.3390/electronics9010117
Chicago/Turabian style: Kobchaisawat, Thananop, Thanarat H. Chalidabhongse, and Shin’ichi Satoh. 2020. "Scene Text Detection with Polygon Offsetting and Border Augmentation" Electronics 9, no. 1: 117. https://doi.org/10.3390/electronics9010117