R-YOLO: A Real-Time Text Detector for Natural Scenes with Arbitrary Rotation
Abstract
1. Introduction
- A novel framework is developed to detect scene text in arbitrary orientations using a one-stage strategy, in which a fully convolutional network (FCN) generates inclined bounding boxes for text directly, thereby avoiding the redundant and time-consuming intermediate steps adopted in existing methods. An anchor box carrying rotation-angle information replaces the traditional axis-aligned anchor box so that text can be detected at any rotation angle. A new algorithm, RDIoU-NMS, is proposed to replace the traditional IoU-NMS algorithm.
- A fourth scale is added to the YOLOv4 architecture to improve the detection of small text in natural scenes.
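To make the idea of an inclined bounding box concrete, the following minimal Python sketch converts a rotated box into its four corner points. The `(cx, cy, w, h, angle)` parameterization, the counter-clockwise angle convention, and the function name `rotated_box_corners` are assumptions chosen for illustration; they are not taken from the paper's own definition in Section 3.2.

```python
import math
from typing import List, Tuple


def rotated_box_corners(
    cx: float, cy: float, w: float, h: float, angle_deg: float
) -> List[Tuple[float, float]]:
    """Return the four corners of a box centred at (cx, cy) and rotated by angle_deg.

    Hypothetical encoding: the angle is measured counter-clockwise in a
    standard x-right, y-up frame. The paper may use a different convention.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets of the unrotated box, relative to its centre.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Apply a 2D rotation to each offset, then translate by the centre.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]
```

With an angle of zero the function reduces to an ordinary axis-aligned box, which is why a rotation-aware anchor can be seen as a generalization of the traditional anchor.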
2. Related Work
3. Proposed Method
3.1. Architecture of R-YOLO
3.2. Inclined Bounding Box Representation
3.3. Rotation Anchor Box
3.4. RDIoU-NMS
- Step 1: Sort the confidence scores in list C in descending order and reorder the bounding boxes in list B to match the sorted list C.
- Step 2: Take the inclined bounding box with the highest confidence as the target for comparison, delete it from list B, and add it to list D (D is initially empty). Calculate the RDIoU between the target box and each remaining box in list B.
- Step 3: If the RDIoU of a remaining box is larger than the threshold Nt, delete that bounding box from list B.
- Step 4: Take the box with the highest confidence among those remaining in list B as the new target and repeat Steps 2 and 3 until no bounding boxes are left in list B.
Algorithm 1 Calculate RDIoU-NMS

Input: B = {b1, b2, …, bN}, C = {c1, c2, …, cN}, Nt, where B is the list of initial detection rotation boxes, C contains the corresponding detection confidences, and Nt is the NMS threshold.

Output: D, S, where D and S are the list of final prediction bounding boxes and the corresponding confidences, respectively.

1. Begin
2. D ← {}, S ← {}
3. While B ≠ empty do
4. m ← argmax C
5. M ← bm, T ← cm
6. D ← D ∪ M, B ← B – M
7. S ← S ∪ T, C ← C – T
8. for bi in B do
9. if RDIoU (M, bi) ≥ Nt then
10. B ← B – bi, C ← C – ci
11. end
12. end
13. return D, S
14. end
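The steps and Algorithm 1 above can be sketched in Python as a greedy suppression loop. This is a minimal sketch, not the authors' implementation: the RDIoU formula is not reproduced in this section, so the overlap measure is passed in as a function, and the `axis_aligned_iou` helper below is only a placeholder for testing the loop. The names `rotated_nms`, `overlap`, and `axis_aligned_iou` are hypothetical. The suppression test uses `≥ Nt`, following line 9 of Algorithm 1.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, ...]  # hypothetical encoding; `overlap` decides how to read it


def rotated_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    threshold: float,
    overlap: Callable[[Box, Box], float],
) -> Tuple[List[Box], List[float]]:
    """Greedy NMS in the shape of Algorithm 1; `overlap` stands in for RDIoU."""
    # Step 1: order candidate indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept_boxes: List[Box] = []     # list D
    kept_scores: List[float] = []  # list S
    while order:
        # Step 2: move the most confident remaining box from B to D.
        best = order.pop(0)
        kept_boxes.append(boxes[best])
        kept_scores.append(scores[best])
        # Step 3: keep only boxes whose overlap with the target is below Nt.
        order = [i for i in order if overlap(boxes[best], boxes[i]) < threshold]
    return kept_boxes, kept_scores


def axis_aligned_iou(a: Box, b: Box) -> float:
    """Placeholder overlap for (x1, y1, x2, y2) boxes. This is NOT RDIoU."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Passing the overlap measure as an argument keeps the loop independent of the box encoding, so a rotation-aware measure can be substituted without touching the suppression logic.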
3.5. Learning of Text Detection
4. Experiments
4.1. Benchmark Datasets
4.2. Implementation Details
4.3. Evaluation on Oriented Text Benchmark
4.4. Evaluation on Long Text Benchmark
4.5. Evaluation on Horizontal Text Benchmark
4.6. Evaluation on Multi-Lingual Text Benchmark
4.7. Analysis and Discussion
4.8. Limitations of the Proposed Algorithm
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- [1] Neumann, L.; Matas, J. Scene text localization and recognition with oriented stroke detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 3–6 December 2013; pp. 97–104.
- [2] Pan, Y.; Hou, X.; Liu, C. A hybrid approach to detect and localize texts in natural scene images. IEEE Trans. Image Process. 2011, 20, 800–813.
- [3] Yin, X.; Yin, X.; Huang, K.; Hao, H. Robust text detection in natural scene images. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 970–983.
- [4] Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 2963–2970.
- [5] Huang, W.; Qiao, Y.; Tang, X. Robust scene text detection with convolution neural network induced MSER trees. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 497–511.
- [6] Huang, W.; Lin, Z.; Yang, J.; Wang, J. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 97–104.
- [7] Zhong, Z.; Jin, L.; Huang, S.; Feng, Z. DeepText: A new approach for text proposal generation and text detection in natural images. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 1208–1212.
- [8] Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A fast text detector with a single deep neural network. arXiv 2016, arXiv:1611.06779.
- [9] Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. arXiv 2016, arXiv:1609.03605v1.
- [10] Tian, S.; Pan, Y.; Huang, C.; Lu, S.; Yu, K.; Tan, C. Text flow: A unified text detection system in natural scene images. arXiv 2016, arXiv:1604.06877v1.
- [11] Zhang, Z.; Shen, W.; Yao, C.; Bai, X. Symmetry-based text line detection in natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2558–2567.
- [12] Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324.
- [13] Sun, L.; Huo, Q.; Jia, W. A robust approach for text detection from natural scene images. Pattern Recognit. 2015, 48, 2906–2920.
- [14] Deng, D.; Liu, H.; Li, X.; Cai, D. Pixellink: Detecting scene text via instance segmentation. In Proceedings of the 32nd AAAI Conference Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6773–6780.
- [15] Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; Bai, X. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4159–4167.
- [16] Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 20–36.
- [17] Yang, Q.; Cheng, M.; Zhou, W.; Chen, Y.; Qiu, M.; Lin, W. Inceptext: A new inception-text module with deformable PSROI pooling for multi-oriented scene text detection. arXiv 2018, arXiv:1805.01167v2.
- [18] Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2550–2558.
- [19] Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560.
- [20] Liao, M.; Zhu, Z.; Shi, B.; Xia, G.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5909–5918.
- [21] Zhang, C.; Liang, B.; Huang, Z.; En, M.; Han, J.; Ding, E.; Ding, X. Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 10552–10561.
- [22] Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for arbitrarily-oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3610–3615.
- [23] Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- [24] Liao, M.; Shi, B.; Bai, X. TextBoxes++: A single-shot oriented scene text detector. IEEE Trans. Image Process. 2018, 27, 3676–3690.
- [25] He, P.; Huang, W.; He, T.; Zhu, Q.; Qiao, Y.; Li, X. Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3047–3055.
- [26] Bochkovskiy, A.; Wang, C. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934v1.
- [27] Lin, H.; Yang, P.; Zhang, F. Review of scene text detection and recognition. Arch. Comput. Methods Eng. 2019, 27, 433–454.
- [28] Brisinello, M.; Grbić, R.; Vranješ, M.; Vranješ, D. Review on text detection methods on scene images. In Proceedings of the 2019 International Symposium ELMAR, Zadar, Croatia, 23–25 September 2019; pp. 51–56.
- [29] Raisi, Z.; Naiel, M.A.; Fieguth, P.; Wardell, S.; Zelek, J. Text detection and recognition in the wild: A review. arXiv 2020, arXiv:2006.04305v2.
- [30] Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.; Mestre, S.; Mas, J.; Mota, D.F. ICDAR 2013 robust reading competition. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, 25–28 August 2013; pp. 1484–1493.
- [31] Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR2015 competition on robust reading. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Oradea, Romania, 11–12 June 2015; pp. 1156–1160.
- [32] Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
- [33] Ye, J.; Chen, Z.; Liu, J.; Du, B. TextFuseNet: Scene Text Detection with Richer Fused Features. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 7–15 January 2021.
- [34] Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
- [35] He, W.; Zhang, X.; Yin, F.; Luo, Z.; Ogier, J.; Liu, C. Realtime multi-scale scene text detection with scale-based region proposal network. Pattern Recognit. 2020, 98, 107026.
- [36] Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- [37] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
- [38] Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 1083–1090.
- [39] Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 13–15 November 2017; pp. 1454–1459.
- [40] He, W.; Zhang, X.; Yin, F.; Liu, C. Deep direct regression for multi-oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 745–753.
- [41] Wang, Y.; Xie, H.; Fu, Z.; Zhang, Y. DSRN: A deep scale relationship network for scene text detection. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 947–953.
- [42] Liu, Y.; Jin, L. Deep matching prior network: Toward tighter multi-oriented text detection. arXiv 2018, arXiv:1703.01425v1.
- [43] Liu, F.; Chen, C.; Gu, D.; Zheng, J. FTPN: Scene text detection with feature pyramid based text proposal network. IEEE Access 2019, 7, 44219–44228.
- [44] Liu, X.; Liang, D.; Yan, S.; Chen, D.; Qiao, Y.; Yan, J. FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5676–5685.
- [45] Lyu, P.; Yao, C.; Wu, W.; Yan, S.; Bai, X. Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7553–7563.
- [46] Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9365–9374.
- [47] Xu, Y.; Duan, J.; Kuang, Z.; Yue, X.; Sun, H.; Guan, Y.; Zhang, W. Geometry Normalization Networks for Accurate Scene Text Detection. arXiv 2019, arXiv:1909.00794v1.
- [48] Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time Scene Text Detection with Differentiable Binarization. arXiv 2019, arXiv:1911.08947v2.

Method | OS | R [%] | P [%] | F [%] | Device | FPS |
---|---|---|---|---|---|---|
DMPNet [42] | | 68.2 | 73.2 | 70.6 | - | - |
FTPN [43] | | 68.2 | 78.0 | 72.8 | - | - |
RRPN [23] | | 77.0 | 84.0 | 80.0 | Titan X | 4.70 |
R2CNN [22] | | 79.6 | 85.6 | 82.5 | Tesla K80 | 2.20 |
SRPN+VGGDet [35] | | 79.7 | 92.0 | 85.4 | Titan Xp | 16.5 |
SRPN+SRPNDet [35] | | 74.8 | 85.2 | 79.6 | Titan Xp | 35.1 |
TextFuseNet [33] | | 89.7 | 94.7 | 92.1 | Tesla V100 | 4.10 |
SegLink [18] | ✓ | 76.8 | 73.1 | 75.0 | - | - |
He et al. [25] | ✓ | 73.0 | 80.0 | 77.0 | - | - |
EAST [19] | ✓ | 78.3 | 83.3 | 80.7 | Titan X | 16.8 |
He et al. [40] | ✓ | 80.0 | 82.0 | 81.0 | Titan X | 1.10 |
DSRN [41] | ✓ | 79.6 | 83.2 | 81.4 | Titan X | 8.80 |
TextBoxes++ [24] | ✓ | 76.7 | 87.2 | 81.7 | Titan Xp | 11.6 |
RRD [20] | ✓ | 79.0 | 85.6 | 82.2 | Titan Xp | 6.50 |
R-YOLO | ✓ | 78.2 | 87.0 | 82.3 | RTX 3090 | 62.5 |
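
The tables do not state how the F column is obtained. Assuming it is the usual F-measure, that is, the harmonic mean of precision P and recall R, the EAST [19] row above can be checked as a worked example:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 83.3 \times 78.3}{83.3 + 78.3}
  = \frac{13044.78}{161.6}
  \approx 80.7
```

The result agrees with the 80.7 reported for EAST, which supports reading F as the harmonic mean, although individual rows may differ in the last digit because of rounding in the published R and P values.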

Method | OS | R [%] | P [%] | F [%] | FPS |
---|---|---|---|---|---|
RRPN [23] | | 69.0 | 82.0 | 75.0 | 3.3 |
SRPN+SRPNDet [35] | | 70.8 | 83.6 | 76.7 | 28.1 |
SRPN+VGGDet [35] | | 77.0 | 84.9 | 80.7 | 14.6 |
He et al. [40] | ✓ | 70.0 | 77.0 | 74.0 | - |
EAST [19] | ✓ | 67.4 | 87.3 | 76.1 | - |
SegLink [18] | ✓ | 70.0 | 86.0 | 77.0 | 8.9 |
TextSnake [16] | | 73.9 | 83.2 | 78.3 | 1.1 |
DSRN [41] | ✓ | 71.2 | 87.6 | 78.5 | 13.3 |
RRD [20] | ✓ | 73.0 | 87.0 | 79.0 | 10.0 |
R-YOLO (768 × 768) | ✓ | 76.5 | 88.3 | 82.0 | 40.0 |
R-YOLO (416 × 416) | ✓ | 79.9 | 90.9 | 85.0 | 80.3 |
R-YOLO (256 × 256) | ✓ | 71.6 | 88.6 | 79.2 | 95.2 |
R-YOLO (512 × 512) | ✓ | 81.9 | 90.2 | 85.8 | 66.6 |

Method | OS | R [%] | P [%] | F [%] | FPS |
---|---|---|---|---|---|
Faster R-CNN [34] | | 71.0 | 75.0 | 73.0 | - |
RRPN [23] | | 72.0 | 90.0 | 80.0 | - |
SRPN+VGGDet [35] | | 84.2 | 92.5 | 88.2 | 20.9 |
SRPN+SRPNDet [35] | | 83.3 | 86.4 | 84.8 | 30.5 |
TextFuseNet [33] | | 92.3 | 96.5 | 94.3 | 4.00 |
SSD [37] | ✓ | 60.0 | 80.0 | 68.0 | - |
TextBoxes++ [24] | ✓ | 74.0 | 86.0 | 80.0 | - |
YOLOv4 [26] | ✓ | 71.5 | 91.0 | 80.1 | 47.2 |
R-YOLO | ✓ | 82.9 | 90.1 | 86.4 | 47.0 |

Method | RN | OS | R [%] | P [%] | F [%] | FPS |
---|---|---|---|---|---|---|
FOTS [44] | | | 81.8 | 62.3 | 70.8 | 23.9 |
Lyu et al. [45] | | | 74.3 | 70.6 | 72.4 | - |
LOMO [21] | | | 67.2 | 80.2 | 73.1 | - |
CRAFT [46] | | | 68.2 | 80.6 | 73.9 | 8.60 |
GNNets [47] | | | 70.1 | 79.6 | 74.5 | - |
DB-ResNet-18 [48] | | | 63.8 | 81.9 | 71.7 | 41.0 |
DB-ResNet-50 [48] | | | 67.9 | 83.1 | 74.7 | 19.0 |
R-YOLO-RIoU | | ✓ | 66.3 | 78.0 | 71.7 | 67.5 |
R-YOLO-3 | ✓ | ✓ | 69.5 | 76.7 | 72.9 | 71.2 |
R-YOLO-4 | ✓ | ✓ | 71.7 | 77.1 | 74.3 | 67.6 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, X.; Zheng, S.; Zhang, C.; Li, R.; Gui, L. R-YOLO: A Real-Time Text Detector for Natural Scenes with Arbitrary Rotation. Sensors 2021, 21, 888. https://doi.org/10.3390/s21030888