Trademark Text Recognition Combining SwinTransformer and Feature-Query Mechanisms
Abstract
1. Introduction
- A novel feature-extraction network based on EFPN-SwinTransformer is proposed. With SwinTransformer as the backbone, the self-attention mechanism and the enhanced feature pyramid module efficiently capture global information in trademark images. This yields more accurate and expressive feature representations for subsequent text extraction, improving both the accuracy and the efficiency of text extraction.
- A novel feature-point retrieval algorithm based on corner detection is designed. An OTSU-based FAST corner detector is presented to generate a corner map: OTSU's method selects an adaptive threshold, and corner detection is then performed with that threshold, achieving efficient and accurate corner identification. In addition, a corner-based feature-point retrieval mechanism is introduced in the encoding phase to prioritise key-point regions, eliminating character-to-character lines and suppressing background interference.
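The OTSU-FAST detector described above can be illustrated with a minimal sketch. The paper does not spell out the exact coupling here, so this sketch assumes the Otsu threshold is used as the FAST intensity threshold `t`, with the common FAST-9 contiguity test; all function names are illustrative, not from the paper:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's method: pick the 8-bit threshold that maximises
    between-class variance of the grayscale histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0 = cum0 = 0.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        cum0 += t * hist[t]
        m0 = cum0 / w0                      # mean of class 0
        m1 = (sum_all - cum0) / (total - w0)  # mean of class 1
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Standard (dy, dx) offsets of the radius-3 Bresenham circle used by FAST.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]

def fast_corners(img, thresh, n_contig=9):
    """FAST-style corner map: a pixel is a corner if >= n_contig contiguous
    circle pixels are all brighter (or all darker) than the centre by thresh."""
    H, W = img.shape
    corners = np.zeros((H, W), dtype=bool)
    f = img.astype(np.int32)
    for y in range(3, H - 3):
        for x in range(3, W - 3):
            p = f[y, x]
            ring = np.array([f[y + dy, x + dx] for dy, dx in CIRCLE])
            for mask in (ring > p + thresh, ring < p - thresh):
                doubled = np.concatenate([mask, mask])  # wrap-around runs
                run = best = 0
                for v in doubled:
                    run = run + 1 if v else 0
                    best = max(best, run)
                if best >= n_contig:
                    corners[y, x] = True
                    break
    return corners

# Usage: a bright square on a dark background; its corner should fire.
img = np.zeros((20, 20), dtype=np.uint8)
img[8:, 8:] = 200
corner_map = fast_corners(img, otsu_threshold(img) or 50)
```

The point of the adaptive threshold is that a fixed FAST threshold tuned for one trademark image fails on another with different contrast; Otsu's split adapts `t` to each image's intensity distribution.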
2. Methodology
2.1. Feature Extraction Based on EFPN-SwinTransformer
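As an illustrative sketch of the shifted-window mechanism underlying the Swin backbone (the EFPN details and these function names are not from the paper; the 8×8 example is hypothetical):

```python
import numpy as np

def window_partition(x, window_size):
    """Split a feature map (H, W, C) into non-overlapping windows of shape
    (num_windows, window_size, window_size, C) — the layout on which
    Swin's window-based self-attention is computed."""
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shift_windows(x, shift):
    """Cyclic shift applied between successive Swin blocks so that
    attention windows overlap across layers and information flows
    between neighbouring windows."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

# Example: an 8x8 feature map with 3 channels, partitioned into 4x4 windows.
feat = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
wins = window_partition(feat, 4)                    # 4 windows of 4x4x3
shifted_wins = window_partition(shift_windows(feat, 2), 4)
```

Restricting self-attention to such windows keeps the cost linear in image size, while the alternating shift recovers cross-window context — the property that lets the backbone capture global structure in trademark images.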
2.2. Corner Detection Based on Feature-Query Mechanism
Algorithm 1: Corner-detection algorithm based on the Feature-Query mechanism.
3. Experiments
3.1. Datasets and Evaluation Criteria
3.2. Comparative Experiments
3.3. Ablation Experiments
3.4. Performance of Text Recognition
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhang, C.; Yuefeng, T.; Du, K.; Ding, W.; Wang, B.; Liu, J.; Wang, W. Character-level Street View Text Spotting Based on Deep Multi-Segmentation Network for Smarter Autonomous Driving. IEEE Trans. Artif. Intell. 2021, 3, 297–308. [Google Scholar] [CrossRef]
- Rong, X.; Li, B.; Muñoz, J.; Xiao, J.; Arditi, A.; Tian, Y. Guided Text Spotting for Assistive Blind Navigation in Unfamiliar Indoor Environments. In Proceedings of the Advances in Visual Computing: 12th International Symposium, ISVC 2016, Las Vegas, NV, USA, 12–14 December 2016; Volume 10073, pp. 11–22. [Google Scholar] [CrossRef]
- Wang, H.C.; Finn, C.; Paull, L.; Kaess, M.; Rosenholtz, R.; Teller, S.; Leonard, J. Bridging text spotting and SLAM with junction features. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; pp. 3701–3708. [Google Scholar] [CrossRef]
- Wang, J.; Liu, C.; Jin, L.; Tang, G.; Zhang, J.; Zhang, S.; Wang, Q.; Wu, Y.; Cai, M. Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2738–2745. [Google Scholar] [CrossRef]
- Zhang, P.; Xu, Y.; Cheng, Z.; Pu, S.; Lu, J.; Qiao, L.; Niu, Y.; Wu, F. TRIE: End-to-End Text Reading and Information Extraction for Document Understanding. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1413–1422. [Google Scholar] [CrossRef]
- Wu, J.; Du, J.; Wang, F.; Yang, C.; Jiang, X.; Hu, J.; Yin, B.; Zhang, J.; Dai, L. A Multimodal Attention Fusion Network with a Dynamic Vocabulary for TextVQA. Pattern Recognit. 2021, 122, 108214. [Google Scholar] [CrossRef]
- Yang, Z.; Lu, Y.; Wang, J.; Yin, X.; Florencio, D.; Wang, L.; Zhang, C.; Zhang, L.; Luo, J. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 19–25 June 2021; pp. 8747–8757. [Google Scholar] [CrossRef]
- Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; Rohrbach, M. Towards VQA Models That Can Read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8309–8318. [Google Scholar] [CrossRef]
- Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust Scene Text Recognition with Automatic Rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4168–4176. [Google Scholar] [CrossRef]
- Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048. [Google Scholar] [CrossRef] [PubMed]
- Sheng, F.; Chen, Z.; Xu, B. NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 781–786. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. SwinTransformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. arXiv 2014, arXiv:1406.2227. [Google Scholar]
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic Data for Text Localisation in Natural Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar] [CrossRef]
- Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar] [CrossRef]
- Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 2014, 41, 8027–8048. [Google Scholar] [CrossRef]
- Li, H.; Wang, P.; Shen, C.; Zhang, G. Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8610–8617. [Google Scholar] [CrossRef]
- Lee, J.; Park, S.; Baek, J.; Oh, S.J.; Kim, S.; Lee, H. On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 2326–2335. [Google Scholar] [CrossRef]
- Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7094–7103. [Google Scholar] [CrossRef]
- Bhunia, A.; Sain, A.; Kumar, A.; Ghose, S.; Chowdhury, P.; Song, Y.Z. Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14920–14929. [Google Scholar] [CrossRef]
- Zhang, X.; Zhu, B.; Yao, X.; Sun, Q.; Li, R.; Yu, B. Context-Based Contrastive Learning for Scene Text Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 3353–3361. [Google Scholar] [CrossRef]
- Da, C.; Wang, P.; Yao, C. Levenshtein OCR. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 322–338. [Google Scholar] [CrossRef]
| Method | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| CRNN [9] | 59.43 | 84.34 | 83.55 |
| ASTER [10] | 68.54 | 86.97 | 85.79 |
| NRTR [11] | 70.32 | 87.81 | 86.74 |
| SAR [18] | 74.84 | 89.56 | 89.64 |
| SATRN [19] | 78.93 | 91.22 | 90.96 |
| SwinCornerTR | 84.88 | 93.84 | 94.47 |
| Method | SVT (%) | CUTE (%) |
|---|---|---|
| CRNN [9] | 80.8 | - |
| ASTER [10] | 89.5 | 79.5 |
| NRTR [11] | 91.5 | 80.9 |
| SAR [18] | 84.5 | 83.3 |
| SATRN [19] | 91.3 | 87.8 |
| ABINet [20] | 93.5 | 89.2 |
| JVSR [21] | 92.2 | 89.7 |
| ABINet+ConCLR [22] | 94.3 | 91.3 |
| LevOCR [23] | 92.9 | 91.7 |
| SwinCornerTR | 92.9 | 92.3 |
| Method | SwinTransformer | EFPN | Feature-Query | Accuracy (%) |
|---|---|---|---|---|
| Baseline | - | - | - | 81.3 |
| Baseline+ | ✓ | - | - | 82.6 |
| Baseline+ | ✓ | ✓ | - | 83.1 |
| Baseline+ | - | - | ✓ | 83.8 |
| Baseline+ | ✓ | ✓ | ✓ | 84.8 |
| Corner Detector | Accuracy (%) | Mean Detection Time (ms) |
|---|---|---|
| Harris | 83.9 | 25 |
| FAST | 84.2 | 1 |
| OTSU-FAST | 84.8 | 6 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhou, B.; Wang, X.; Zhou, W.; Li, L. Trademark Text Recognition Combining SwinTransformer and Feature-Query Mechanisms. Electronics 2024, 13, 2814. https://doi.org/10.3390/electronics13142814