Article

A Two-Stage End-to-End Framework for Robust Scene Text Spotting with Self-Calibrated Detection and Contextual Recognition

1 School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao 999078, China
2 AI and Cyber Futures Institute, Charles Sturt University, Orange, NSW 2800, Australia
3 Rural Health Research Institute, Charles Sturt University, Orange, NSW 2800, Australia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4594; https://doi.org/10.3390/electronics14234594
Submission received: 4 September 2025 / Revised: 18 November 2025 / Accepted: 18 November 2025 / Published: 23 November 2025

Abstract

End-to-end scene text detection and recognition, which involves detecting and recognizing text in natural images, still faces significant challenges, particularly in handling text of arbitrary shapes, complex backgrounds, and computational efficiency requirements. This paper proposes a novel and viable end-to-end OCR framework that synergistically combines a powerful detection network with advanced recognition models. For text detection, we develop a method called Text Contrast Self-Calibrated Network (TextCSCN), which employs pixel-wise supervised contrastive learning to extract more discriminative features. TextCSCN addresses long-range dependency modeling and limited receptive field issues through self-calibrated convolutions and Global Convolutional Networks (GCNs). We further introduce an efficient Mamba-based bidirectional module for boundary refinement, enhancing both accuracy and speed. For text recognition, our framework employs a Swin Transformer backbone with Bidirectional Feature Pyramid Networks (BiFPNs) for optimized multi-scale feature extraction. We propose a Pre-Gated Contextual Attention Gate (PCAG) mechanism to effectively fuse visual and linguistic information while minimizing noise and uncertainty in multi-modal integration. Experiments on challenging benchmarks including TotalText and CTW1500 demonstrate the effectiveness of our approach. Our detection module achieves state-of-the-art performance with an F-score of 88.21% on TotalText, and the complete end-to-end system shows comparable improvements in recognition accuracy, establishing new benchmarks for scene text spotting.
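To give a concrete picture of the pixel-wise supervised contrastive learning mentioned above, the following PyTorch sketch shows one common way such a loss can be set up over per-pixel embeddings. It is an illustrative assumption, not the authors' TextCSCN implementation: the function name, the binary text/background labels, and the random pixel subsampling are placeholders chosen for clarity.

import torch
import torch.nn.functional as F

def pixelwise_supcon_loss(features, labels, temperature=0.1, max_pixels=1024):
    """Supervised contrastive loss computed over per-pixel embeddings.

    features: (N, C, H, W) feature map from a detection backbone.
    labels:   (N, H, W) integer map, e.g. 1 for text pixels, 0 for background.
    """
    n, c, h, w = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, c)   # (N*H*W, C)
    labs = labels.reshape(-1)                              # (N*H*W,)

    # Randomly subsample pixels so the pairwise similarity matrix stays small.
    idx = torch.randperm(feats.size(0), device=feats.device)[:max_pixels]
    feats = F.normalize(feats[idx], dim=1)
    labs = labs[idx]

    # Cosine similarities between every pair of sampled pixel embeddings.
    sim = feats @ feats.t() / temperature
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, -1e4)  # exclude self-pairs

    # Positives are the other sampled pixels that share the anchor's label.
    pos_mask = (labs.unsqueeze(0) == labs.unsqueeze(1)) & ~self_mask

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss.mean()

The intent, as described in the abstract, is to pull embeddings of text pixels together while pushing them away from background, yielding more discriminative detection features; the exact sampling scheme, loss weighting, and how the objective combines with the rest of TextCSCN are specified in the paper itself and are not assumed here.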
Keywords: scene text spotting; end-to-end OCR; text detection; supervised contrastive learning; Global Convolutional Network; Swin Transformer; feature fusion; attention mechanism

Share and Cite

MDPI and ACS Style

Cheng, Y.; Huang, J.; Tai, I.S.; Mondal, S.K.; Wang, T.; Kabir, H.M.D. A Two-Stage End-to-End Framework for Robust Scene Text Spotting with Self-Calibrated Detection and Contextual Recognition. Electronics 2025, 14, 4594. https://doi.org/10.3390/electronics14234594

AMA Style

Cheng Y, Huang J, Tai IS, Mondal SK, Wang T, Kabir HMD. A Two-Stage End-to-End Framework for Robust Scene Text Spotting with Self-Calibrated Detection and Contextual Recognition. Electronics. 2025; 14(23):4594. https://doi.org/10.3390/electronics14234594

Chicago/Turabian Style

Cheng, Yuning, Jinhong Huang, Io San Tai, Subrota Kumar Mondal, Tianqi Wang, and Hussain Mohammed Dipu Kabir. 2025. "A Two-Stage End-to-End Framework for Robust Scene Text Spotting with Self-Calibrated Detection and Contextual Recognition" Electronics 14, no. 23: 4594. https://doi.org/10.3390/electronics14234594

APA Style

Cheng, Y., Huang, J., Tai, I. S., Mondal, S. K., Wang, T., & Kabir, H. M. D. (2025). A Two-Stage End-to-End Framework for Robust Scene Text Spotting with Self-Calibrated Detection and Contextual Recognition. Electronics, 14(23), 4594. https://doi.org/10.3390/electronics14234594

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
