This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Review

From Detection to Understanding: A Systematic Survey of Deep Learning for Scene Text Processing

1 School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China
2 Library of Xinjiang Normal University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9247; https://doi.org/10.3390/app15179247
Submission received: 5 July 2025 / Revised: 8 August 2025 / Accepted: 9 August 2025 / Published: 22 August 2025

Abstract

Scene text understanding, a cornerstone technology for autonomous navigation, document digitization, and accessibility tools, has shifted from traditional methods that rely on handcrafted features and multi-stage processing pipelines to deep learning frameworks that learn hierarchical representations directly from raw image inputs. This survey categorizes modern scene text recognition (STR) methodologies into three principal paradigms: two-stage detection frameworks that employ region proposal networks for precise text localization, single-stage detectors designed to optimize computational efficiency, and specialized architectures that handle arbitrarily shaped text through geometry-aware modeling. Concurrently, an in-depth analysis of text recognition paradigms traces the evolution from connectionist temporal classification (CTC) and sequence-to-sequence models to transformer-based architectures, which excel at contextual modeling and demonstrate superior performance. In contrast to prior surveys, this work makes several contributions. First, it provides a comprehensive and systematic taxonomy of STR methods, explicitly highlighting the trade-offs between detection accuracy, computational efficiency, and geometric adaptability across the different paradigms. Second, it examines text recognition in detail, illustrating how transformer-based models capture long-range dependencies and contextual information and thereby address the challenges of recognizing complex text layouts and multilingual scripts. Furthermore, the survey explores critical research frontiers, such as multilingual text adaptation, model robustness against environmental variations (e.g., lighting conditions, occlusions), and data-efficient learning strategies that reduce the dependency on large-scale annotated datasets. By synthesizing insights from technical advancements across 28 benchmark datasets and standardized evaluation protocols, this study offers researchers a holistic perspective on the current state of the art, persistent challenges, and promising avenues for future research, with the ultimate goal of achieving human-level scene text comprehension.
Keywords: scene text detection; scene text recognition; scene text spotting; transformer; evaluation metrics; datasets

Share and Cite

MDPI and ACS Style

Liu, Z.; Song, R.; Li, K.; Li, Y. From Detection to Understanding: A Systematic Survey of Deep Learning for Scene Text Processing. Appl. Sci. 2025, 15, 9247. https://doi.org/10.3390/app15179247

AMA Style

Liu Z, Song R, Li K, Li Y. From Detection to Understanding: A Systematic Survey of Deep Learning for Scene Text Processing. Applied Sciences. 2025; 15(17):9247. https://doi.org/10.3390/app15179247

Chicago/Turabian Style

Liu, Zhandong, Ruixia Song, Ke Li, and Yong Li. 2025. "From Detection to Understanding: A Systematic Survey of Deep Learning for Scene Text Processing" Applied Sciences 15, no. 17: 9247. https://doi.org/10.3390/app15179247

APA Style

Liu, Z., Song, R., Li, K., & Li, Y. (2025). From Detection to Understanding: A Systematic Survey of Deep Learning for Scene Text Processing. Applied Sciences, 15(17), 9247. https://doi.org/10.3390/app15179247
