1. Introduction
1.1. Background and Motivation
Scene text detection and recognition (STR) are fundamental tasks in computer vision, involving the localization of textual regions within natural images and the subsequent translation of these regions into machine-readable character sequences. These tasks have attracted significant research attention due to their wide-ranging applications and the technical challenges they pose. STR finds diverse applications across multiple domains. In autonomous driving, accurate detection and recognition of traffic signs, street names, and other textual information are crucial for navigation and safety [
1], enabling vehicles to understand and respond appropriately to road-related instructions. In document analysis, STR technologies automate data entry tasks by extracting text from scanned documents or images [
2], greatly improving efficiency and reducing manual errors. It also plays a vital role in assistive technologies for the visually impaired, converting visual text into auditory output to aid their daily lives [
3]. Additionally, STR is applied in content retrieval from images and videos, industrial automation, and smart city infrastructure management.
The significance of STR lies in its ability to bridge the gap between the visual and textual modalities, allowing machines to perceive and understand textual content in real-world environments. However, it faces several challenges due to the inherent variability and complexity of text in natural scenes. Firstly, there are variations in font and style. Text in natural scenes can display a multitude of fonts, sizes, and styles, making it difficult for algorithms to generalize across different text appearances [
4]. For example, a handwritten note in a cluttered image may have a completely different font and style compared to a printed signboard. Secondly, orientation and curvature pose problems. Text instances are often non-horizontally aligned and can be oriented in various directions or even curved, complicating the detection and recognition processes. Although methods such as TextSnake [
4] and ABCNet [
1] have been proposed to handle arbitrarily-shaped text, the challenge remains significant.
Thirdly, occlusion and partial visibility are common issues. Text regions may be partially occluded by objects or other text, resulting in incomplete or fragmented text instances that are challenging to detect and recognize accurately [
5]. For instance, a street sign may be partially blocked by a tree branch, making it difficult for an STR algorithm to correctly identify the text. Fourthly, lighting conditions play a crucial role. Variations in lighting, including shadows, glare, and low-light environments, can significantly affect the visibility and quality of text in images [
6], thereby impacting the performance of STR algorithms. A text in a dimly lit alley may be hard to detect and recognize compared to the same text in well-lit conditions. Finally, complex backgrounds are a major obstacle. Text in natural scenes is often embedded within complex backgrounds, making it difficult to distinguish text from non-text regions [
7]. This necessitates the development of algorithms that are robust against background noise and clutter.
To address these challenges, researchers have increasingly turned to deep learning techniques. Deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated remarkable success in various computer vision tasks, including STR [
2]. These models leverage large-scale annotated datasets and powerful computational resources to learn hierarchical representations of text, enabling them to generalize across diverse text appearances and complex backgrounds.
A comprehensive review of text detection and recognition in the wild was presented in [
8]. This work provided a taxonomy for text detection and recognition up to the year 2020, covering different aspects such as methods, datasets, and challenges. Since then, there have been significant developments in the field, especially in the areas of deep learning-based approaches. The following sections will delve into the evolution of STR techniques, with a particular emphasis on deep learning-based approaches. We will provide a comprehensive survey of the current state-of-the-art in this field, discussing the strengths and limitations of various methods and highlighting future research directions.
1.2. Scope and Contributions
This survey provides a systematic review of deep learning-based scene text detection and recognition (STR) from 2015 to present. While building upon the foundational taxonomy established in [
8], we significantly expand the coverage to include cutting-edge developments from 2020 to 2025. Our contributions are organized into five key aspects that collectively advance the understanding of this rapidly evolving field.
- (1)
Methodological Taxonomy and Classification
We systematically organize modern scene text detection and recognition (STR) approaches into a hierarchical classification framework. This framework includes two-stage detection methods, as exemplified by Faster R-CNN [
9] and Mask R-CNN [
10]; single-stage detection methods such as YOLO [
11] and SSD [
12]; and specialized architectures tailored for arbitrary-shaped text, including TextSnake [
4] and ABCNet [
1]. Furthermore, we provide an in-depth analysis of transformer-based detectors and their self-attention mechanisms, enhancing the understanding of this emerging paradigm.
- (2)
Historical Evolution and Contemporary Advances
Our review adopts a dual temporal perspective to examine the technical progression of STR methods. We trace their development across 28 benchmark datasets, providing a historical context for their evolution. Simultaneously, we incorporate the most recent innovations from 2020 to 2025, enabling readers to appreciate both the evolutionary trajectory and the current state-of-the-art in the field. This comprehensive approach facilitates a deeper understanding of how STR methods have advanced over time.
- (3)
Standardized Evaluation Framework
We conduct a thorough analysis of prevalent benchmark datasets, including ICDAR2015 [
13], Total-Text [
14], and IIIT5k [
15], among others. Based on this analysis, we establish consistent evaluation protocols that systematically organize performance metrics for detection tasks, such as Intersection over Union (IoU), precision, and recall, as well as recognition tasks, including accuracy and edit distance. This framework enables more meaningful cross-method comparisons and facilitates a fair assessment of different STR approaches.
- (4)
Systematic Challenge Assessment
Through rigorous empirical analysis, we identify and dissect three critical challenges that current STR systems face. These challenges encompass real-time processing constraints, the need for few-shot learning in low-resource language scenarios, and adversarial robustness. Our assessment is grounded in comprehensive benchmark evaluations, providing empirical evidence to support our findings. This systematic approach helps to highlight areas where further research is needed to advance the field of STR.
- (5)
Future Research Trajectories
We delineate promising research directions that have the potential to shape the future of STR. These directions include the integration of vision-language models, exemplified by CLIP [
16], the application of self-supervised learning techniques to enhance model generalization, and the development of scalable solutions for low-resource language scenarios. By providing these insights, we aim to guide future investigations in the field and stimulate innovative research that addresses the current limitations of STR systems.
By integrating these contributions, this survey not only serves as a comprehensive reference guide for current STR methodologies but also provides a clear roadmap for future research directions. This dual role benefits both academic researchers seeking to advance the state-of-the-art and industry practitioners working to apply STR techniques in real-world scene text analysis applications.
1.3. Paper Organization
This paper is organized as follows:
Section 1 outlines the background and motivation of scene text detection and recognition (STR), emphasizing its practical significance and technical challenges.
Section 2 establishes the theoretical foundation of STR, including problem formulation and key challenges such as text variability (font, style, orientation), occlusion, lighting conditions, and complex backgrounds, along with traditional solution approaches.
Section 3 systematically reviews deep learning-based text detection methods, covering two-stage/single-stage detectors, arbitrary-shaped text handling architectures, transformer-based approaches, and their corresponding optimization strategies.
Section 4 analyzes deep learning solutions for text recognition, comparing sequence-to-sequence models, CTC-based methods, transformer architectures, and visual-based models, with discussion on character-level vs. word-level recognition and multilingual challenges.
Section 5 provides a methodological taxonomy of text spotting approaches, critically evaluating two-stage pipelines, unified networks, and emerging paradigms leveraging pre-trained language models, with analysis of their respective advantages and implementation challenges.
Section 6 explores cutting-edge research directions: (1) multilingual support through cross-lingual transfer learning and script-aware architectures, (2) robust recognition under real-world constraints (low-resolution, occlusion, distortion), and (3) data-efficient learning paradigms including few-shot learning and self-supervised methods.
Section 7 catalogs benchmark datasets (synthetic/real-world, multilingual) and evaluates standardized metrics for performance assessment.
Section 8 presents comparative analysis of state-of-the-art methods across different benchmarks, identifying performance determinants and methodological limitations.
Section 11 concludes with open challenges and promising research trajectories in STR.
2. Fundamentals of Scene Text Detection and Recognition
As illustrated by the schematic framework in
Figure 1, our research provides a comprehensive and systematic exploration of contemporary methodologies in scene text understanding, adopting three interrelated yet distinct perspectives: Scene Text Detection, Scene Text Recognition, and Scene Text Spotting. Scene Text Detection forms the bedrock of this pipeline, focusing on accurately identifying and localizing text regions within complex natural images. Given the variability in text shapes, sizes, orientations, and cluttered backgrounds, detection algorithms must exhibit robustness and adaptability. These methods, formulated as object detection problems, aim to delineate text boundaries through bounding boxes or pixel-level masks, addressing challenges such as partial occlusion and font diversity. The advent of deep learning has revolutionized this field, with architectures like Faster R-CNN [
9], Mask R-CNN [
10], and fully convolutional networks (FCNs) [
17] setting benchmarks for accuracy across diverse scenarios.
Building upon detected regions, Scene Text Recognition delves into translating localized text into machine-readable formats, transcending mere geometric localization to encompass semantic interpretation. The complexity of this task arises from the need to decipher text amidst potential image degradations, such as poor lighting, low resolution, and diverse language scripts. Sequence-based methodologies, including connectionist temporal classification (CTC) [
18] and attention-based models [
19], have emerged as dominant paradigms. CTC-based approaches, exemplified by CRNN [
20], leverage recurrent neural networks (RNNs) to model sequential dependencies in textual data, while attention-based methods enhance recognition accuracy by focusing on salient image regions during decoding.
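To make the CTC decoding step concrete, the following minimal sketch performs greedy (best-path) CTC decoding of per-frame character probabilities, the simplest decoding scheme used with CRNN-style recognizers. The alphabet and the random probability matrix are illustrative placeholders, not values from any published model.

```python
import numpy as np

# Hypothetical alphabet; index 0 is the CTC "blank" symbol.
ALPHABET = ["<blank>", "a", "b", "c", "t", "e", "x"]

def ctc_greedy_decode(log_probs: np.ndarray) -> str:
    """Best-path CTC decoding: take the argmax per frame,
    collapse repeated symbols, then drop blanks."""
    best_path = log_probs.argmax(axis=1)                 # (T,) frame-wise argmax
    collapsed = [k for i, k in enumerate(best_path)
                 if i == 0 or k != best_path[i - 1]]     # merge consecutive repeats
    chars = [ALPHABET[k] for k in collapsed if k != 0]   # remove blank symbols
    return "".join(chars)

# Toy example: 6 frames over a 7-symbol alphabet (log-probabilities).
T, C = 6, len(ALPHABET)
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, C))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(ctc_greedy_decode(log_probs))
```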
The integration of text detection and recognition into end-to-end (E2E) Scene Text Spotting systems represents a paradigm shift, enabling simultaneous localization and transcription of text within images. These systems eliminate intermediate steps like text region cropping and alignment, reducing error propagation and enhancing computational efficiency. E2E models, such as FOTS [
21], Mask TextSpotter [
22], and transformer-based approaches like SPTS [
23], have demonstrated state-of-the-art performance across benchmarks. By leveraging advanced deep learning architectures like transformers [
24], these models excel in capturing long-range dependencies and improving contextual understanding, pushing the boundaries of scene text detection and recognition. In summary, the problem formulation of scene text detection and recognition revolves around accurately localizing and interpreting textual information in natural images, with deep learning techniques driving significant advancements toward robust and efficient E2E systems capable of handling diverse textual content in complex scenes. These developments underscore the transformative potential of deep learning in computer vision and pave the way for future innovations.
2.1. Key Challenges in Scene Text Detection and Recognition
The task of scene text detection and recognition is inherently challenging due to the variability and complexity of text in natural scenes. These challenges can be categorized into three main areas: multi-oriented and curved text, low resolution and complex backgrounds, and language diversity and font styles.
Multi-oriented and Curved Text: Text in natural scenes often exhibits arbitrary orientations and can be curved, posing a significant challenge for traditional methods that rely on horizontal bounding boxes. These irregular shapes make it difficult to accurately localize and recognize text instances. To address this, advanced deep learning architectures, such as TextSnake [
4] and ABCNet [
1], have been proposed. These models leverage techniques like Bezier curves or adaptive text region representations to accurately capture the geometry of text, regardless of its orientation or curvature.
Low Resolution and Complex Backgrounds: Text in images frequently appears at low resolutions, blurred, or partially occluded, which degrades the quality of visual features available for detection and recognition. Moreover, complex backgrounds with similar colors or textures to text can introduce false positives or negatives. To mitigate these challenges, deep learning models are trained to learn robust features that can discriminate text from the background, even under adverse conditions. Techniques such as data augmentation, multi-scale feature fusion, and attention mechanisms are commonly employed to enhance the model’s ability to handle low-resolution and complex-background scenarios. These strategies help the model focus on relevant textual features while suppressing background noise.
Language Diversity and Font Styles: Scene text encompasses a wide range of languages, scripts, and font styles, each presenting unique challenges for STR systems. Languages with complex character sets, such as Chinese and Japanese, require models capable of recognizing thousands of unique characters. Additionally, diverse font styles and sizes demand models that can generalize across different visual representations of text. To tackle this challenge, deep learning-based STR systems are trained on large-scale datasets that encompass multiple languages and font styles. Furthermore, models like TrOCR [
25] and SVT [
26] leverage pre-trained language models and transformer architectures to capture the semantic and contextual information of text. This enables them to improve recognition accuracy across diverse languages and font styles, even when faced with unfamiliar or complex textual content.
2.2. Traditional Approaches
Before the advent of deep learning, scene text detection and recognition (STR) predominantly relied on traditional computer vision techniques. While these methods demonstrated effectiveness in certain controlled environments, they frequently encountered challenges when confronted with the inherent variability and complexity of text in natural scenes.
Connected-component-based Methods: Connected-component-based methods were pioneering techniques in scene text detection. These approaches operate under the premise that text regions within an image are composed of connected pixels sharing similar properties, such as color or intensity. Notable examples include the Stroke Width Transform (SWT) [
27] and Maximally Stable Extremal Regions (MSER) [
28]. SWT identifies text by analyzing the stroke width consistency of characters, assuming that characters within a text region exhibit uniform stroke widths. Conversely, MSER detects regions that remain stable across varying intensity thresholds, thereby isolating potential text areas.
Despite their success in simple scenarios, these methods often struggle in complex backgrounds or when dealing with multi-oriented and curved text. The fundamental limitation of connected-component analysis (CCA) lies in its inability to adapt to the diverse and unpredictable nature of text in natural scenes. Text in real-world images can vary significantly in font, size, orientation, and curvature, making it difficult for CCA-based methods to maintain consistent performance across different scenarios.
Sliding-window-based Methods: Sliding-window-based methods represent another traditional approach to scene text detection. These methods systematically scan the image using a fixed-size window and apply a classifier to each window to determine the presence of text. As discussed in the work by Wang et al. [
29], this approach can be computationally intensive due to the need to evaluate multiple window sizes and aspect ratios to accommodate the range of possible text scales and orientations.
While sliding-window methods can effectively detect text of varying sizes and orientations, they are prone to a high false positive rate. This is primarily attributed to the challenge of distinguishing text from non-text regions that exhibit similar visual features, particularly in complex backgrounds. Moreover, the fixed-size window restricts their ability to handle text of arbitrary shapes and sizes, limiting their robustness in real-world applications.
In summary, traditional approaches to scene text detection and recognition, while foundational, exhibit significant limitations when applied to natural scenes. The inherent variability and complexity of text in real-world images demand more sophisticated and adaptive techniques. The subsequent rise of deep learning has provided a powerful framework for addressing these challenges, leading to substantial advancements in the field of STR.
3. Deep Learning for Scene Text Detection
In the context of deep learning, numerous methodologies have been devised to address the challenge of detecting text within natural scenes. These methodologies can be systematically organized into several distinct categories, each characterized by its unique approach and corresponding strengths and limitations. The primary categories include two-stage detection methods, boundary point-based methods, single-stage detection methods (comprising regression-based and segmentation-based approaches), as well as hybrid and end-to-end methods.
3.1. Two-Stage Detection Method
Two-stage detection methods have become fundamental in scene text detection, employing a proposal-based framework that achieves precise text localization through sequential region proposal and refinement stages. Building upon successful object detection architectures like Faster R-CNN [
9], these methods first generate potential text regions using either sliding windows or more sophisticated Region Proposal Networks (RPNs). The RPN efficiently scans the image at multiple scales and locations, outputting rectangular bounding boxes with associated objectness scores that indicate the likelihood of text presence.
A significant advancement in this framework came with Ma et al.’s Rotation Region Proposal Network (RRPN) [
30], which addresses a critical limitation of traditional RPNs by generating inclined proposals with orientation information. This innovation enables robust detection of arbitrarily oriented text, particularly valuable for real-world applications where text rarely follows strict horizontal alignment.
The second stage refines these proposals through a Region-of-Interest (RoI) pooling layer that extracts fixed-size features for each proposal. These features feed into parallel classification and regression networks that respectively determine text/non-text labels and adjust bounding box coordinates. For oriented text detection, Ma et al. [
30] extended this approach with their Rotation RoI (RRoI) pooling layer, which properly handles the geometric transformation of inclined proposals. Further enhancing this paradigm, Mask R-CNN [
10] introduced pixel-level mask prediction alongside classification and regression, enabling precise segmentation of irregular text shapes and closely spaced instances.
While two-stage methods achieve exceptional accuracy in complex scenes with cluttered backgrounds and diverse text orientations, their computational complexity and slower inference speeds compared to single-stage approaches remain challenges. The sequential processing pipeline, though effective, inherently requires more computation time. Nevertheless, these methods continue to dominate applications where detection quality outweighs speed considerations, and ongoing research like Ma et al.’s work demonstrates promising directions for efficiency improvements in oriented text detection scenarios.
The advantages and disadvantages of the two-stage-based scene text detection method are systematically analyzed as follows:
- (1)
Advantages
High Accuracy: Two-stage methods achieve exceptional precision in complex scenes with cluttered backgrounds and diverse text orientations, thanks to their sequential proposal-refinement pipeline [
9,
30].
Robustness to Orientation: Innovations like RRPN and RRoI pooling enable robust detection of arbitrarily oriented text, which is critical for real-world applications [
30].
Pixel-Level Segmentation: Mask R-CNN extends the framework with pixel-level mask prediction, allowing precise segmentation of irregular text shapes and closely spaced instances [
10].
- (2)
Disadvantages
Computational Complexity: The sequential processing pipeline inherently requires more computation time, leading to slower inference speeds compared to single-stage approaches [
31].
Speed–Accuracy Trade-off: While highly accurate, these methods are less suitable for real-time applications where speed is prioritized over detection quality [
32].
3.2. Boundary Point-Based Methods for Scene Text Detection
Boundary point-based methods have emerged as a promising approach for scene text detection, particularly in handling arbitrarily shaped text instances within natural scene images. Unlike traditional methods that rely on rectangular or quadrilateral bounding boxes, boundary point-based methods aim to accurately localize text boundaries by predicting a set of key points along the text contours. This approach enables precise detection of text with complex shapes, sizes, and orientations, addressing a critical challenge in scene text detection.
One notable example in this domain is the Adaptive Boundary Proposal Network (TextBPN) proposed by Zhang et al. [
33]. TextBPN generates coarse boundary proposals using a multi-layer dilated convolutional network, which predicts classification maps, distance fields, and direction fields as prior information. These proposals are then refined iteratively through a Graph Convolutional Network (GCN) and a Recurrent Neural Network (RNN) to predict accurate boundary points. By explicitly modeling text boundaries, TextBPN achieves state-of-the-art performance on challenging datasets such as CTW1500 and Total-Text, demonstrating the effectiveness of boundary point-based methods.
Further advancing this approach, Zhang et al. [
34] introduced the Boundary Transformer, which presents a unified coarse-to-fine framework for arbitrary shape text detection. Unlike TextBPN, the Boundary Transformer employs an iterative boundary transformer module to refine coarse proposals, leveraging a transformer encoder-decoder structure to model long-distance dependencies between boundary points. This enables the network to capture global geometric information more effectively, improving detection accuracy, especially for long and curved text instances. Meanwhile, Ye et al. [
35] proposed DPText-DETR, a Dynamic Point Text Detection Transformer network that directly utilizes explicit point coordinates to generate position queries and dynamically updates them. The introduction of an Enhanced Factorized Self-Attention (EFSA) module provides circular shape guidance for point queries, enhancing boundary localization accuracy.
The key advantage of boundary point-based methods is their high-precision handling of arbitrarily shaped text instances. By explicitly modeling text boundaries through key points, these methods can capture fine-grained details of text contours, even in complex backgrounds and layouts. Many of these methods employ an iterative refinement mechanism, which enables continuous improvement of boundary localization accuracy and ensures robust performance across various datasets. However, there are persistent challenges. One is the need for accurate initial boundary proposals, and the other is the computational complexity associated with iterative refinement. To tackle these issues, researchers have proposed several improvements. For example, incorporating multi-scale features can enhance the robustness of initial proposals [
36], and leveraging advanced architectures like deformable convolutions [
37] can improve feature extraction. Zheng et al. [
36], for instance, put forward a method for the dynamic optimization of boundary points. This method iteratively refines boundary points based on adjacent region information, thereby improving the detection accuracy of arbitrarily shaped scene text.
Boundary point-based methods have emerged as a reliable solution for detecting arbitrarily shaped text in complex scenes. These approaches utilize adaptive boundary modeling and iterative refinement to achieve state-of-the-art performance on challenging benchmarks while maintaining flexibility for diverse text geometries. The advantages and disadvantages of the Boundary Point-Based scene text detection method are systematically analyzed as follows:
- (1)
Advantages
Geometric Adaptability: The explicit boundary point representation enables precise localization of curved, rotated, and irregular text instances, outperforming traditional bounding box-based methods [
33,
34].
Progressive Optimization: Integrated refinement mechanisms (e.g., GCNs/RNNs) progressively enhance boundary precision through multi-stage processing, particularly effective for low-contrast or occluded text [
33].
Benchmark Performance: Demonstrated superiority on standard datasets (CTW1500, Total-Text) validates the paradigm’s effectiveness, with methods like TextBPN establishing new performance benchmarks [
34].
- (2)
Disadvantages
Proposal Dependency: Detection accuracy exhibits sensitivity to initial proposal quality, with performance degradation occurring when coarse proposals fail to capture basic text geometry [
36].
Computational Overhead: The iterative refinement process, while improving accuracy, introduces notable computational overhead that may limit real-time applicability [
37].
Future research directions should address these limitations through: (1) enhanced proposal generation networks to reduce initial prediction variance, (2) optimized refinement architectures to balance accuracy and efficiency, and (3) synergistic integration with complementary approaches (e.g., transformer-based recognition) to develop unified text spotting systems.
3.3. Single-Stage Detection Method
Single-stage detection methods aim to simplify the text detection pipeline by integrating the region proposal and bounding box regression steps into a single network, thereby improving inference speed and reducing computational overhead. These methods typically predict the coordinates of text bounding boxes directly from the input image without an explicit region proposal stage.
3.3.1. Regression-Based Method
Regression-Based Methods represent a well-established and prominent category within the realm of single-stage text detection techniques. These methods approach text detection as a regression task, wherein the network is trained to directly predict the coordinates of text bounding boxes from the input image.
Among the pioneering works in regression-based text detection, the Connectionist Text Proposal Network (CTPN), proposed by Tian et al. [
31], stands out as a significant milestone. Building upon the Faster R-CNN framework, CTPN introduces a vertical anchor mechanism to detect text proposals in a sliding window fashion. More specifically, CTPN partitions the input image into a sequence of horizontal proposals and employs a bidirectional Long Short-Term Memory (LSTM) network to capture contextual information across these proposals. Subsequently, the network predicts vertical offsets and text/non-text scores for each proposal, enabling precise localization of text lines. CTPN has demonstrated remarkable improvements over prior methods on standard benchmarks, such as ICDAR 2013 and ICDAR 2015, thereby validating the efficacy of Regression-Based Methods in text detection.
Another noteworthy regression-based approach is the Efficient and Accurate Scene Text Detector (EAST), introduced by Zhou et al. [
32]. EAST streamlines the text detection pipeline by eliminating intermediate steps, such as candidate proposal generation and non-maximum suppression. Instead, it directly predicts the geometric properties of text instances (e.g., axis-aligned bounding boxes or rotated rectangles) using a fully convolutional network (FCN). The network architecture comprises a feature extraction backbone (e.g., VGG16 or ResNet), followed by several upsampling layers to restore the spatial resolution of feature maps. EAST then utilizes a single-shot detection head to predict the position and shape of text instances, achieving real-time performance while preserving high accuracy. The effectiveness of EAST has been demonstrated across various benchmarks, including ICDAR 2015, MSRA-TD500, and COCO-Text, further solidifying its position as a leading regression-based text detection method.
TextBoxes [
38] and its successor TextBoxes++ [
39] are also regression-based methods designed for text detection. TextBoxes adapts the Single Shot MultiBox Detector (SSD) architecture for text detection by modifying the default anchor boxes to better suit the aspect ratios and scales of text instances. Specifically, it introduces “tall” default boxes with larger height-to-width ratios to capture the elongated shape of text lines. TextBoxes++ further improves upon TextBoxes by incorporating multi-oriented text detection capabilities. It achieves this by adding more anchor orientations and using a rotation-invariant feature extraction strategy. Both TextBoxes and TextBoxes++ have shown competitive performance on various text detection benchmarks, demonstrating the versatility of regression-based methods in handling different text orientations and scales.
Additionally, YOLO (You Only Look Once) [
11], initially proposed for general object detection, has been adapted for scene text detection by modifying its architecture and loss function to better handle the unique characteristics of text, such as aspect ratio variations and multi-orientation. In the context of text detection, YOLO divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell. For text detection, a text/non-text class is used instead of traditional object categories. YOLO’s speed and efficiency make it suitable for real-time applications, although its accuracy on small or highly irregular text instances may be limited.
Regression-based methods for text detection offer a balanced combination of efficiency and versatility in handling standard text detection tasks. These approaches streamline the detection pipeline by directly predicting bounding boxes through integrated region proposal and regression mechanisms, eliminating the need for complex post-processing [
31,
32]. This architecture confers significant advantages for real-time applications where computational efficiency is paramount. Furthermore, their adaptability to multi-oriented text through rotated anchor designs or rotation-invariant features enhances their practical utility across diverse scenarios [
39].
The advantages and disadvantages of the regression-based scene text detection method are systematically analyzed as follows:
- (1)
Advantages
Computational Efficiency: The unified network architecture achieves faster inference speeds compared to segmentation-based pipelines, particularly beneficial for real-time implementations.
Orientation Robustness: Architectural enhancements like rotated anchors enable effective detection of multi-oriented text without compromising detection speed.
- (2)
Disadvantages
Geometric Constraints: The inherent limitations of rectangular bounding boxes lead to suboptimal performance on curved or irregular text geometries [
4].
Scale Sensitivity: Detection reliability decreases for small text instances or extreme aspect ratios due to regression instability [
31].
To mitigate these limitations, recent advancements have adopted hybrid strategies incorporating segmentation components or alternative shape representations (e.g., Bezier curves) while maintaining the efficiency benefits of regression frameworks. This evolutionary direction demonstrates the ongoing refinement of regression-based paradigms to address complex text detection challenges.
3.3.2. Segmentation-Based Method
Segmentation-based methods constitute a significant category within the realm of deep learning for text detection and recognition, particularly in the context of scene text detection. Unlike regression-based methods, which directly predict the bounding boxes of text instances, segmentation-based approaches treat text detection as a pixel-wise classification problem. This section delves into the core principles, recent advancements, and notable contributions of segmentation-based methods in text detection and recognition, with a focus on their evolution in handling the complexities of scene text.
The fundamental idea behind segmentation-based methods is to classify each pixel in an image as either belonging to a text region or the background. This is typically achieved through the use of fully convolutional networks (FCNs) that output a probability map indicating the likelihood of each pixel being part of a text instance. Subsequently, post-processing steps such as thresholding, connected component analysis, or contour extraction are applied to generate the final bounding boxes or polygons of the detected text.
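As a concrete illustration of this pipeline, the sketch below binarizes a predicted text probability map and groups the surviving pixels into instances via connected component analysis; the threshold, minimum-area filter, and toy probability map are assumptions for illustration rather than settings from any particular method.

```python
import numpy as np
from scipy import ndimage

def prob_map_to_boxes(prob_map: np.ndarray, threshold: float = 0.5, min_area: int = 10):
    """Binarize a text probability map, label connected components,
    and return one axis-aligned box (x, y, w, h) per component."""
    binary = prob_map > threshold
    labels, num = ndimage.label(binary)               # connected component analysis
    boxes = []
    for idx in range(1, num + 1):
        ys, xs = np.nonzero(labels == idx)
        if xs.size < min_area:                        # discard tiny noise regions
            continue
        x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
        boxes.append((int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)))
    return boxes

# Toy probability map with one bright "text" blob.
pm = np.zeros((32, 64), dtype=np.float32)
pm[10:20, 15:50] = 0.9
print(prob_map_to_boxes(pm))  # [(15, 10, 35, 10)]
```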
One of the pioneering works in the segmentation-based text detection domain is the modified Fully Convolutional Network (FCN) approach by Liu et al. [
40]. This method employs two labels of word stroke regions (WSRs) and text center blocks (TCBs) to predict text regions at a pixel level. By leveraging multi-level convolutional features, it demonstrates effectiveness in detecting scene text, especially in complex backgrounds. The use of both WSRs and TCBs enables the method to capture detailed text structure and context, leading to improved detection accuracy. TextSnake [
4] is another representative work that models a text instance as a sequence of ordered, overlapping disks centered along the text centerline. Each disk is characterized by its radius, orientation, and a binary mask indicating whether the pixel belongs to the text region. This flexible representation allows TextSnake to effectively detect text of arbitrary shapes, including curved and oriented text, thus addressing a significant challenge in scene text detection. By decomposing text instances into a sequence of disks, TextSnake can handle complex text layouts and variations in scale and orientation.
The Progressive Scale Expansion Network (PSENet) [
41] introduces a progressive scale expansion strategy for detecting text instances with arbitrary shapes. PSENet generates multiple scale-expanded kernels for each text instance, starting from a small kernel that captures the core region of the text and gradually expanding to cover the entire text area. This strategy allows PSENet to effectively handle text instances with complex shapes and varying scales. By combining the information from all scale-expanded kernels, PSENet can accurately segment text regions even in the presence of heavy occlusion or cluttered backgrounds, and has shown superior performance on challenging datasets such as Total-Text and CTW1500. The Pixel Aggregation Network (PAN) [
42] combines the strengths of segmentation-based and regression-based methods by introducing a pixel aggregation mechanism. It first uses an FCN to generate a score map indicating the text regions, and then employs a pixel aggregation algorithm to group pixels into text instances. PAN achieves high efficiency and accuracy by leveraging the spatial relationships between pixels, making it suitable for real-time applications.
In addition to existing approaches, Liu et al. [
43] introduced the Multi-level Feature Enhanced Cumulative Network (MFECN) for scene text detection, which is grounded in segmentation-based principles. The MFECN framework is designed to effectively handle scene text instances with irregular shapes by leveraging rich hierarchical multi-level features. It comprises two key modules: the Multi-level Features Enhanced Cumulative (MFEC) module and the Multi-level Features Fusion (MFF) module. The MFEC module is responsible for capturing features that exhibit cumulative enhancement of representational ability, while the MFF module ensures the thorough integration of both high-level and low-level features derived from the MFEC module. By enhancing the representational capability of the in-network hierarchical features, MFECN achieves performance that is either superior or comparable to other state-of-the-art methods.
As illustrated in
Figure 2, the MFECN framework consists of five integral components. First, the Multi-Stage Features Extraction component extracts features at different stages. Subsequently, the Multi-Scale Features Enhancement Cumulation component further processes these features to enhance their cumulative representational power. The Multi-Level Features Fusion component then combines the enhanced features from different levels. After that, the Text Proposal Generation component generates potential text proposals based on the fused features. Finally, the Text Instance Prediction component makes the final predictions of text instances. In the following sections, we will delve into each component in detail and also discuss the training strategy employed for the MFECN framework.
Segmentation-Based Methods have emerged as a powerful paradigm in text detection and recognition, offering distinct advantages over Regression-Based Methods. Their pixel-wise classification approach enables precise localization and rich feature extraction, making them particularly suitable for complex real-world scenarios.
The advantages and disadvantages of the Segmentation-Based scene text detection method are systematically analyzed as follows:
- (1)
Advantages
Arbitrary Shape Support: Pixel-wise classification allows these methods to naturally handle text instances of any shape or orientation, providing richer information about text regions [
4,
40].
Robustness to Noise and Occlusions: Segmentation-based approaches can still detect partial text regions even when the entire instance is not fully visible, enhancing robustness in cluttered scenes [
41,
42].
- (2)
Disadvantages
Post-Processing Complexity: The need for thresholding, connected component analysis, or contour extraction adds computational overhead and may introduce errors if not carefully tuned and optimized [
32].
Memory Intensity: FCNs used in segmentation-based methods often require significant memory resources, limiting their scalability to high-resolution images [
10].
The qualitative detection results of the Segmentation-Based scene text detection method, MFECN [
43], on publicly accessible datasets (CTW1500, Total-text, MSRA-TD500, ICDAR2013, and ICDAR2015) are depicted in
Figure 3. Furthermore,
Figure 4 illustrates the failure cases of the Segmentation-Based scene text detection method, MFECN [
43], in relevant test scenarios.
3.4. Hybrid and End-to-End Method
In the domain of deep learning for text detection and recognition, Hybrid and End-to-End methods stand out as highly effective strategies for tackling the challenges presented by complex scenes and the wide variety of text appearances. These two approaches have garnered considerable attention in the research community due to their distinct yet complementary strengths.
Hybrid methods integrate the strengths of regression-based and segmentation-based techniques, aiming to capitalize on the advantages of both paradigms. Regression-based methods are adept at quickly pinpointing potential text regions, providing a coarse localization of text within an image, while segmentation-based methods excel at precisely delineating the boundaries of these regions, ensuring accurate text localization. Hybrid methods [
41,
44] typically employ a combination of these two strategies. For instance, some approaches first utilize regression to swiftly identify potential text areas and then apply segmentation to refine the boundaries of these regions, thereby enhancing the overall accuracy and robustness of text detection. Although implementation details vary across individual hybrid methods, the general concept of merging regression and segmentation has been extensively explored, consistently demonstrating its potential to significantly improve text detection performance.
In contrast, end-to-end methods represent a paradigm shift by integrating text detection and recognition into a single, unified framework. This integration not only streamlines the overall pipeline but also facilitates seamless information flow between the two tasks. By jointly optimizing text detection and recognition within a single framework, end-to-end methods enable the sharing of valuable information between the tasks, often leading to enhanced overall performance.
Among the end-to-end methods, Mask TextSpotter [
22] is a notable example. It leverages the Mask R-CNN architecture to achieve joint text detection and recognition. By incorporating a mask prediction branch, Mask TextSpotter is capable of accurately segmenting text instances from the background while simultaneously recognizing the text content. This approach has yielded promising results across various benchmarks, particularly in scenarios where text instances are closely spaced or overlapping, showcasing its robustness and effectiveness.
Another significant contribution to the end-to-end text spotting literature is ABCNet (Adaptive Bezier Curve Network) [
1]. ABCNet introduces a novel approach by utilizing Bezier curves to describe arbitrary-shaped text instances. Traditional methods typically rely on rectangular or quadrilateral bounding boxes, which may not accurately capture the shape of curved or irregular text. In contrast, ABCNet adapts Bezier curves to fit the shape of text instances more precisely, especially in scenarios involving curved or irregular text. This innovation has enabled ABCNet to achieve state-of-the-art performance on curved text datasets, underscoring the effectiveness of incorporating geometric priors into the text detection and recognition pipeline.
In summary, hybrid and end-to-end methods are two distinct yet complementary paradigms for text detection and recognition in deep learning. Hybrid methods combine regression-based and segmentation-based approaches to enhance accuracy and robustness, while end-to-end methods integrate detection and recognition into a unified framework to optimize information flow and task performance. The continuous development of these methodologies has been driving the progress of state-of-the-art in scene text analysis. Here, we conduct a systematic analysis of their respective advantages and disadvantages to offer a comprehensive comparison.
The advantages and disadvantages of the Hybrid-Based scene text detection method are systematically analyzed as follows:
- (1)
Advantages
Combined Strengths: By integrating regression-based and segmentation-based techniques, hybrid methods leverage the speed of regression for coarse localization and the precision of segmentation for boundary refinement, achieving a balance between accuracy and efficiency [
41,
44].
Improved Robustness: The combination of approaches enhances accuracy and robustness, particularly in challenging scenarios with overlapping or closely spaced text instances, where individual methods may struggle [
41].
- (2)
Disadvantages
Design Complexity: Hybrid methods require careful architectural design to ensure effective integration of regression and segmentation components, which can be non-trivial and time-consuming [
1].
Potential Conflicts: The optimization objectives of regression and segmentation may not always align, leading to potential conflicts during training that need to be resolved through careful tuning and optimization [
37].
In contrast, End-to-End methods offer a more holistic approach by unifying text detection and recognition within a single framework. This integration facilitates a seamless flow of information between tasks, often enhancing overall performance and coherence [
1,
22]. Moreover, end-to-end training enables the joint optimization of detection and recognition components, capitalizing on the synergies between these tasks to yield more consistent and accurate results [
1,
22].
The advantages and disadvantages of the End-to-End-Based scene text detection method are systematically analyzed as follows:
- (1)
Advantages
Seamless Information Flow: Integrating text detection and recognition into a single framework enables the sharing of valuable information between tasks, often improving overall performance and coherence [
1,
22].
Joint Optimization: End-to-end training allows for the simultaneous optimization of detection and recognition components, leading to more coherent and accurate results by leveraging the synergies between the two tasks [
1,
22].
- (2)
Disadvantages
Increased Model Complexity: End-to-end methods typically involve more complex architectures, which can be harder to train and require larger datasets to achieve good generalization and robustness [
1].
Task-Specific Challenges: Balancing the objectives of detection and recognition within a single framework can be challenging, especially when dealing with diverse text appearances and scene complexities, necessitating careful design and optimization strategies [
22].
3.5. Loss Functions and Optimization
In scene text detection, selecting an appropriate loss function is vital for optimizing model performance, especially for accurately localizing text regions with diverse shapes and orientations. Traditional object detection tasks mainly rely on intersection-over-union (IoU)-based losses to measure the overlap between predicted and ground-truth bounding boxes. However, for scene text detection, especially with arbitrarily-shaped text, additional considerations are required.
3.5.1. Loss Functions
IoU-based losses are widely used in text detection frameworks, alongside the smooth $L_1$ loss in bounding box regression, which is adopted in classic models like Faster R-CNN [9] and its variants. In object detection, the design of loss functions is crucial for enhancing model performance. For example, Fast R-CNN defines a multi-task loss function to optimize classification and regression tasks simultaneously:

$$L(p, u, t^{u}, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^{u}, v),$$

where $L_{cls}(p, u) = -\log p_u$ is the classification loss over the predicted class distribution $p$ and ground-truth class $u$, the indicator $[u \geq 1]$ activates the regression term only for non-background classes, and

$$L_{loc}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} R(t_i^{u} - v_i),$$

where $R$ is a robust loss function, specifically the smooth $L_1$ function:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^{2} & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$

This design helps the model better handle bounding box regression during training, improving detection accuracy by refining the position and size of bounding boxes around text regions.
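For concreteness, a minimal NumPy sketch of the smooth $L_1$ regression term defined above is given below; it is a generic illustration of the standard formulation, not code from any specific detector.

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """Element-wise smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

def box_regression_loss(pred_offsets: np.ndarray, target_offsets: np.ndarray) -> float:
    """Sum of smooth L1 over the (tx, ty, tw, th) offsets of one box."""
    return float(smooth_l1(pred_offsets - target_offsets).sum())

pred = np.array([0.1, -0.3, 2.0, 0.05])
target = np.array([0.0, 0.0, 0.5, 0.0])
print(box_regression_loss(pred, target))
```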
Nevertheless, IoU alone may not be sufficient for complex scenarios involving rotated or curved text, as it does not directly account for orientation or shape. To tackle these challenges, researchers have proposed specialized loss functions. For instance, Ma et al. [30] introduced a rotation-aware loss function for rotated text detection, incorporating the orientation angle into bounding box regression. A rotated proposal is parameterized as $(x, y, w, h, \theta)$ and regressed with

$$L_{reg}(t, t^{*}) = \sum_{i \in \{x, y, w, h, \theta\}} \mathrm{smooth}_{L_1}(t_i - t_i^{*}),$$

with regression targets

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}, \quad t_\theta = \theta - \theta_a,$$

where the subscript $a$ denotes the anchor parameters. This approach improves detection accuracy for multi-oriented text by better fitting inclined proposals to text regions.
Dice loss, initially introduced for image segmentation [
45], has also been applied to scene text detection, especially in segmentation-based methods like PSENet [
42]. At the pixel level, Dice loss measures the overlap between predicted and ground-truth masks, enabling precise capture of text region details, including shapes and orientations. Minimizing Dice loss during training encourages the model to generate accurate segmentation masks, which are then post-processed to extract bounding boxes for text recognition and analysis.
For PSENet [42], the overall loss function $L$ is a weighted combination of two components:

$$L = \lambda L_c + (1 - \lambda) L_s,$$

where $L_c$ and $L_s$ represent losses for complete and shrunk text instances, respectively, and $\lambda$ balances their relative importance. $L_c$ focuses on text/non-text segmentation using a training mask $M$ generated via Online Hard Example Mining (OHEM):

$$L_c = 1 - D(S_n \cdot M, \; G_n \cdot M),$$

where $D(S_i, G_i)$ is a dice coefficient measuring the similarity between predicted ($S_i$) and ground-truth ($G_i$) segmentation results:

$$D(S_i, G_i) = \frac{2 \sum_{x, y} S_{i, x, y} \, G_{i, x, y}}{\sum_{x, y} S_{i, x, y}^{2} + \sum_{x, y} G_{i, x, y}^{2}}.$$

$L_s$ addresses shrunk text instances, ignoring the pixels of non-text regions in the complete segmentation result $S_n$ using a mask $W$:

$$L_s = 1 - \frac{\sum_{i=1}^{n-1} D(S_i \cdot W, \; G_i \cdot W)}{n - 1}, \qquad W_{x, y} = \begin{cases} 1, & \text{if } S_{n, x, y} \geq 0.5, \\ 0, & \text{otherwise.} \end{cases}$$

This formulation enhances the model's ability to detect and segment text of varying scales and shapes.
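The following sketch follows the PSENet-style formulation above, assuming predicted and ground-truth kernel maps are provided as arrays in $[0, 1]$; for brevity the OHEM mask $M$ is replaced by an all-ones mask, so this is an illustrative simplification rather than the official implementation.

```python
import numpy as np

def dice_coefficient(s: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> float:
    """D(S, G) = 2*sum(S*G) / (sum(S^2) + sum(G^2))."""
    return float(2.0 * (s * g).sum() / ((s ** 2).sum() + (g ** 2).sum() + eps))

def psenet_style_loss(S: np.ndarray, G: np.ndarray, lam: float = 0.7) -> float:
    """S, G: (n, H, W) predicted / ground-truth kernel maps; index n-1 holds the
    complete text map. L = lam * Lc + (1 - lam) * Ls, with the OHEM mask M
    replaced by an all-ones mask for simplicity."""
    n = S.shape[0]
    Lc = 1.0 - dice_coefficient(S[-1], G[-1])          # complete text instances
    W = (S[-1] >= 0.5).astype(S.dtype)                 # mask from complete prediction
    Ls = 1.0 - np.mean([dice_coefficient(S[i] * W, G[i] * W) for i in range(n - 1)])
    return lam * Lc + (1.0 - lam) * Ls

S = np.random.default_rng(1).random((3, 32, 32))
G = (S > 0.5).astype(np.float32)
print(psenet_style_loss(S, G))
```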
Besides IoU-based and Dice losses, focal loss [
46] has been explored to address class imbalance issues in text detection, where the number of background pixels often far exceeds that of text pixels. Focal loss reduces the contribution of easy examples, allowing the model to focus more on hard examples and thus improving overall performance. The focal loss function is formulated as:
$$FL(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \log(p_t),$$

where $p_t$ is the model's estimated probability for the true class, $\alpha_t$ is a weighting factor for class $t$, and $\gamma$ is a tunable focusing parameter.
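A minimal sketch of a binary focal loss consistent with this formula is shown below; the default values $\alpha = 0.25$ and $\gamma = 2$ follow common practice and are assumptions rather than values tuned for any particular text detector.

```python
import numpy as np

def binary_focal_loss(p: np.ndarray, y: np.ndarray,
                      alpha: float = 0.25, gamma: float = 2.0) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    with p_t = p for positives (y=1) and 1 - p for negatives (y=0)."""
    eps = 1e-7
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)))

probs = np.array([0.9, 0.2, 0.7, 0.05])   # predicted text probabilities
labels = np.array([1, 0, 1, 0])            # 1 = text pixel, 0 = background
print(binary_focal_loss(probs, labels))
```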
3.5.2. Optimization Strategies
Optimization strategies are also essential for training deep learning models for scene text detection. Stochastic Gradient Descent (SGD) and its variants, such as Adam [
47], are commonly used to update the model's parameters during training. The SGD update rule is:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} J(\theta_t),$$

where $\theta_t$ represents the model parameters at iteration $t$, $\eta$ is the learning rate, and $\nabla_{\theta} J(\theta_t)$ is the gradient of the loss function $J$ with respect to the parameters $\theta_t$ at iteration $t$. Adam's update rule is more complex, combining the advantages of momentum and adaptive learning rates.
Learning rate schedules, such as step decay or cosine annealing, are often used to ensure stable convergence and prevent overfitting. For step decay, the learning rate is reduced by a factor $\gamma$ after every $k$ epochs, as shown in:

$$\eta_e = \eta_0 \cdot \gamma^{\lfloor e / k \rfloor},$$

where $\eta_0$ is the initial learning rate and $e$ is the current training epoch number.
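A minimal sketch of the step decay schedule above (the drop factor and interval are illustrative values):

```python
def step_decay_lr(initial_lr: float, epoch: int, drop_factor: float = 0.1,
                  epochs_per_drop: int = 30) -> float:
    """eta_e = eta_0 * gamma^floor(e / k): multiply the learning rate by
    `drop_factor` after every `epochs_per_drop` epochs."""
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

for e in (0, 29, 30, 60, 90):
    print(e, step_decay_lr(1e-3, e))
```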
Moreover, techniques like data augmentation (e.g., random rotations, flips, and color jittering) are employed to increase the diversity of the training data, enhancing the model’s robustness to variations in text appearance.
3.6. Annotation Bounding Box Formats in Scene Text Detection
In the domain of scene text detection, the selection of an appropriate bounding box representation is crucial, as it directly influences the geometric fitting accuracy and computational efficiency of the detection system. Different bounding box formats offer varying levels of geometric expressiveness, enabling them to handle text with different orientations and shapes. This section presents a systematic analysis of five commonly used annotation formats in benchmark datasets, focusing on their mathematical formulations, implementation strategies, and relative merits.
3.6.1. Format Specifications
To provide a clear comparison, we first summarize the mathematical representations and parameter counts for each bounding box format in
Table 1.
Rectangular boxes represent the simplest format, defined by the top-left coordinates $(x, y)$ along with width $w$ and height $h$, as shown in
Table 1. Despite their computational efficiency, rectangular boxes often fail to accurately fit non-horizontal text, leading to the inclusion of background pixels and potential performance degradation [
21].
To enhance the representation of oriented text, rotated rectangles introduce an orientation angle $\theta$. The transformation matrix for a rotated rectangle can be formulated as:

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix},$$

which rotates the four box corners about the box center.
This modification improves the handling of multi-oriented text. However, rotated rectangles still struggle to accurately approximate curved text [
30].
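As an illustration of this parameterization, the sketch below converts a $(c_x, c_y, w, h, \theta)$ rotated box into its four corner coordinates; the angle convention (counter-clockwise, in radians, about the box center) is an assumption, since annotation conventions differ across datasets.

```python
import numpy as np

def rotated_rect_corners(cx: float, cy: float, w: float, h: float, theta: float) -> np.ndarray:
    """Return the 4 corners of a box centered at (cx, cy) rotated by theta (radians)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Corners of the axis-aligned box, expressed relative to the center.
    local = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                      [ w / 2,  h / 2], [-w / 2,  h / 2]])
    return local @ R.T + np.array([cx, cy])

print(rotated_rect_corners(50, 20, 40, 10, np.deg2rad(30)))
```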
For text regions affected by perspective distortion, quadrilaterals offer a more suitable representation by modeling the text as a 4-point convex polygon $Q = \{(x_i, y_i)\}_{i=1}^{4}$. The convex hull of these points, $\mathrm{Conv}(Q)$, is the smallest convex region enclosing the four annotated vertices and defines the text region.
Quadrilaterals have been adopted by datasets such as ICDAR2015 [
13] due to their ability to better fit perspective-distorted text. Nevertheless, they remain limited in representing concave shapes.
To further generalize the representation of text contours, polygonal annotations employ $n$-vertex polygons. The polygon can be represented as the union of line segments connecting consecutive vertices:

$$P = \bigcup_{i=1}^{n} \overline{p_i \, p_{i+1}},$$

where $p_{n+1} = p_1$ to ensure closure. Polygonal annotations, as used in the Total-Text dataset [
14], provide precise capture of irregular text contours. However, this precision comes at the cost of increased annotation complexity.
Finally, for highly curved text, Bézier curves offer an elegant parametric solution. The position of a point on a Bézier curve can be calculated as:

$$B(t) = \sum_{i=0}^{n} \binom{n}{i} (1 - t)^{n - i} \, t^{i} \, P_i, \quad 0 \leq t \leq 1,$$

where $P_i$ are the control points that define the curve's shape. ABCNet [1] demonstrates the effectiveness of Bézier curves, particularly cubic Bézier curves with $n = 3$, in accurately modeling curved text. This format provides a powerful tool for detecting text with complex shapes, although it may require more sophisticated algorithms for processing and interpretation.
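To make the parametric form concrete, the sketch below evaluates a cubic Bézier curve ($n = 3$) from four control points; the control points are illustrative, not taken from any dataset annotation.

```python
import numpy as np
from math import comb

def bezier_point(control_points: np.ndarray, t: float) -> np.ndarray:
    """B(t) = sum_i C(n, i) * (1 - t)^(n - i) * t^i * P_i for t in [0, 1]."""
    n = len(control_points) - 1
    coeffs = np.array([comb(n, i) * (1 - t) ** (n - i) * t ** i for i in range(n + 1)])
    return coeffs @ control_points

# Cubic Bezier (n = 3) approximating one side of a curved text boundary.
P = np.array([[0.0, 0.0], [30.0, 20.0], [70.0, 20.0], [100.0, 0.0]])
samples = np.stack([bezier_point(P, t) for t in np.linspace(0.0, 1.0, 5)])
print(samples)
```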
3.6.2. Comparative Analysis
To provide a comprehensive evaluation of different bounding box formats, we present a performance comparison in
Table 2, utilizing metrics derived from recent benchmark studies. This analysis aims to elucidate the trade-offs between geometric precision, computational efficiency, and annotation complexity across various formats.
Based on the data presented in
Table 2, we can draw several key insights regarding the performance characteristics of each bounding box format:
- (1)
Geometric precision
In the context of curved text detection, Bézier curves demonstrate a significant advantage, achieving a 19% higher Intersection over Union (IoU) compared to rectangular boxes on the CTW1500 dataset [
48]. Polygons also perform well, closely following Bézier curves in terms of precision. Conversely, rectangular boxes exhibit the lowest precision due to their inability to accurately fit non-horizontal text, often resulting in the inclusion of background pixels.
- (2)
Computational efficiency
Rotated rectangles offer a favorable balance between precision and computational complexity, requiring only five parameters (as shown in
Table 1) compared to the eight parameters needed for quadrilaterals. This reduction in parameter count makes rotated rectangles particularly suitable for real-time scene text detection systems, such as RRPN [
30,
32].
- (3)
Annotation complexity
The annotation process becomes increasingly complex as the geometric expressiveness of the bounding box format increases. Polygonal annotations, for instance, demand approximately 2.6 times more annotation time than quadrilaterals [
49]. Bézier curves, while offering high precision, strike a balance between annotation complexity and computational efficiency, making them a viable option for datasets where curved text is prevalent [
1].
Ultimately, the choice of bounding box format depends on the specific requirements of the application. Bézier curves excel in scenarios involving curved text, as demonstrated by their effectiveness in ABCNet [
1]. For oriented text detection, rotated rectangles provide a sufficient level of precision with reduced computational overhead [
30]. In comprehensive benchmarks where high precision is paramount, polygonal annotations remain the gold standard, despite their higher annotation costs [
14].
3.6.3. Implementation Considerations
The implementation of different bounding box formats in modern scene text detection frameworks involves several key considerations, including the design of loss functions, post-processing techniques, and hardware acceleration strategies. These factors collectively influence the overall performance and efficiency of the detection system.
- (1)
Loss functions
The choice of loss function plays a crucial role in training models to accurately predict bounding box coordinates. For polygonal annotations, the Dice loss [45] has been shown to outperform traditional IoU-based losses, particularly in scenarios involving complex text shapes. The Dice loss can be formulated as:
$$L_{\text{Dice}} = 1 - \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|},$$
where $Y$ represents the ground truth mask and $\hat{Y}$ denotes the predicted mask. This loss function emphasizes the overlap between the predicted and ground truth regions, making it well-suited for tasks requiring precise boundary delineation (a minimal implementation sketch is given after this list).
- (2)
Post-processing
Post-processing techniques are essential for converting raw model outputs into meaningful bounding box representations. For quadrilaterals, Point-to-Quad assignment [
50] is commonly used to match predicted points to the corresponding vertices of the ground truth quadrilateral. In contrast, Bézier curves leverage differential geometry principles for fitting, enabling the accurate reconstruction of curved text shapes [
1]. These post-processing steps are critical for ensuring that the final detection results align with the geometric properties of the text regions.
- (3)
Hardware acceleration
The computational efficiency of different bounding box formats is also influenced by their compatibility with hardware acceleration techniques. Rectangular formats, for instance, can take advantage of GPU-optimized operations in deep learning frameworks like TensorRT, resulting in faster inference times. In contrast, polygonal formats often require custom kernel implementations to handle the increased complexity of vertex processing [
41]. Recent advancements, such as the development of efficient variants like TextBPN++ [
51], have sought to reduce the computational overhead associated with polygonal annotations through sparse representations and optimized algorithms.
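As referenced in the loss-function discussion above, the snippet below is a minimal soft Dice loss in PyTorch operating on a predicted probability map and a binary ground-truth text mask; it is a generic sketch of the formulation rather than the exact implementation used in [45].

```python
import torch

def dice_loss(pred_probs: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss: 1 - 2*sum(Y*Yhat) / (sum(Y) + sum(Yhat)), averaged over the batch."""
    pred = pred_probs.flatten(1)              # (N, H*W), probabilities in [0, 1]
    gt = gt_mask.flatten(1).float()           # (N, H*W), binary text mask
    intersection = (pred * gt).sum(dim=1)
    dice = (2 * intersection + eps) / (pred.sum(dim=1) + gt.sum(dim=1) + eps)
    return 1.0 - dice.mean()

# Example: a batch of two predicted text-region maps and their ground-truth masks.
loss = dice_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64) > 0.5)
```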
Recent trends in scene text detection research indicate a growing preference for Bézier and polygonal formats in high-precision applications, where accurate representation of complex text shapes is critical. The development of efficient variants and hardware acceleration techniques continues to drive the adoption of these advanced bounding box formats, enabling real-time performance without compromising detection accuracy.
4. Scene Text Recognition
Scene text recognition (STR) is a crucial task in computer vision, aiming to automatically transcribe textual information from natural scene images into machine-readable digital formats. With the advent of deep learning, significant progress has been made in STR, leading to the development of various sophisticated models. This section reviews the major deep learning approaches for STR, categorized into sequence-to-sequence (Seq2Seq) models, CTC-based models, Transformer-based models, Visual-based models, and discussions on character-level vs. word-level recognition.
4.1. Sequence-to-Sequence Models
Sequence-to-sequence (Seq2Seq) models have emerged as a cornerstone in scene text recognition (STR), leveraging their ability to effectively handle text as sequential data. One of the seminal works in this domain is the Convolutional Recurrent Neural Network (CRNN) [
20], which seamlessly integrates convolutional layers for feature extraction with recurrent layers (typically Long Short-Term Memory networks, LSTMs) for sequence modeling. This architecture excels in capturing both spatial features and temporal dependencies inherent in text sequences, demonstrating robust performance across diverse text orientations and scales. The use of Connectionist Temporal Classification (CTC) loss in CRNN facilitates end-to-end training without the need for explicit character-level annotations, streamlining the training process.
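To make the CRNN pipeline described above concrete, the following PyTorch sketch wires a small convolutional backbone to a bidirectional LSTM over the width-wise feature sequence and produces per-time-step character logits suitable for CTC training; the layer sizes are illustrative assumptions rather than the exact configuration of [20].

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Toy CRNN: CNN backbone -> width-wise feature sequence -> BiLSTM -> per-step logits."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),    # H/2, W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # H/4, W/4
        )
        feat_h = img_height // 4
        self.rnn = nn.LSTM(128 * feat_h, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)   # num_classes includes the CTC blank symbol

    def forward(self, x):                        # x: (N, 1, H, W)
        f = self.cnn(x)                          # (N, 128, H/4, W/4)
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # (N, T = W/4, C * H/4)
        out, _ = self.rnn(seq)
        return self.fc(out)                      # (N, T, num_classes)

logits = TinyCRNN(num_classes=37)(torch.randn(2, 1, 32, 100))  # T = 25 time steps
```

During training, these logits are log-softmaxed and, together with the target transcriptions, fed to a CTC loss.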
Building upon the success of CRNN, the Robust Scene Text Recognizer with Automatic Rectification (RARE) [
52] introduced significant advancements through the incorporation of Spatial Transformer Networks (STNs). STNs enable the model to automatically learn and apply geometric transformations to input images, rectifying irregularly shaped text into a canonical form with horizontally aligned characters. This rectification step significantly enhances recognition accuracy, particularly for text suffering from distortion or perspective effects.
Attention mechanisms have further propelled the performance of Seq2Seq models in STR. The Attention-based Scene Text Recognizer (ASTER) [
53] exemplifies this by integrating a sophisticated attention mechanism that dynamically focuses on relevant regions of the input image during the decoding process. This mechanism, often referred to as 2D attention, allows the model to attend to specific parts of the CNN-extracted feature maps when generating each output character, thereby capturing both global context and fine-grained details. This capability is particularly beneficial when dealing with long text sequences or complex backgrounds, where context information is crucial for accurate recognition.
As illustrated in
Figure 5, the proposed framework for irregular text recognition leverages attention mechanisms to dynamically adapt to text distortions. Extending this attention-based paradigm, the Show, Attend and Read (SAR) model [
54] introduces a hierarchical attention mechanism that integrates coarse-grained and fine-grained attention. This dual-level approach allows the model to simultaneously capture global contextual information (via coarse-grained attention) and local details (via fine-grained attention). By combining these perspectives, the SAR model effectively handles text instances of varying lengths and complexities, thereby enhancing recognition accuracy. The hierarchical attention design aligns seamlessly with the framework’s ability to implicitly manage text irregularities, as depicted in the figure, through a weakly supervised 2D attention module. This synergy ensures robust performance across diverse text layouts and distortions.
In summary, Seq2Seq models, through innovations such as CRNN, RARE, ASTER, and SAR, have demonstrated remarkable success in STR. These models leverage convolutional layers for feature extraction, recurrent layers for sequence modeling, and attention mechanisms for dynamic context capturing, collectively advancing the state-of-the-art in scene text recognition.
4.2. CTC-Based Approaches
Connectionist Temporal Classification (CTC) has been a cornerstone in many scene text recognition (STR) systems, especially when precise character-level alignments are not available [
18]. The CTC loss function elegantly addresses the challenge of variable-length sequences by marginalizing over all possible alignments between input features and output labels. This makes CTC-based models particularly well-suited for end-to-end training, as they do not require explicit character segmentation. The Convolutional Recurrent Neural Network (CRNN) architecture exemplifies the effective application of CTC in text recognition, where convolutional features are processed through recurrent layers before CTC decoding [
20]. CRNN treats recognition as a sequence labeling problem, predicting a probability distribution over possible characters at each time step, with the CTC loss enabling efficient training without character-level annotations.
Despite their strengths, CTC-based approaches have inherent limitations. The conditional independence assumption between output labels can lead to suboptimal performance, particularly for text instances with repeated characters or strong inter-character dependencies. Because the model does not explicitly model relationships between output labels, it may overlook contextual information that is crucial for accurate recognition, especially in scenarios requiring disambiguation based on surrounding characters.
To mitigate these limitations, recent advancements have explored hybrid approaches. Attention-based CTC models combine the alignment benefits of CTC with the contextual modeling capabilities of attention mechanisms [
55]. These models integrate an attention module to refine the alignment process, allowing the model to focus on relevant parts of the input sequence during predictions. Other innovations include combining CTC loss with sequence-to-sequence (Seq2Seq) loss functions, leveraging the strengths of both paradigms to improve recognition accuracy. Additionally, employing curriculum learning strategies to guide model training progressively has shown promise in enhancing the performance of CTC-based models [
56].
Despite the emergence of newer approaches, CTC-based methods remain relevant in contemporary text recognition systems due to their computational efficiency and straightforward implementation. They provide a solid foundation for building robust STR models, especially when combined with other techniques to address their limitations. As research in STR progresses, the refinement and integration of CTC-based methods with advanced architectures and training strategies are likely to further enhance their performance and applicability, ensuring their continued relevance in the field.
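For reference, the snippet below shows how the CTC objective discussed in this subsection is typically wired up with PyTorch’s built-in nn.CTCLoss; the tensor shapes, alphabet size, and label lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

T, N, C = 25, 2, 37              # time steps, batch size, classes (index 0 = CTC blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)  # recognizer outputs, shape (T, N, C)
targets = torch.randint(1, C, (N, 7))                # padded label indices, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 7, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
# The loss marginalizes over all alignments between the T frames and the target labels.
```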
4.3. Transformer-Based Architectures
The transformative success of transformer models in natural language processing (NLP) has sparked significant interest in adapting these architectures for scene text recognition (STR) tasks. Vision Transformers (ViTs) [
57] have emerged as a powerful alternative by processing images as sequences of patches, thereby enabling effective modeling of global contextual information. This capability is particularly crucial for STR, as it must handle text in natural scenes with varying orientations, scales, and fonts.
ViTSTR [
58] exemplifies the potential of pure transformer architectures for text recognition, achieving competitive performance while simplifying model design by eliminating recurrent or convolutional layers. However, transformers often benefit from combining with convolutional neural networks (CNNs) to compensate for their relatively weaker local feature extraction capabilities. The TrOCR model [
25] demonstrates this hybrid approach effectively, utilizing a CNN backbone for visual feature extraction followed by a transformer decoder for sequence generation.
As depicted in
Figure 6, the proposed SwinTextSpotter [
59] and its derived variants demonstrate a highly effective integration strategy built on Swin Transformers, which are characterized by a hierarchical architecture and a shifted window attention mechanism. The hierarchical design systematically captures multi-scale features from the input image, while shifted window attention lets the model attend to different image regions at modest computational cost. Together, these properties yield significant gains for both text detection and recognition. Moreover, the parallelizable nature of transformer architectures enables efficient training and inference on modern hardware accelerators, which makes Swin Transformers and their variants particularly appealing for large-scale STR applications where computational resources and processing speed are paramount.
Hybrid models that combine CNNs and Transformers have also been proposed for STR, aiming to leverage the strengths of both architectures. For instance, the Scene Text Recognition with a Single Visual Model (SVTR) [
26] decomposes a text image into small patches and processes them with a Transformer-based architecture. SVTR dispenses with sequential modeling entirely, relying solely on the visual model for feature extraction and text transcription. This approach results in a fast and accurate STR system, demonstrating the potential of hybrid architectures in achieving state-of-the-art performance.
Despite their strengths, transformer-based models face notable challenges. Their computational complexity scales quadratically with sequence length, which poses difficulties when processing long text sequences or high-resolution images. Furthermore, transformers demand large labeled datasets for effective training, constraining their utility in low-resource settings. Ongoing research aims to mitigate these issues by developing more efficient transformer variants and exploring semi-supervised and self-supervised learning strategies.
Transformer-based architectures have significantly advanced scene text recognition (STR) by effectively modeling global contextual information and long-range dependencies. Hybrid models that integrate CNNs and Transformers represent a promising avenue for further enhancing STR performance, leveraging the complementary strengths of both architectures. As research progresses, we anticipate the emergence of more innovative transformer-based solutions for STR tasks.
Figure 7 presents a performance comparison between two Transformer-based models, BUSNet [
60] and ABINet [
61], on difficult text recognition tasks. These tasks involve adverse conditions such as confusing fonts and blurred appearances. BUSNet demonstrates remarkable robustness and excels in recognizing hard examples, whereas ABINet encounters difficulties and fails in these cases. The visual comparison, which includes input images, ground truth, ABINet predictions, and BUSNet predictions (arranged from top to bottom), clearly highlights BUSNet’s superior capability in handling visually degraded text inputs.
Figure 8 showcases the failure cases of BUSNet [
60] on the IIIT5K dataset. Despite the relatively clear and seemingly simple input images, BUSNet still makes errors. For instance, it omits superscript characters, likely due to treating them as unimportant symbols. Moreover, the model exhibits limitations in handling non-English words, indicating restricted semantic reasoning capabilities. These failure cases provide valuable insights into the areas where BUSNet needs improvement.
4.4. Visual-Based Models
Visual-Based Scene Text Recognition (V-STR) models mainly utilize visual features extracted from scene text images for recognition tasks. With the emergence of deep learning techniques, especially Convolutional Neural Networks (CNNs) [
62,
63], these models have advanced significantly. CNNs exhibit excellent capabilities in capturing hierarchical visual features from images, which has driven the development of V-STR models.
One of the earliest and most influential V-STR models is the Convolutional Recurrent Neural Network (CRNN) [
20]. It combines CNNs for feature extraction and Recurrent Neural Networks (RNNs) for sequence modeling, treating scene text recognition as a sequence labeling problem. CNNs extract visual features from the input image, and RNNs process these features to predict a sequence of characters. This approach enables CRNN to handle variable-length text sequences naturally, facilitating end-to-end training without explicit character segmentation. However, CRNN and similar RNN-based models have limitations. RNNs, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), process sequences sequentially, leading to slow training and inference times for long text sequences. Moreover, their recurrent nature makes them prone to vanishing and exploding gradient problems [
64,
65], hindering the learning of long-range dependencies in text sequences.
To overcome these limitations, fully convolutional models have been proposed. For instance, the Fully Convolutional Network (FCN) approach [
17] for scene text recognition directly maps the input image to a sequence of character predictions using a series of convolutional layers. FCNs are faster and easier to train than RNN-based models, but they may struggle to capture the sequential nature of text effectively, especially for long or complex text sequences. Another significant advancement is the integration of attention mechanisms. The Show, Attend and Read (SAR) model [
54] introduces a 2D attention mechanism that allows the model to focus on different parts of the text image at each decoding step, mimicking human reading. This approach improves recognition performance and enhances the interpretability of the model’s decisions.
More recently, Transformer-based models have emerged as a powerful alternative to CNN-RNN and fully convolutional architectures in V-STR. Transformers, initially designed for natural language processing tasks [
24], have shown remarkable adaptability to computer vision tasks, including scene text recognition. The Vision Transformer (ViT) [
57] and its derivative, the Scene Text Transformer (ViTSTR) [
58], leverage self-attention mechanisms to capture both global and local visual features from scene text images. As shown in
Figure 9, the ViTSTR framework is adapted from the original design in [
58]. It starts by partitioning the input image into patches, which are then transformed into one-dimensional vector embeddings (flattened 2D patches). Each embedding is integrated with a learnable patch embedding and a position encoding, serving as the input to the encoder. The network is trained end-to-end to predict a sequence of characters, where [GO] indicates the start of a sequence, and [s] represents a space or the end of a character sequence.
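The patch-embedding step described above can be summarized in a few lines of PyTorch; this is a generic sketch whose patch size, embedding width, and input resolution are assumptions for illustration, not the exact ViTSTR configuration from [58].

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each flattened patch, and add position encodings."""
    def __init__(self, img_size=(32, 128), patch=16, dim=192):
        super().__init__()
        num_patches = (img_size[0] // patch) * (img_size[1] // patch)
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # patchify + linear projection
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))       # learnable position encoding

    def forward(self, x):                                  # x: (N, 1, 32, 128)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (N, num_patches, dim)
        return tokens + self.pos                           # ready for a Transformer encoder

tokens = PatchEmbedding()(torch.randn(2, 1, 32, 128))      # shape (2, 16, 192)
```

The resulting token sequence is what the Transformer encoder consumes before character predictions are read out.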
ViTSTR and its variants have achieved state-of-the-art performance on various benchmarks, demonstrating their effectiveness in scene text recognition. Nevertheless, Transformer-based models face challenges such as high computational complexity and memory requirements, especially when processing high-resolution images or long text sequences. Researchers are actively exploring techniques to mitigate these issues, including efficient attention mechanisms [
66], hierarchical architectures [
67], and model compression methods [
68].
In conclusion, Visual-Based Scene Text Recognition models have evolved from early CNN-RNN hybrid architectures to fully convolutional and Transformer-based models. Each generation of models has addressed the limitations of its predecessors while introducing new challenges and opportunities for further research. As the field advances, the integration of advanced visual feature extraction techniques with innovative sequence modeling methods will likely lead to more robust and efficient V-STR systems.
4.5. Character-Level vs. Word-Level Recognition
The choice between character-level and word-level recognition constitutes a fundamental design consideration in scene text recognition (STR) systems, each offering distinct advantages and trade-offs. Character-level recognition methods, such as ASTER [
53], treat each character as an independent classification task, providing flexibility in handling arbitrary text lengths and uncommon words not present in the training vocabulary. When combined with spatial transformer networks for rectification, this approach is particularly effective for distorted or perspective text, since characters are decoded one at a time along the rectified text line. However, this sequential processing increases computational demands, especially for long text sequences, and may neglect word-level semantic context, potentially reducing accuracy due to challenges in modeling long-range dependencies between characters [
55].
In contrast, word-level recognition methods, exemplified by NRTR [
69], process entire words or text lines as single units, leveraging linguistic context to improve accuracy. NRTR’s recurrence-free, attention-based architecture enables parallel processing and faster inference compared to character-by-character decoding, and it performs particularly well on common words or phrases within its training vocabulary. By exploiting statistical regularities of word formation and usage, word-level models can enhance recognition accuracy. However, their reliance on predefined vocabularies limits their ability to handle out-of-vocabulary words or uncommon character combinations, posing a significant drawback in applications where novel words are likely to be encountered.
To balance these trade-offs, many STR systems adopt hybrid approaches that combine character-level and word-level recognition. For instance, some architectures use character-level processing for initial recognition, leveraging its flexibility to handle arbitrary text lengths and complexities, and then refine the results using word-level language models or dictionaries to incorporate linguistic context and improve overall accuracy. This hybrid strategy allows systems to benefit from the strengths of both approaches while mitigating their limitations. ASTER [
53] demonstrates the effectiveness of character-level recognition in handling distorted text through spatial transformer networks, while NRTR [
69] showcases the efficiency of word-level recognition with its parallel processing capabilities. By integrating elements of both levels, modern STR systems can achieve high accuracy and efficiency across diverse text instances and application scenarios, making hybrid approaches a promising solution for optimal performance in STR systems.
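One simple way to realize the hybrid strategy sketched above is to keep a character-level prediction unless a word-level lexicon contains a sufficiently close entry; the example below, with a hypothetical lexicon and an edit-distance threshold, illustrates the idea rather than any specific published system.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (single-row formulation)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def refine_with_lexicon(char_level_pred: str, lexicon: list, max_dist: int = 2) -> str:
    """Keep the raw character-level output unless a lexicon word is within max_dist edits."""
    best = min(lexicon, key=lambda w: edit_distance(char_level_pred.lower(), w.lower()))
    return best if edit_distance(char_level_pred.lower(), best.lower()) <= max_dist else char_level_pred

print(refine_with_lexicon("PARKINC", ["parking", "entrance", "exit"]))  # -> "parking"
```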
5. Scene Text Spotting
Scene text spotting, the task of simultaneously detecting and recognizing text in natural images, has seen significant advancements with the advent of deep learning techniques. This section categorizes and reviews existing methods for scene text spotting, focusing on two-stage, unified, RoI-based, segmentation-based, transformer-based, and pre-trained language model-based approaches.
5.1. Two-Stage Methods
Two-stage methods for scene text spotting typically consist of a detection stage followed by a recognition stage. The detection stage localizes text instances within the image, often as bounding boxes or polygons, while the recognition stage processes these regions to extract the actual text content.
- (1)
Detection Stage
The detection stage often employs convolutional neural networks (CNNs) to generate text proposals. For instance, the Connectionist Text Proposal Network (CTPN) [
31] uses a vertical anchor mechanism to predict a sequence of fixed-width text proposals, regressing only their vertical extent, which are then linked to form text lines. Another popular method, EAST [
32], proposes a single-shot detector that directly predicts word or text-line level bounding boxes without anchor boxes or region proposal networks, achieving high efficiency and accuracy. For arbitrary-shaped text detection, methods like PSENet [
42] predict multiple kernels of different scales for each text instance, and the smallest kernels are progressively expanded to reconstruct the complete text regions.
- (2)
Recognition Stage
Once text instances are detected, the recognition stage processes these regions to extract the text content. Traditional methods often relied on hand-crafted features or shallow neural networks, but deep learning techniques, particularly recurrent neural networks (RNNs) and their variants like LSTMs [
64] and GRUs [
70], have become the standard for sequence modeling in text recognition. For example, the Convolutional Recurrent Neural Network (CRNN) [
20] combines CNNs for feature extraction and RNNs for sequence modeling, achieving end-to-end trainability for text recognition. More recently, attention mechanisms have been incorporated into text recognition models, such as the Attention-based encoder–decoder (AED) framework [
19], to dynamically weigh the importance of different parts of the input feature map when decoding each character.
- (3)
Integration of Detection and Recognition
The integration of detection and recognition stages is crucial for high performance. Some methods use a separate pipeline, where detected text regions are cropped and fed into the recognition model (a minimal sketch of this crop-and-recognize flow is given after this list), but this may introduce errors due to inaccurate detection or cropping. To mitigate this, methods like FOTS [21] propose a unified network that shares convolutional features between the detection and recognition branches, improving efficiency and synergy.
- (4)
Challenges and Limitations
Two-stage methods face several challenges. The accuracy of the detection stage heavily influences recognition performance, as errors can propagate. Separate processing of detection and recognition may lead to suboptimal feature utilization, and two-stage methods often have higher computational complexity compared to end-to-end methods.
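The crop-and-recognize flow referenced above can be summarized as follows; the detector and recognizer callables are hypothetical placeholders standing in for any detection and recognition model, and the fixed crop size is an assumption.

```python
import torch
import torch.nn.functional as F

def spot_text_two_stage(image, detector, recognizer, crop_size=(32, 100)):
    """Stage 1: detect boxes. Stage 2: crop each box, resize it, and recognize it."""
    # image: (1, 3, H, W); detector returns a list of (x1, y1, x2, y2) integer boxes.
    boxes = detector(image)
    results = []
    for (x1, y1, x2, y2) in boxes:
        crop = image[:, :, y1:y2, x1:x2]                        # cropped text region
        crop = F.interpolate(crop, size=crop_size, mode="bilinear", align_corners=False)
        results.append((recognizer(crop), (x1, y1, x2, y2)))    # (transcription, location)
    return results

# Hypothetical stand-ins so the sketch runs end to end.
dummy_detector = lambda img: [(10, 10, 110, 42)]
dummy_recognizer = lambda crop: "text"
print(spot_text_two_stage(torch.rand(1, 3, 256, 256), dummy_detector, dummy_recognizer))
```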
5.2. Unified Network Architectures
Unified network architectures for scene text spotting aim to integrate the text detection and recognition tasks into a single end-to-end trainable model. This approach not only simplifies the overall pipeline but also allows for the sharing of features between the two tasks, leading to improved synergy and performance.
- (1)
Early Attempts
One of the early attempts at creating a unified model for scene text spotting was presented by Li et al. [
71], who proposed a convolutional recurrent neural network (CRNN) that jointly performed text detection and recognition. However, this model still relied on heuristic post-processing steps, such as image cropping and feature re-calculation, which limited its end-to-end trainability.
- (2)
Fully End-to-End Trainable Models
The advent of fully end-to-end trainable models marked a significant milestone in scene text spotting research. Liao et al. [
39] introduced TextBoxes++, which used a single-shot detector for text detection and a connectionist temporal classification (CTC) decoder for text recognition, all within a unified network. This model eliminated the need for intermediate steps like image cropping, thereby improving efficiency and accuracy.
- (3)
Region Proposal Network (RPN) Based Unified Models
Another line of research focused on leveraging region proposal networks (RPNs) to generate text proposals, which were then processed by a shared backbone network for both detection and recognition. For instance, Liu et al. [
21] proposed FOTS (Fast Oriented Text Spotting), which used an RPN to generate text proposals and a shared convolutional network to extract features for both detection and recognition. This approach significantly reduced computation time by sharing features between the two tasks.
- (4)
Notable Unified Architectures
As depicted in
Figure 10, the E2E-MLT [
72] represents a pioneering effort in constructing an end-to-end trainable network specifically designed for multi-lingual scene text spotting. From a structural perspective, as shown in the figure, the framework of E2E-MLT is adapted from the original design in [
72]. The process initiates with the generation and filtering of text proposals by a detector based on EAST [
32]. Subsequently, each text proposal, while preserving its aspect ratio, is normalized into a fixed-height, variable-width tensor. Ultimately, each proposal is either assigned a multi-language text transcription and a script class or rejected as non-text.
In a broader architectural view, the network comprises a shared convolutional backbone dedicated to feature extraction. This is followed by distinct branches for text detection and recognition. The detection branch is responsible for predicting bounding boxes for text instances. On the other hand, the recognition branch decodes the features within these bounding boxes into text strings. The significance of this unified architecture lies in its ability to enable joint optimization of both the detection and recognition tasks. By doing so, it enhances the overall accuracy and efficiency of the multi-lingual scene text spotting process, making it a notable contribution in this research domain.
Liu et al. [
73] introduce a character-aware neural network (CharNet) for scene text spotting, which explicitly models characters within text instances. The network consists of a text detection branch that predicts character-level bounding boxes and a recognition branch that decodes these characters into text strings. By jointly optimizing both branches, CharNet can handle text instances with varying lengths and fonts, improving the robustness of the system. The character-level supervision also helps in learning more discriminative features for both detection and recognition.
Liu et al. [
1] propose a unified network (ABCNet) that models text instances as adaptive Bézier curves, allowing for more flexible handling of arbitrary-shaped text. The network consists of a Bézier curve detection branch that predicts control points for the Bézier curves and a recognition branch that decodes the features along these curves into text strings. ABCNet achieves real-time performance while maintaining high accuracy, demonstrating the potential of unified network architectures for practical applications.
- (5)
Transformer-Based Unified Models
With the success of Transformers in natural language processing and computer vision, researchers began exploring their application in scene text spotting. Carion et al. [
74] introduced DETR (DEtection TRansformer), which used a Transformer encoder–decoder architecture for object detection. Building on this, Huang et al. [
59] proposed SwinTextSpotter, which adapted the DETR framework for scene text spotting by incorporating a text-specific recognition head. This model achieved state-of-the-art performance on multiple benchmarks, demonstrating the effectiveness of Transformers in unified scene text spotting architectures.
- (6)
Enhancing Synergy Between Detection and Recognition
A key aspect of unified network architectures is the synergy between text detection and recognition tasks. Several methods have been proposed to enhance this synergy. For instance, Huang et al. [
59] introduced a Recognition Conversion mechanism in SwinTextSpotter, which explicitly guides text localization through recognition loss, leading to improved performance. Similarly, Lyu et al. [
6] proposed TextBlockV2, which leverages a pre-trained language model (PLM) for text recognition, thereby enhancing overall spotting performance without the need for precise text detection.
5.3. Region-of-Interest (RoI) Based Methods
RoI-based methods detect text regions first and then apply a recognition module within these regions, often facilitated by RoI pooling or RoI Align operations.
- (1)
RoI Pooling and RoI Align
RoI pooling, introduced in Fast R-CNN [
75], was one of the earliest techniques to enable the use of region proposals in object detection tasks. This technique was later adapted to scene text spotting, where it was used to extract features from detected text regions for recognition. However, RoI pooling suffered from quantization artifacts due to its coarse spatial quantization, which could degrade recognition performance, especially for small or irregularly shaped text instances. To address this limitation, RoI Align [
10] was proposed, which uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin. This technique significantly improved the accuracy of feature extraction, leading to better recognition performance. Several scene text spotting methods, such as Mask TextSpotter [
22], adopted RoI Align to enhance their recognition capabilities.
- (2)
Integration with Detection and Recognition
RoI-based methods facilitate the integration of detection and recognition modules by sharing a common backbone network for feature extraction. This shared backbone allows for efficient feature reuse, reducing computational overhead and improving overall system performance. For instance, in FOTS (Fast Oriented Text Spotting) [
21], a unified network is proposed that simultaneously detects and recognizes text instances by sharing convolutional features between the detection and recognition branches through RoI Align (a usage sketch of RoI Align-based feature extraction follows this list).
- (3)
Handling Arbitrary-Shaped Text
While traditional RoI-based methods were primarily designed for rectangular or axis-aligned text regions, recent advancements have extended these techniques to handle arbitrary-shaped text instances. Methods like TextDragon [
76] and ABCNet [
1] proposed novel RoI transformation mechanisms to convert irregularly shaped text regions into regularized feature maps suitable for recognition. TextDragon, for example, uses a RoISlide operation to slide a set of local units along the text region, capturing its shape and context information, which is then used for recognition.
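As referenced above, the example below uses torchvision.ops.roi_align to pool a fixed-size feature grid for each detected, axis-aligned text box from a shared feature map; the feature map size, boxes, and pooled resolution are illustrative assumptions rather than the settings of any particular spotter.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 64, 64)          # shared backbone features, stride 4 vs. the image
# Boxes in image coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0, 40.0, 80.0, 200.0, 112.0],
                      [0, 10.0, 10.0, 120.0, 40.0]])
# Pool each text region to a fixed 8x32 grid; spatial_scale maps image coords to feature coords.
pooled = roi_align(features, boxes, output_size=(8, 32), spatial_scale=0.25, sampling_ratio=2)
print(pooled.shape)   # torch.Size([2, 256, 8, 32]) -> fed to the recognition branch
```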
5.4. Segmentation-Based Methods
Segmentation-based methods have emerged as a powerful paradigm for scene text spotting, particularly in handling arbitrary-shaped text instances. Unlike traditional bounding box-based methods, segmentation-based approaches treat text detection as a pixel-level classification problem, aiming to segment text regions from the background. This section reviews several notable segmentation-based methods for scene text spotting, highlighting their contributions and advancements.
- (1)
Mask TextSpotter Series
Mask TextSpotter [
22] was one of the pioneering works that introduced segmentation into scene text spotting. It extended the Mask R-CNN framework [
10] by adding a character segmentation branch for text recognition, enabling the end-to-end training of detection and recognition tasks. Mask TextSpotter v2 [
5] further improved the robustness by adopting a segmentation proposal network (SPN) to generate accurate text proposals, especially for irregularly shaped text. SPN generates polygonal proposals that tightly enclose text instances, which are then used to extract RoI features for recognition.
- (2)
TextDragon and Boundary TextSpotter
As illustrated in
Figure 11, the TextDragon framework [
76] introduces an innovative approach for text detection and recognition by representing text shapes through a sequence of quadrangular regions. Rather than directly segmenting entire text regions, TextDragon adopts a strategy where text lines are divided into multiple segments. These segments are subsequently aggregated to reconstruct the complete text instance, facilitating a more adaptable handling of texts with irregular shapes. Following detection, the identified text regions undergo rectification via a specially designed RoISlide operator. This operator is responsible for extracting and rectifying features from the segmented regions, thereby preparing them for the subsequent recognition process. Boundary TextSpotter [
51] represents text instances as a set of boundary points, which are more flexible than bounding boxes or masks for describing arbitrary shapes. By predicting these boundary points, the model can easily transform irregularly shaped text into a horizontal region for recognition. This method simplifies the text recognition process by aligning text features to a canonical orientation, thereby improving recognition accuracy.
- (3)
Advantages and Challenges
Segmentation-based methods offer several advantages in scene text spotting. First, they can handle arbitrary-shaped text instances more naturally by treating text detection as a pixel-level classification problem. Second, these methods often achieve higher detection accuracy by capturing detailed text boundaries. However, they also face challenges such as the need for precise pixel-level annotations and the computational complexity associated with processing high-resolution feature maps.
5.5. Transformer-Based Methods
Transformer-based methods have recently gained significant attention in the field of scene text spotting due to their powerful capabilities in modeling long-range dependencies and global context information. These methods leverage the self-attention mechanism of transformers to improve the performance of both text detection and recognition tasks. This subsection reviews several notable transformer-based methods for scene text spotting.
- (1)
End-to-End Transformers
Early transformer-based approaches for scene text spotting often adopted an end-to-end training strategy, integrating both detection and recognition tasks within a single transformer architecture. For instance, Qiao et al. [
77] proposed MANGO, a Mask Attention Guided One-stage scene text spotter that utilizes a transformer decoder to generate text proposals and recognize text instances simultaneously. MANGO employs a mask attention mechanism to guide the transformer decoder to focus on relevant regions, improving the accuracy of both detection and recognition.
- (2)
Transformer-Enhanced Feature Extraction
Another line of research focuses on enhancing feature extraction using transformers. These methods typically use a transformer encoder to process convolutional features extracted from the input image, capturing global context information that is beneficial for both detection and recognition. For example, Huang et al. [
59] introduced SwinTextSpotter, which incorporates a Swin Transformer [
67] as the backbone network for feature extraction. The Swin Transformer’s hierarchical architecture and shifted window mechanism enable it to capture multi-scale features effectively, leading to improved performance in scene text spotting.
- (3)
Transformer-Based Text Recognition
In addition to end-to-end approaches, transformers have also been applied specifically to the text recognition task within scene text spotting. These methods often use a transformer decoder to decode the features extracted from the detected text regions into readable text. For instance, some approaches leverage the transformer’s ability to handle sequential data by treating the recognition task as a sequence-to-sequence problem, where the input is the feature sequence of the text region and the output is the corresponding text string.
- (4)
Hybrid Architectures
Several hybrid architectures have also been proposed to combine the strengths of transformers with other deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These architectures typically use a CNN for initial feature extraction, followed by a transformer for further processing and refinement. For example, some methods may use a CNN to extract local features and then apply a transformer to model the global context and dependencies among these features, improving the overall performance of scene text spotting.
- (5)
Advantages and Challenges
Transformer-based methods offer several advantages in scene text spotting. First, their ability to model long-range dependencies and global context information makes them well-suited for handling complex text layouts and backgrounds. Second, the end-to-end training strategy of many transformer-based approaches simplifies the training pipeline and improves the synergy between detection and recognition tasks. However, these methods also face challenges such as the high computational cost associated with transformer architectures and the need for large amounts of annotated data for training.
5.6. Pre-Trained Language Model-Based Methods
Pre-trained language models (PLMs), such as BERT [
78], GPT [
79], and their variants, have revolutionized natural language processing by demonstrating exceptional performance on a wide range of language understanding tasks. Recently, researchers have started to explore the integration of PLMs into scene text spotting to leverage their powerful language modeling capabilities for improving text recognition accuracy, especially in handling complex and ambiguous text instances.
- (1)
Integration of PLMs for Text Recognition
One of the primary ways PLMs are incorporated into scene text spotting is through the integration with the text recognition module. Traditional text recognition models often struggle with recognizing text that is occluded, distorted, or written in unconventional fonts. PLMs, on the other hand, can provide rich semantic context that helps disambiguate characters and improve recognition accuracy.
For instance, Lyu et al. [
6] proposed TextBlockV2, a scene text spotting framework that leverages a PLM for text recognition. Instead of relying solely on visual features extracted from the image, TextBlockV2 uses a coarse-to-fine detection strategy to generate text proposals, which are then refined and recognized using the PLM. The PLM takes the visual features of the text proposal along with the surrounding context (if available) and generates a more accurate recognition result by considering the semantic meaning of the text.
- (2)
Language Model-Guided Text Detection
In addition to improving text recognition, PLMs can also guide the text detection process by providing prior knowledge about the language and text structure. For example, some methods use PLMs to predict the likely locations of text in an image based on the semantic content of the scene or the context provided by nearby text instances.
While direct language model-guided text detection is still an emerging area, research by Yu et al. [
80] hints at the potential of combining visual and linguistic cues for more robust text detection. They propose a method that uses a PLM to generate textual descriptions of the scene, which are then used to refine the text detection results by considering the compatibility between the detected text and the scene context.
- (3)
Multi-Modal Learning with PLMs
Another promising direction is multi-modal learning, where PLMs are combined with visual models to jointly learn from both textual and visual information. This approach allows the model to capture the synergies between the two modalities, leading to improved performance in scene text spotting.
For example, Song et al. [
81] introduce a vision-language pre-training framework for scene text spotting, where a visual encoder and a PLM are jointly trained on a large corpus of image-text pairs. The visual encoder learns to extract meaningful visual features from the image, while the PLM learns to model the semantic relationships between the text and the image. During inference, the model can simultaneously detect and recognize text instances by leveraging both visual and textual information.
- (4)
Challenges and Future Directions
Despite the promising results, integrating PLMs into scene text spotting also poses several challenges. First, PLMs are typically large and computationally expensive, making them difficult to deploy in real-time applications. Second, the integration of PLMs requires careful design to ensure that the visual and textual information are effectively fused and utilized.
Future research in this area may focus on developing more efficient PLMs specifically tailored for scene text spotting, exploring novel multi-modal learning architectures that better leverage the strengths of both visual and textual models, and investigating the use of PLMs for handling more complex text spotting scenarios, such as multi-lingual text or text in low-resource languages.
6. Advanced Topics and Emerging Trends
6.1. Multi-Lingual and Multi-Script Support
In the field of scene text detection and recognition (STDR), the capacity to handle multiple languages and scripts has gained substantial importance. This is driven by the globalization of digital content and the heterogeneous nature of textual information encountered in natural scenes. In this subsection, we delve into the distinct challenges associated with multi-lingual and multi-script STDR, the proposed solutions to tackle these challenges, and the relevant datasets that facilitate research in this area.
- (1)
Challenges
The task of recognizing text across diverse languages and scripts poses significant obstacles. The primary challenges arise from the extensive diversity in character sets, writing systems (e.g., right-to-left scripts such as Arabic and Hebrew versus left-to-right scripts like Latin and Chinese), and linguistic structures. Scripts like Devanagari and Thai, with their complex glyph formations, further complicate the recognition process. Additionally, variations in font styles, sizes, and background conditions exacerbate the difficulty. A critical issue is the scarcity of labeled training data for many non-English languages and scripts, which severely restricts model development and generalization capabilities [
82].
- (2)
Solutions
In addressing the intricate challenges inherent in multi-lingual and multi-script scene text detection and recognition (STDR), researchers have devised a series of innovative strategies. These approaches are designed to bolster model performance across a wide spectrum of languages and scripts, drawing upon recent advancements in machine learning and natural language processing.
Firstly, Cross-Lingual Transfer Learning stands out as a pivotal technique. This method involves utilizing models pre-trained on expansive datasets, predominantly in English or synthetic data, and subsequently fine-tuning them on smaller, target-language-specific datasets. By doing so, it facilitates the transfer of knowledge from resource-rich languages to those with limited resources, thereby yielding substantial performance enhancements [
83]. This approach is particularly valuable in scenarios where labeled data for certain languages is scarce or non-existent.
Secondly, the development of Universal Language Models trained on vast multilingual corpora has emerged as a promising avenue. These models are capable of learning language-agnostic representations that are applicable across a diverse range of STDR tasks. Models such as CLIP [
16] showcase the potential of this approach by leveraging pre-trained vision-language knowledge to elevate recognition performance across various languages and scripts. The advantage of such models lies in their ability to generalize across languages, reducing the need for language-specific training data.
Lastly, the incorporation of Script-Aware Architectures into model design represents another effective strategy. This involves explicitly integrating script information into the model, which can be achieved through script-specific modules, attention mechanisms, script embeddings to condition model behavior, or script-specific convolutional kernels tailored to capture unique visual patterns. Transformer-based architectures, exemplified by TrOCR [
25], are frequently employed due to their proficiency in capturing long-range dependencies and context, which is indispensable for handling complex orthographies. Moreover, integrating language-specific knowledge, such as lexical constraints or language models, can further refine recognition results, particularly for scripts with intricate orthographic structures [
15]. This approach underscores the importance of tailoring model architectures to the specific characteristics of different scripts to achieve optimal performance.
- (3)
Datasets and Benchmarks
The advancement of multi-lingual scene text detection and recognition (STDR) has been significantly accelerated by the availability of specialized datasets and benchmarks. These resources serve as cornerstones for research progress, offering standardized evaluation platforms and a rich tapestry of training data. Below, we delve into some of the pivotal datasets that have played a transformative role in this field.
The ICDAR RRC-MLT-2019 dataset stands out for its dedicated focus on Multi-lingual Scene Text Detection and Script Identification. It encompasses a diverse array of images, each containing text in multiple languages, thereby providing a robust platform for evaluating and training STDR models across different scripts and languages [
82].
While datasets like Total-Text, CTW1500, and TextOCR were initially conceived with a primary emphasis on curved text detection, their utility extends far beyond this scope. These datasets also incorporate a substantial number of text instances in multiple languages, rendering them invaluable for research in multi-lingual STDR. Their inclusion of diverse text layouts and styles further enriches their value as comprehensive resources for model evaluation and training [
14,
84,
85].
Another notable dataset is HierText, which distinguishes itself by including multi-lingual text instances accompanied by hierarchical layout annotations. This feature enables joint text spotting and layout analysis, providing a holistic benchmark for assessing the performance of STDR models [
86] in multi-lingual settings. By offering a comprehensive evaluation framework, HierText facilitates the development of more accurate and robust algorithms capable of handling the complexities of multi-lingual text in natural scenes.
- (4)
Handling Adverse Conditions
In the realm of scene text detection and recognition (STDR), the challenges extend beyond multi-lingual text processing. A critical concern lies in recognizing text under adverse conditions, including low-resolution images, partial or complete occlusion, and curved text layouts. These conditions significantly degrade the quality of visual information available to the recognition system, thereby impacting its accuracy and robustness.
To tackle these challenges, researchers have proposed a range of solutions. Robust feature extraction techniques form the cornerstone of these approaches, as they enable the system to capture meaningful information even from degraded or distorted text instances. Enhanced contextual modeling further complements feature extraction by incorporating contextual cues from surrounding text or scene elements, aiding in disambiguating characters or words that may be difficult to recognize in isolation. For curved text, geometric transformations such as the Thin-Plate Spline (TPS) have proven effective in rectifying the layout, thereby transforming curved text into a more recognizable form. This transformation simplifies the subsequent recognition process and improves overall accuracy.
Moreover, attention mechanisms have emerged as a powerful tool in focusing the system’s attention on relevant image regions, even amidst noise or distortions. By dynamically weighting the importance of different parts of the input image, attention mechanisms enable the system to prioritize the most informative regions for recognition, thereby enhancing its robustness against adverse conditions. These combined strategies collectively contribute to improving the recognition accuracy of STDR systems under adverse conditions, paving the way for more reliable and versatile text recognition in real-world scenarios [
87].
- (5)
Methods for Robustness Against Noise and Distortions
Enhancing model resilience to noise and distortions involves several techniques. Realistic data augmentation can simulate noise conditions, while adversarial training can improve generalization capabilities. Developing noise-robust feature extractors is also crucial. Additionally, multi-task learning frameworks that jointly optimize detection and recognition tasks can contribute to overall robustness [
88].
6.2. Real-World and Challenging Scenarios
Scene text detection and recognition (STDR) in real-world applications often encounters a variety of challenging scenarios, including low-resolution images, occluded texts, and curved or distorted text layouts. These conditions significantly degrade the performance of traditional STDR models, which are typically trained on high-quality, well-annotated datasets. This section discusses the challenges associated with real-world and challenging scenarios in STDR and reviews the methods proposed to enhance model robustness against these conditions.
6.2.1. Handling Low-Resolution Texts
Low-resolution images are frequently encountered in surveillance videos, mobile camera captures, and various real-world situations where high-resolution imaging is impractical. The blurriness and absence of detail in low-resolution texts pose difficulties for models in accurately detecting and recognizing characters. To tackle this challenge, researchers have developed several techniques.
One approach involves using super-resolution (SR) algorithms. These algorithms enhance the resolution of blurred or low-resolution text images before text recognition. For instance, Wang et al. [
89] introduced TextSR, a content-aware text super-resolution network guided by recognition. This network integrates a super-resolution module with a text recognition network in an end-to-end manner. The super-resolution module improves text image quality by focusing on restoring text content, while the text recognition network provides feedback to guide the training of the super-resolution module, ultimately enhancing recognition performance.
Another technique is multi-scale feature fusion. Models incorporating this mechanism can capture both global context and local details, which is advantageous for recognizing low-resolution texts. Geng et al. [
90] proposed a Transformer-based multi-scale end-to-end approach that effectively handles texts of varying scales and resolutions.
6.2.2. Occluded Text Recognition
Occlusions, arising from overlapping objects, adverse lighting conditions, or image artifacts, frequently obscure segments of text, complicating the task of complete message interpretation for recognition models. To address this challenge, researchers have developed several strategies aimed at enhancing occluded text recognition. One prominent approach involves contextual modeling, where leveraging contextual information from surrounding text or image regions aids in inferring the content of obscured parts. Baek et al. [
83] demonstrated that incorporating language models or contextual embeddings significantly improves the recognition accuracy of occluded texts.
Another effective strategy employs attention mechanisms, particularly those that focus on non-occluded regions of the text. These mechanisms guide the model to disregard occluded areas and concentrate on interpretable segments, thereby enhancing recognition performance. Cheng et al. [
91] introduced a neighbor decoding approach that utilizes attention maps to improve accuracy on long and potentially occluded texts, showcasing the potential of attention-based methods in this domain.
6.2.3. Curved and Distorted Text Layouts
Curved or distorted text layouts are common in artistic designs, signboards, and natural scenes. Traditional STDR models, which assume rectangular or horizontal text regions, often struggle in these scenarios. To address curved and distorted texts, researchers have explored the following approaches.
Arbitrary-shape detection is one approach. Models capable of detecting text regions of arbitrary shapes, such as polygons or Bézier curves, are better suited for handling curved texts. Lyu et al. [
92] proposed a MaskOCR method that uses a masked encoder–decoder framework to recognize texts of arbitrary shapes.
Spatial Transformer Networks (STNs) are another useful tool. STNs can normalize distorted text regions into a more regular shape before recognition, allowing models to leverage their expertise in recognizing horizontal or rectangular texts. Liu et al. [
1] incorporated STNs into their ABCNet model to effectively handle curved texts.
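The rectification idea can be illustrated with a minimal spatial transformer that predicts an affine transform and resamples the input; many STR systems use a thin-plate spline rather than an affine map, but the grid-prediction-plus-sampling pattern is the same. The tiny localization network below is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineRectifier(nn.Module):
    """Predict a 2x3 affine transform from the image and resample it to a canonical view."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 6))
        # Initialize to the identity transform so training starts from "no rectification".
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, x):                               # x: (N, 3, H, W)
        theta = self.loc(x).view(-1, 2, 3)              # per-image affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

rectified = AffineRectifier()(torch.rand(2, 3, 32, 100))  # same shape, rectified content
```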
6.2.4. Robustness Against Noise and Distortions
Noise and distortions in images, such as Gaussian noise, motion blur, or perspective distortions, can significantly degrade STDR performance. To enhance model robustness, researchers have explored the following methods.
Data augmentation is one common technique. Introducing noise and distortions during training helps models generalize better to real-world conditions. Techniques such as random cropping, rotation, scaling, and adding Gaussian noise are frequently used.
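A typical augmentation pipeline of the kind described above can be expressed with torchvision transforms; the specific operations and parameters below are illustrative choices, with Gaussian noise added through a small lambda applied after tensor conversion.

```python
import torch
from torchvision import transforms

# Noise- and distortion-oriented augmentation pipeline (illustrative parameter choices).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=5),                        # small in-plane rotations
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),   # perspective distortion
    transforms.ColorJitter(brightness=0.4, contrast=0.4),        # lighting variation
    transforms.GaussianBlur(kernel_size=3),                      # defocus/motion-like blur
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0, 1)),  # Gaussian noise
])
# `augment` is applied to each PIL training image before it is fed to the detector/recognizer.
```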
Adversarial training is another approach. Training models on adversarial examples, which are designed to fool the model, can improve their robustness against unseen distortions. Goodfellow et al. [
93] introduced adversarial training as a method to enhance model robustness in image classification tasks, and similar approaches have been adapted for STDR.
Domain adaptation is also a viable strategy. Transferring knowledge from a source domain (e.g., synthetic data) to a target domain (e.g., real-world data) with different noise and distortion characteristics can improve model performance. Techniques such as unsupervised domain adaptation (UDA) and few-shot domain adaptation have been explored for STDR [
83,
94].
6.3. Few-Shot and Zero-Shot Learning
Techniques for training models with limited labeled data are of paramount importance in scene text detection and recognition (STDR), where the cost and difficulty of collecting and annotating large-scale datasets for every possible language, script, and domain are high. Few-shot and zero-shot learning paradigms aim to enable models to generalize to new, unseen classes or domains with minimal or no labeled data, respectively.
6.3.1. Few-Shot Learning in STDR
Few-shot learning in STDR focuses on training models that can quickly adapt to new text classes or styles using only a handful of labeled examples. This is particularly useful in scenarios where the target domain has limited labeled data, such as recognizing rare languages or specialized scripts.
Meta-learning, or learning to learn, is a popular approach in few-shot learning. It involves training a model on a variety of tasks (or episodes) during the meta-training phase, so that it can quickly adapt to new tasks during the meta-testing phase. In STDR, meta-learning has been applied to both text detection and recognition tasks. For instance, some methods employ model-agnostic meta-learning (MAML) [
95] to train a base model that can be fine-tuned on a few labeled examples from the target domain. Bhunia et al. proposed MetaHTR, a meta-learning framework for writer-adaptive handwritten text recognition, which demonstrated significant performance gains with very few new style data samples [
96]. This approach leverages character-wise cross-entropy loss and instance-specific weights to adapt quickly to new writing styles.
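A first-order variant of MAML can be sketched in a few lines; the snippet below is a generic illustration (not the MetaHTR algorithm itself), where each task is assumed to provide a small support set and a query set, e.g., a handful of word images from one writer.

```python
# First-order MAML sketch: adapt a copy of the model on each task's support set,
# then accumulate the query-set gradients of the adapted copies as the meta-gradient.
import copy
import torch
import torch.nn as nn

def fomaml_step(model: nn.Module, tasks, loss_fn, inner_lr=1e-2, outer_lr=1e-3, inner_steps=1):
    """tasks: list of ((x_support, y_support), (x_query, y_query)) tuples."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for (xs, ys), (xq, yq) in tasks:
        adapted = copy.deepcopy(model)                       # task-specific copy
        opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                         # inner-loop adaptation
            opt.zero_grad()
            loss_fn(adapted(xs), ys).backward()
            opt.step()
        adapted.zero_grad()                                  # evaluate on the query set
        loss_fn(adapted(xq), yq).backward()
        for g, p in zip(meta_grads, adapted.parameters()):   # first-order approximation:
            g += p.grad / len(tasks)                         # reuse query grads as meta-grads
    with torch.no_grad():                                    # outer (meta) update
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g
```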
To compensate for the lack of labeled data in few-shot scenarios, data augmentation techniques and synthetic data generation are often used. By applying transformations such as rotation, scaling, and perspective distortion to existing labeled data, or by generating synthetic text images using font rendering engines, researchers can create a larger and more diverse dataset for training. Jaderberg et al. generated synthetic text images to pre-train text recognition models, which were then fine-tuned on real data [
97]. Similarly, in few-shot learning scenarios, synthetic data can be used to augment the limited real data, helping models learn more robust features.
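The snippet below is a toy word renderer in the spirit of such engines, using only Pillow; the font path, size range, and colours are arbitrary assumptions, and real pipelines such as SynthText additionally model scene geometry, colour, and texture.

```python
# Toy synthetic-word generator: render a word with a random size and colour onto a
# plain light background (Pillow >= 8 assumed for ImageFont.getbbox).
import random
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, font_path: str = "DejaVuSans.ttf") -> Image.Image:
    font = ImageFont.truetype(font_path, size=random.randint(24, 48))
    left, top, right, bottom = font.getbbox(word)              # tight text extent
    w, h = right - left, bottom - top
    background = tuple(random.randint(150, 255) for _ in range(3))
    foreground = tuple(random.randint(0, 100) for _ in range(3))
    img = Image.new("RGB", (w + 16, h + 16), background)       # small margin around the word
    ImageDraw.Draw(img).text((8 - left, 8 - top), word, font=font, fill=foreground)
    return img
```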
Another strategy in few-shot learning is to leverage knowledge from related domains where labeled data is abundant. For example, a model pre-trained on a large-scale English text dataset can be fine-tuned on a few labeled examples from a target language or script. This transfer learning approach can help the model quickly adapt to the new domain by leveraging the shared features and patterns across languages and scripts. Models pre-trained on large-scale scene text datasets like ICDAR or COCO-Text can be fine-tuned on specific tasks with few labeled examples, and this approach [
83] has been shown to significantly improve performance on tasks with limited data.
6.3.2. Zero-Shot Learning in STDR
In STDR, zero-shot learning aims to recognize text classes or styles that were not encountered during training. This is accomplished by utilizing auxiliary information, such as language embeddings, script characteristics, semantic attributes, and knowledge graphs, to bridge the gap between seen and unseen classes.
One prevalent approach leverages language embeddings or script characteristics. For instance, a model can be trained to recognize text based on its visual features and corresponding language embedding. When encountering a new language or script, the model can infer the text content using the language embedding, even without prior exposure to examples of that language or script. Incorporating language priors and lexicons further enhances zero-shot capabilities. Nguyen et al. proposed a dictionary-guided scene text recognition method that leverages a dictionary to generate possible outcomes and selects the one that best matches the visual appearance of the text [
98]. This method can recognize text even if the exact words were not seen during training, provided they are present in the dictionary.
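A toy version of the lexicon-matching idea is sketched below: the recognizer's raw string is snapped to the closest dictionary entry by edit distance. This is a simplification of the dictionary-guided method in [98], which compares candidates against the visual appearance of the text rather than a plain string distance.

```python
# Snap a raw visual prediction to the nearest lexicon word by Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def dictionary_guided(raw_prediction: str, dictionary: list) -> str:
    """Return the lexicon word closest to the recognizer's raw output."""
    return min(dictionary, key=lambda w: levenshtein(raw_prediction, w))

# Example: dictionary_guided("c0ffee", ["coffee", "office", "offer"]) -> "coffee"
```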
Another effective strategy involves integrating semantic attributes or knowledge graphs into the model. By associating text classes with semantic attributes (e.g., “currency symbol” for the dollar sign) or linking them to a knowledge graph (e.g., WordNet), the model can infer the meaning of unseen text classes based on their relationships with seen classes. The advent of large-scale vision-language models like CLIP has further expanded zero-shot learning possibilities in STDR. Yu et al. [
99] demonstrated how a CLIP model can be transformed into a scene text spotter by leveraging its cross-modal alignment capabilities, enabling zero-shot recognition by aligning visual features with CLIP’s textual embeddings. Additionally, generative models such as GANs or VAEs can synthesize text images mimicking unseen text classes, potentially facilitating zero-shot recognition through training on these synthesized images, although this approach remains nascent and faces challenges in generating realistic and diverse text images.
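To make the CLIP-based idea concrete, the sketch below scores a single cropped word image against a small open vocabulary using the public OpenAI clip package; the prompt template, vocabulary, and image path are illustrative assumptions, and CLIP-based spotters such as [99] integrate this alignment far more tightly into detection and recognition.

```python
# Zero-shot word matching via CLIP's cross-modal alignment (cosine similarity).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

candidate_words = ["exit", "sale", "coffee", "parking"]        # open vocabulary of interest
text_tokens = clip.tokenize([f'a photo of the word "{w}"' for w in candidate_words]).to(device)
image = preprocess(Image.open("cropped_word.png")).unsqueeze(0).to(device)  # hypothetical crop

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = (img_feat @ txt_feat.T).squeeze(0)                # cosine similarities

print(candidate_words[scores.argmax().item()])                 # best-aligned word
```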
6.3.3. Applications in Domain Adaptation and Transfer Learning
Few-shot and zero-shot learning techniques have significant applications in domain adaptation and transfer learning for STDR. In domain adaptation, the goal is to adapt a model trained on a source domain to perform well on a target domain with different characteristics. Few-shot learning can be used to quickly fine-tune the model on a few labeled examples from the target domain, while zero-shot learning can enable the model to recognize text classes or styles that are not present in the source domain. In transfer learning, few-shot and zero-shot techniques can help leverage knowledge from related domains or tasks to improve performance on the target task. For instance, when deploying an STDR model in a new domain (e.g., from English to Chinese or from street signs to product labels), it may be impractical to collect and annotate a large amount of data in the target domain. In such cases, few-shot and zero-shot learning techniques can help the model quickly adapt to the new domain with minimal labeled data.
In conclusion, few-shot and zero-shot learning are promising research directions in STDR that aim to enable models to generalize to new, unseen classes or domains with minimal or no labeled data. By leveraging meta-learning, data augmentation, transfer learning, language embeddings, semantic attributes, knowledge graphs, and generative models, researchers are making significant progress towards this goal.
6.4. Self-Supervised and Unsupervised Learning
In the realm of scene text detection and recognition (STDR), self-supervised and unsupervised learning have emerged as promising paradigms to leverage the vast amount of unlabeled data available. These approaches aim to learn robust and generalizable representations without relying heavily on manually annotated datasets, which are often time-consuming and expensive to acquire.
6.4.1. Leveraging Unlabeled Data for Pre-Training and Fine-Tuning
One of the key challenges in STDR is the scarcity of labeled data, especially for low-resource languages or specialized domains. Self-supervised learning offers a solution by enabling models to learn from unlabeled data through pretext tasks designed to capture inherent structures and patterns in the data. For instance, contrastive learning has been successfully applied in STDR by training models to distinguish between similar and dissimilar image patches or text regions, thereby learning discriminative features without explicit supervision [
99,
100].
In the context of STDR, pre-training on large-scale unlabeled datasets using self-supervised methods can significantly improve the performance of downstream tasks, such as text detection and recognition. For example, Yu et al. [
99] proposed transforming a large-scale Contrastive Language-Image Pretraining (CLIP) model into a robust backbone for scene text detection and spotting, demonstrating that self-supervised pre-training can enhance the model’s ability to generalize across different datasets and tasks.
Similarly, Xie et al. [
100] introduced OCR-Text Destylization Modeling (ODM), a new pre-training method that transfers diverse styles of text found in images to a uniform style based on text prompts. ODM achieves better alignment between text and OCR-Text, enabling pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. This approach highlights the potential of self-supervised learning in improving the robustness and adaptability of STDR models.
6.4.2. Contrastive Learning and Pretext Tasks for STDR
Contrastive learning, a prevalent self-supervised learning technique, has demonstrated remarkable potential across a range of computer vision tasks, including STDR. By comparing positive and negative sample pairs, models can capture discriminative features that remain invariant to nuisance factors such as lighting conditions, viewpoints, and occlusions. In the context of STDR, contrastive learning can be employed to learn robust text representations by contrasting text images with their augmented counterparts or with non-text images, thereby enhancing the model’s ability to generalize across diverse scenarios [
6,
101].
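A minimal form of this idea is the NT-Xent (InfoNCE) loss over two augmented views of the same text-region crop, sketched below as a generic SimCLR-style objective rather than the loss of any specific STDR paper; the temperature value is an assumption.

```python
# NT-Xent loss: pull two views of the same crop together, push other crops away.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, d) embeddings of two augmented views of the same N text crops."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d), unit norm
    logits = z @ z.T / temperature                            # pairwise cosine similarities
    logits.fill_diagonal_(float("-inf"))                      # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(logits, targets)                   # positive = the other view
```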
Another significant approach within self-supervised learning is the utilization of pretext tasks, which involve designing surrogate tasks solvable with unlabeled data to compel the model into learning useful representations. For instance, in scene text recognition, pretext tasks may encompass predicting the relative positions of text patches within an image, reconstructing masked text regions, or classifying text styles and fonts. These tasks encourage the model to delve into the semantic and structural intricacies of text, ultimately benefiting downstream STDR tasks. Lyu et al. [
6] further exemplified this by proposing TextBlockV2, a scene text spotter that harnesses advanced Pre-trained Language Models (PLMs) for recognition without the need for precise detection. By fine-tuning a PLM on a large-scale OCR dataset and employing a straightforward detector for block-level text detection, TextBlockV2 achieves superior performance across multiple public benchmarks, highlighting the potential of combining self-supervised pre-training with fine-tuning on labeled data to attain state-of-the-art results in STDR.
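For instance, a masked-reconstruction pretext task of the kind mentioned above can be written as a simple loss over hidden patches; the sketch below assumes an encoder–decoder `model` that maps masked patch sequences back to pixel space, with the shapes and mask ratio chosen arbitrarily.

```python
# Masked-patch reconstruction pretext loss: hide a fraction of patches from a text
# crop and penalize the reconstruction error on the hidden patches.
import torch
import torch.nn as nn

def masked_reconstruction_loss(model: nn.Module, patches: torch.Tensor, mask_ratio: float = 0.5):
    """patches: (N, P, D) flattened image patches; model: masked patches -> (N, P, D)."""
    n, p, _ = patches.shape
    mask = torch.rand(n, p, device=patches.device) < mask_ratio   # True = hidden patch
    visible = patches * (~mask).unsqueeze(-1)                     # zero out hidden patches
    recon = model(visible)                                        # predicted patches
    # error terms come only from hidden patches, then are averaged
    return ((recon - patches) ** 2 * mask.unsqueeze(-1)).mean()
```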
In conclusion, self-supervised and unsupervised learning methodologies offer promising avenues for enhancing the performance and robustness of STDR models. By leveraging unlabeled data through contrastive learning and pretext tasks, these approaches facilitate the learning of rich and discriminative representations that generalize well across different datasets and tasks. As the field continues to advance, we anticipate the emergence of more innovative self-supervised learning techniques being applied to STDR, pushing the boundaries of what is achievable in this exciting domain.
7. Datasets and Evaluation Metrics
7.1. Datasets in Scene Text Detection and Recognition
Recent advances in scene text detection and recognition have been driven significantly by the availability of large-scale, diverse datasets.
Table 3 summarizes these datasets—including real-world and synthetic text collections—along with their download links. These resources are essential for model training and benchmarking, while also promoting algorithm development by encompassing challenges such as text variations, arbitrary orientations, scale diversity, and complex backgrounds.
7.1.1. Synthetic Datasets
- (1)
SynthText [
102]: SynthText is a large-scale synthetic dataset designed for scene text detection and recognition, comprising 858,750 synthetic scene images organized into 200 directories. The dataset contains 7,266,866 word instances and 28,971,487 characters, each annotated with precise word- and character-level bounding boxes. By leveraging a sophisticated synthetic engine, text is rendered onto natural scene backgrounds while accounting for local geometry, color, and texture, ensuring high realism. This dual-level annotation and realistic rendering make SynthText a versatile resource for training robust models in both text detection and recognition tasks, addressing the scarcity of annotated real-world data. Its scalability and automation further enhance its utility in large-scale deep learning applications.
Figure 12 illustrates representative samples from the SynthText dataset, demonstrating both the visual quality and annotation details.
- (2)
Synth90K [
97]: Synth90K (also known as MJSynth) comprises approximately 9 million synthetic images of cropped words rendered from a 90k-word English lexicon with a wide variety of fonts, distortions, and backgrounds. Its scale and stylistic diversity make it a standard pre-training corpus for scene text recognition models, complementing the full-image scenes provided by SynthText.
- (3)
UnrealText [
103]: UnrealText is an innovative scene text image synthesis engine that leverages 3D graphics engines, specifically Unreal Engine 4 (UE4), to generate highly realistic scene text images. Unlike traditional methods that embed text onto 2D background images, UnrealText renders text and scenes as a whole, achieving photo-realistic lighting conditions, natural occlusion, and perspective transformation. The engine employs a viewfinding algorithm to explore virtual scenes and generate diverse camera viewpoints, an environment randomization module to simulate real-world lighting variations, and a mesh-based text region generation method to identify suitable text embedding positions on 3D meshes. This approach ensures that the synthesized images not only look realistic but also provide precise scene information, which is crucial for training robust scene text detection and recognition models. UnrealText also generates a large-scale multilingual scene text dataset, aiding in the development of multilingual text detection and recognition systems. Furthermore, it re-annotates existing scene text recognition datasets in a case-sensitive manner, including punctuation marks, to facilitate more comprehensive evaluations.
Table 3.
Authoritative Datasets for Text Spotting Tasks. Lang: Language, Det: Detection, Rec: Recognition, Seg: Segmentation, LID: Language Identification; WC: Cropped Word Script Classification; JC: Joint Text Detection and Script Classification; E2E: End-to-End Detection and Recognition; Multi-9: Arabic, Bangla, Chinese, English, French, German, Italian, Japanese, and Korean; Multi-10: Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean; Arb.-Or: Arbitrary-oriented; Arb.-Sh: Arbitrary-shaped; Tr Imgs: Train Images, Ts Imgs: Test Images, Tt Imgs: Total Images.
NO | Datasets | Scene/Synth | Task | Lang | Orient | Tr Imgs | Ts Imgs | Tt Imgs | Year | Download URLs (Last Accessed) |
---|---|---|---|---|---|---|---|---|---|---|
1 | SignboardText [104] | Scene | Det+Rec | Ch | Arb.-Or | 1200 | 500 | 2000 | 2024 | https://github.com/aiclub-uit/SignboardText (accessed on 1 July 2025) |
2 | HierText [86] | Scene | Det | En | Arb.-Sh | 8281 | 1634 | 11,639 | 2022 | https://github.com/google-research-datasets/hiertext (accessed on 1 July 2025) |
3 | VinText [98] | Scene | Det+Rec | Vi | Arb.-Or | 1200 | 500 | 2000 | 2021 | https://github.com/VinAIResearch/dict-guided (accessed on 1 July 2025) |
4 | TextOCR [85] | Scene | Det+Rec | En | Arb.-Sh | 24,902 | 3232 | 28,134 | 2021 | https://textvqa.org/textocr/dataset/ (accessed on 1 July 2025) |
5 | CTW1500 [105] | Scene | Det+Rec | En | Arb.-Sh | 1000 | 500 | 1500 | 2019 | https://github.com/Yuliang-Liu/Curve-Text-Detector (accessed on 1 July 2025) |
6 | Shopsign [106] | Scene | Det+Rec | Ch+En | Arb.-Or | 20,738 | 5032 | 25,770 | 2019 | https://github.com/chongshengzhang/shopsign (accessed on 1 July 2025) |
7 | RRC-MLT-2019 [82] | Scene | WC+JC+E2E | Multi-10 | Arb.-Or | 10,000 | 10,000 | 20,000 | 2019 | https://rrc.cvc.uab.es/?ch=15&com=downloads (accessed on 1 July 2025) |
8 | ICDAR 2019-LSVT [107] | Scene | Det+Rec | Ch+En | Arb.-Or | 30,000 | 20,000 | 450,000 | 2019 | https://rrc.cvc.uab.es/?ch=16&com=downloads (accessed on 1 July 2025) |
9 | Text-RRC-ArT [108] | Scene | Det+Rec | Ch+En | Arb.-Sh | 5603 | 4563 | 10,166 | 2019 | https://rrc.cvc.uab.es/?ch=14&com=downloads (accessed on 1 July 2025) |
10 | ICPR-MTWI [109] | Scene | Det+Rec | Ch+En | Arb.-Or | 10,000 | 10,000 | 20,000 | 2018 | https://tianchi.aliyun.com/competition/entrance/231685/information (accessed on 1 July 2025) |
11 | CTW [48,110] | Scene | Rec | Ch | Arb.-Or | 25,877 | 3269 | 32,285 | 2018 | https://ctwdataset.github.io/ (accessed on 1 July 2025) |
12 | CTW [48,110] | Scene | Det | Ch | Arb.-Or | 25,877 | 3129 | 32,285 | 2018 | https://ctwdataset.github.io/ (accessed on 1 July 2025) |
13 | RCTW [111] | Scene | Det+Rec | Ch+En | Arb.-Or | 8034 | 4229 | 12,263 | 2017 | - |
14 | MLT2017 [112] | Scene | Det+LID | Multi-9 | Arb.-Or | 7200 | 9000 | 18,000 | 2017 | https://rrc.cvc.uab.es/?ch=8 (accessed on 1 July 2025) |
15 | Total-Text [14] | Scene | Det+Rec | En+Ch+Jp | Arb.-Sh | 1255 | 300 | 1555 | 2017 | https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset (accessed on 1 July 2025) |
16 | SynthText [102] | Synth | Det+Rec | En | Arb.-Or | - | - | 858,750 | 2016 | https://www.robots.ox.ac.uk/~vgg/data/scenetext/ (accessed on 1 July 2025) |
17 | COCO-Text [113] | Scene | Det+Rec | En | Arb.-Or | 43,686 | 10,000 | 63,686 | 2016 | https://bgshih.github.io/cocotext/ (accessed on 1 July 2025) |
18 | DOST [114] | Scene | Det+Rec | Jp | Arb.-Or | - | - | 32,147 | 2016 | https://rrc.cvc.uab.es/?ch=7&com=downloads (accessed on 1 July 2025) |
19 | ICDAR 2015 [13] | Scene | Det+Rec | En | Arb.-Or | 1000 | 500 | 1500 | 2015 | https://rrc.cvc.uab.es/?ch=4&com=downloads (accessed on 1 July 2025) |
20 | USTB-SV1K [115] | Scene | Det | En | Arb.-Or | 500 | 500 | 1000 | 2015 | - |
21 | HUST-TR400 [116] | Scene | Det | En | Arb.-Or | - | - | 400 | 2014 | - |
22 | CUTE80 [117] | Scene | Det+Rec | En | Arb.-Sh | - | 80 | 80 | 2014 | https://github.com/mohtashim-nawaz/Cute80-Dataset/tree/master/CUTE80 (accessed on 1 July 2025) |
23 | ICDAR 2013 [118] | Scene | Det+Rec | En | Horizontal | 229 | 233 | 462 | 2013 | https://rrc.cvc.uab.es/?ch=2&com=downloads (accessed on 1 July 2025) |
24 | MSRA-TD500 [119] | Scene | Det | En+Ch | Arb.-Or | 300 | 200 | 500 | 2012 | http://www.iapr-tc11.org/dataset/MSRA-TD500/MSRA-TD500.zip (accessed on 1 July 2025) |
25 | KAIST [120] | Scene | Det+Rec+Seg | Eng+Ko | Horizontal | - | - | 3000 | 2010 | - |
26 | ICDAR 2003 [121] | Scene | Det+Rec | En | Horizontal | 258 | 251 | 509 | 2003 | http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2003_Robust_Reading_Competitions |
The strategic importance of synthetic datasets emerges from their capacity to systematically address three fundamental challenges in scene text analysis: (1) controlled variation sampling for robust feature learning, (2) precise annotation scalability unobtainable through manual labeling, and (3) parametric control over the reality gap. When used as pre-training corpora, SynthText and its successors provide a breadth of appearance variation that limited natural datasets cannot match. This synthetic-to-real transfer paradigm, exemplified by the progressive improvements from SynthText to UnrealText, demonstrates that carefully designed synthetic data can not only compensate for data scarcity but even exceed the representational diversity of natural image collections.
7.1.2. Real-World Datasets
Real-world datasets play a pivotal role in assessing the generalization capabilities of models that have been trained on synthetic data. These datasets offer a more accurate reflection of how models will perform in practical applications. Several notable real-world datasets have been instrumental in advancing the field of scene text detection and recognition:
- (1)
ICDAR 2003 (IC03) [
121]: As one of the pioneering datasets in the field, ICDAR 2003 primarily concentrated on horizontal text detection and recognition in natural scenes. It served as a foundational benchmark for subsequent research in scene text reading. The dataset comprised 258 images in the training set and 251 in the test set, totaling 509 images. Each text instance was annotated at the character and word levels, with additional annotations for text line segmentation in some cases. The competition aimed to establish a large, ground-truthed text-in-scene dataset, design standard formats for datasets and recognition results, and conduct competitions to evaluate the state-of-the-art in robust reading systems. ICDAR 2003 played a pivotal role in identifying the challenges and opportunities in scene text detection and recognition, setting the stage for the development of more advanced datasets and algorithms.
- (2)
ICDAR 2013 (IC13) [
118]: The ICDAR 2013 dataset, an extension of earlier ICDAR competition datasets, focuses on horizontal text detection and recognition in natural scenes. Comprising 229 training and 233 testing images, it provides bounding box annotations and transcriptions at both word and character levels, enabling detailed performance evaluation. While limited to English text with horizontal or near-horizontal orientations, the dataset remains a benchmark due to its challenging variations in text scale, lighting conditions, and background complexity. Its contributions have been instrumental in advancing horizontal text detection methods, paving the way for addressing more intricate scenarios in scene text analysis.
Figure 13 illustrates representative samples from the ICDAR 2013 dataset, demonstrating both the visual quality and annotation details.
- (3)
ICDAR 2015 (IC15) [
13]: The ICDAR 2015 dataset is a widely used benchmark for evaluating multi-oriented scene text detection algorithms, particularly in incidental text scenarios. Derived from real-world images captured via Google Glass, this dataset consists of 1000 training and 500 testing images featuring significant variations in text scale, resolution, and orientation. Each word-level annotation is provided as a quadrilateral bounding box, accurately reflecting the geometric diversity of natural scene text. These challenging characteristics, including arbitrary orientations and complex backgrounds, make ICDAR 2015 particularly valuable for assessing model robustness in practical applications. As shown in
Figure 14, the dataset’s representative samples demonstrate the typical challenges posed by real-world text detection scenarios, establishing it as an essential testbed for advancing scene text detection research.
- (4)
COCO-Text (CT) [
113]: COCO-Text is a pioneering large-scale dataset for advancing text detection and recognition in natural images, comprising 63,686 images with 145,859 text instances. The dataset is systematically divided into 43,686 training images (118,309 text instances), 10,000 validation images (27,550 text instances), and 10,000 test images (annotations not publicly available). Each text instance is precisely annotated with bounding boxes and key attributes, including text type (machine-printed or handwritten) and legibility. This comprehensive annotation scheme enables the development of models capable of accurately identifying and interpreting text across diverse conditions, effectively bridging the gap between synthetic data pre-training and real-world deployment. Given its extensive scale, diverse text types, and varied contextual appearances, COCO-Text serves as an indispensable benchmark for researchers aiming to enhance the robustness and generalizability of their scene text detection and recognition models.
- (5)
DOST [
114]: Comprising a vast collection of sequential images captured in downtown Osaka using an omnidirectional camera, DOST offers a unique dataset tailored for evaluating models’ capabilities in handling uncontrolled, real-world scene text. With 32,147 manually ground-truthed images containing 935,601 text regions, DOST provides a comprehensive benchmark for assessing models’ robustness in diverse and challenging urban environments. The dataset includes both legible and illegible text regions, reflecting the variability found in real-world scenarios, making it an invaluable resource for researchers aiming to develop systems capable of accurately detecting and recognizing text in complex, unstructured settings.
- (6)
RCTW [
111]: The RCTW (Reading Chinese Text in the Wild) dataset, introduced in the ICDAR 2017 competition, is a large-scale and challenging benchmark specifically designed for evaluating scene text reading systems targeting Chinese text. Comprising 12,263 images with detailed annotations, including the location and transcription of every text instance, RCTW covers a wide range of image sources, text fonts, layouts, and languages (Chinese and English). The dataset is split into a training/validation set of 8034 images and a test set of 4229 images, all annotated at the text-line level using polygons. RCTW sets up two main tasks: text localization and end-to-end recognition, providing a comprehensive platform for assessing models’ capabilities in detecting and recognizing Chinese text in natural images. Its emphasis on Chinese text, which has unique characteristics such as a larger character set and the absence of word spaces, makes RCTW an invaluable resource for researchers working on Chinese OCR and scene text reading technologies.
- (7)
CTW [
48,
110]: Designed specifically for Chinese text analysis in natural scenes, CTW (Chinese Text in the Wild) contains 32,285 street view images with 1,018,402 annotated Chinese characters. The dataset is partitioned into training (25,887 images), recognition testing (3269 images), and detection testing (3129 images) sets at an 8:1:1 ratio, ensuring images from the same street are grouped within the same set to prevent data leakage. It features 3850 unique character categories with comprehensive annotations including bounding boxes and six key attributes (occlusion, background complexity, distortion, 3D effect, wordart, and handwriting) to facilitate robust evaluation of text detection and recognition algorithms under diverse real-world conditions. The dataset’s large-scale, carefully curated test splits and detailed annotations make it particularly valuable for benchmarking model performance on challenging Chinese text scenarios.
- (8)
ICPR-MTWI [
109]: ICPR-MTWI is a groundbreaking large-scale dataset designed for multi-type web text reading, comprising 20,000 web images with a primary focus on Chinese and English text. This dataset stands out due to its diverse representation of text instances, encompassing fancy advertising fonts, multilingual texts, and complex-shaped text such as distorted and curved variations. Each text instance is precisely annotated with quadrangles and corresponding transcriptions, facilitating accurate model training and evaluation. The dataset is balanced with 10,000 images for training and 10,000 for testing, providing a robust benchmark for assessing model performance in handling the inherent challenges of web text reading. Given its rich variety of text types and complex layouts, which closely mirror real-world web scenarios, ICPR-MTWI serves as an indispensable resource for researchers striving to advance the state-of-the-art in web text reading technology.
- (9)
ICDAR 2019-LSVT [
107,
122]: ICDAR 2019-LSVT stands as a monumental dataset within the realm of Chinese scene text recognition, comprising an extensive collection of 450,000 images sourced from Chinese streets, thereby establishing itself as the largest dataset of its kind to date. This dataset uniquely combines 50,000 fully annotated images, divided into 30,000 for training and 20,000 for testing, with an additional 400,000 weakly annotated images. The fully annotated subset encompasses a wide spectrum of text orientations, including horizontal, multi-oriented, vertical, and curved text, annotated with precise quadrilateral bounding boxes or polygons with 8 or 12 vertices, ensuring comprehensive coverage of diverse text instances. In contrast, the weakly annotated images provide keyword transcriptions without explicit location annotations, offering a cost-effective approach to dataset expansion. During the competition, the test set was strategically divided into two segments for release. Notably, ICDAR 2019-LSVT is over 14 times larger than existing robust reading benchmarks and sets a precedent by incorporating partial annotations within ICDAR challenges. This dataset provides a robust evaluation platform for assessing models’ capabilities in handling large-scale, real-world street view text, effectively bridging the gap between academic research and industrial applications.
Figure 15 showcases illustrative examples of images and annotations from the ICDAR 2019-LSVT dataset, highlighting its versatility and utility.
- (10)
Text-RRC-ArT (ArT) [
108]: As a comprehensive benchmark for arbitrary-shaped text research, Text-RRC-ArT integrates and extends datasets like Total-Text and SCUT-CTW1500, comprising 10,166 images (5603 training, 4563 testing) with diverse text formations including curved and irregular layouts. Its polygon annotations precisely capture complex text geometries, advancing beyond traditional bounding boxes. The dataset establishes three core tasks: detection (localizing arbitrary-shaped text), recognition (transcribing cropped text), and spotting (end-to-end detection and recognition), evaluated through rigorous metrics like IoU thresholds and Normalized Edit Distance. Notably, the competition attracted 46 teams, with top methods leveraging segmentation and attention mechanisms, demonstrating the dataset’s role in driving progress in robust text understanding systems. This benchmark significantly contributes to addressing the challenges of real-world arbitrary-shaped text scenarios.
- (11)
TextOCR [
85]: A pioneering large-scale dataset designed for arbitrary-shaped scene text detection and recognition, TextOCR stands out by providing high-quality, densely annotated word-level annotations on real-world images. Collected from the TextVQA dataset, TextOCR comprises over 28,000 images and more than 900,000 annotated words, averaging 32 words per image, making it significantly larger and denser than existing datasets. This dataset not only facilitates the training of advanced OCR models but also enables end-to-end training of downstream applications, such as Visual Question Answering (VQA) and Image Captioning, by allowing feedback from these applications to refine the OCR pipeline. Unlike many previous datasets that focus on specific text orientations or shapes, TextOCR encompasses a wide variety of text configurations, including horizontal, vertical, multi-oriented, and curved text, providing a comprehensive benchmark for evaluating the robustness and adaptability of scene text recognition models. Additionally, TextOCR’s annotations are performed with polygon annotations for detection and character-level modeling for recognition, ensuring precise representation and accurate recognition of complex text layouts. The introduction of TextOCR has significantly advanced the field of OCR and scene text understanding by offering a rich resource for training and evaluating models that can handle the challenges of real-world text recognition tasks.
- (12)
CUTE80 [
117]: CUTE80 is a specialized dataset crafted to tackle the challenge of detecting arbitrarily-shaped (curved) text in natural scene images. Comprising 80 meticulously selected images, it encompasses diverse complex backgrounds, perspective distortions, and low-resolution scenarios, with text lines arranged in circular, S, Z, and other non-linear patterns. Each image is annotated with precise polygon points to accurately delineate the bounding regions of curved text, thereby facilitating a rigorous evaluation of text detection algorithms. As a pivotal benchmark, CUTE80 is instrumental in assessing the robustness and adaptability of text detection systems, especially those tailored for unconventional text orientations and configurations. By offering a dedicated platform for evaluating curved text detection, CUTE80 significantly advances the state-of-the-art in natural scene text recognition, fostering the development of more versatile and real-world-applicable models. The representative samples in
Figure 16 underscore its emphasis on visual diversity, further validating its comprehensive nature and reinforcing its status as a critical benchmark for driving scene text detection technologies towards practical deployment through rigorous model evaluation and enhancement.
- (13)
HierText [
86]: As the first dataset designed to unify scene text detection and geometric layout analysis, HierText introduces hierarchical annotations at word, line, and paragraph levels across 11,639 high-resolution natural images (8281 for training, 1724 for validation, and 1634 for testing). With an average of 103.8 words per image—approximately three times denser than prior datasets—HierText captures a diverse range of layouts, including curved text clusters and non-uniform typography. Polygon masks precisely annotate arbitrarily shaped text, while grouping labels facilitate the joint modeling of detection and semantic-geometric relationships. The benchmark proposes two primary tasks: (1) instance segmentation of text entities at the word or line level and (2) layout analysis via paragraph-level clustering. These tasks are evaluated using the Panoptic Quality (PQ) metric, which unifies segmentation and grouping metrics. The accompanying end-to-end
Unified Detector achieves state-of-the-art (SOTA) results, with 62.23 PQ for text detection and 53.40 PQ for layout analysis, all without complex post-processing. This demonstrates HierText’s utility in bridging the gap between scene text understanding and document layout analysis, advancing research in holistic text extraction for applications such as visual question answering and image translation.
- (14)
Union14M [
123]: Union14M is a comprehensive, large-scale dataset tailored for scene text recognition (STR) research, aiming to bridge the gap between existing benchmarks and the complexities of real-world scenarios. It consists of two main subsets: Union14M-L, a labeled portion comprising 3,230,742 images for training and 400,000 images for validation, along with a dedicated benchmark set of 409,383 images, and Union14M-U, an unlabeled subset with 10 million images. The labeled subset, Union14M-L, encompasses a diverse array of text images featuring curved, tilted, and vertical layouts, as well as challenging conditions such as blurring, complex backgrounds, and occlusion, thereby offering a rich and varied representation of real-world text scenarios. Meanwhile, the unlabeled subset, Union14M-U, provides a valuable resource for leveraging self-supervised learning techniques, enabling the development of more robust and adaptable STR models. By combining labeled and unlabeled data, Union14M not only captures a broader spectrum of real-world text situations but also serves as a critical stepping stone towards advancing STR research, particularly in addressing the intricacies of real-world text recognition tasks.
In contrast, synthetic datasets like SynthText [
102] offer a controlled environment for generating large amounts of annotated data, which is crucial for training deep learning models. SynthText includes approximately 860,000 images, with most text instances being multi-oriented and annotated with word- and character-level rotated bounding boxes, as well as text sequences. While synthetic datasets are invaluable for pre-training models, real-world datasets like those mentioned above are essential for validating and refining these models to ensure they perform well in practical applications.
7.1.3. Multilingual Datasets
In the realm of scene text detection and recognition, the global prevalence of text in various languages and scripts underscores the critical importance of multilingual datasets. These datasets serve as essential resources for advancing technologies capable of handling text diversity. Below, we highlight several prominent multilingual datasets that have significantly contributed to this field:
- (1)
MLT2017 [
112]: MLT2017 is a large-scale, multilingual, and multi-oriented scene text dataset meticulously curated to evaluate model performance in detecting and identifying text scripts within natural images. Comprising 7200 training images, 9000 testing images, and 1800 validation images, it encompasses a wide variety of text instances differing in language (including Arabic, Bangla, Chinese, English, French, German, Italian, Japanese, and Korean), script, font, size, color, and orientation. This closely reflects the complexities of real-world scenarios, making MLT2017 an essential benchmark for assessing the robustness and adaptability of scene text detection and script identification algorithms. Its comprehensive annotations and challenging text instances facilitate rigorous testing and validation of models aimed at achieving high accuracy and efficiency in diverse and complex text environments.
- (2)
MSRA-TD500 [
119]: The MSRA-TD500 dataset stands as a foundational benchmark for evaluating scene text detection algorithms under complex real-world conditions. Unlike traditional datasets focusing solely on horizontal or uniformly oriented text, MSRA-TD500 specifically addresses the challenges posed by arbitrarily oriented and multilingual long text lines, which are commonplace in natural images but frequently neglected in model assessments. With 300 training and 200 test images, the dataset offers remarkable diversity in text attributes, encompassing language (Chinese-English bilingualism), font styles, size discrepancies, color contrasts, and orientations spanning horizontal, vertical, and curved layouts. This intentional design accurately reflects the inherent complexities of real-world text occurrences, such as those found on street signs, shopfronts, or in document archives, where text may appear in varied orientations or mixed scripts. The inclusion of such demanding scenarios makes MSRA-TD500 essential for evaluating the robustness (e.g., rotation and scale invariance) and adaptability (e.g., cross-lingual generalization) of detection algorithms. Its precise annotations, combined with visually cluttered backgrounds and overlapping text regions, enable thorough validation of models aiming for both high detection accuracy and computational efficiency in noisy environments. Representative examples in
Figure 17 further highlight the dataset’s focus on visual diversity, solidifying its position as a critical benchmark for driving scene text detection technologies toward practical applications.
- (3)
KAIST [
120]: The KAIST dataset is a meticulously curated collection comprising 3000 scene images captured under a wide range of lighting conditions, spanning both indoor and outdoor environments. These images were acquired using a combination of high-resolution digital cameras and low-resolution mobile phones, ensuring diversity in image quality. Notably, this dataset is multilingual, incorporating both Korean and English texts, thereby broadening its applicability across different linguistic contexts. Each image in the dataset is annotated with precision, utilizing horizontally aligned rectangular bounding boxes at the word level, along with binary masks for individual characters. These annotations facilitate crucial tasks such as text localization and character segmentation, providing a solid foundation for evaluating model performance. The KAIST dataset serves as an invaluable benchmark for assessing model robustness in real-world applications, where varying lighting, resolution, and language scenarios are prevalent. Representative samples from the dataset, as illustrated in
Figure 18, highlight its emphasis on visual diversity, reinforcing its pivotal role in advancing scene text detection technologies towards practical deployment.
- (4)
RRC-MLT-2019 [
82]: RRC-MLT-2019 emerges as a pivotal and comprehensive benchmark within the realm of multilingual datasets, meticulously crafted to propel the frontiers of multilingual text detection and recognition. This dataset encompasses 20,000 real natural scene images, featuring text in 10 distinct languages—Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean—thereby offering a rich tapestry of linguistic diversity. It is systematically divided into 10,000 images for training and 10,000 for testing, with word-level annotations using 4-corner bounding boxes to ensure precise localization. Additionally, a synthetic dataset comprising 277,000 images, mirroring the scripts of the real dataset, is provided to augment training efforts. RRC-MLT-2019 challenges researchers with four demanding tasks: text detection, cropped word script classification, joint text detection and script classification, and end-to-end text detection and recognition. This multifaceted benchmark not only assesses model performance across varied languages and scripts but also fosters the creation of resilient systems capable of tackling the complexities inherent in real-world multilingual scene text. Representative samples, as depicted in
Figure 19, highlight the dataset’s emphasis on visual diversity, cementing its status as a crucial benchmark for advancing scene text detection technologies towards practical, real-world deployment.
Among these datasets, the ICDAR 2019 Multilingual Scene Text Detection and Recognition (MLT) dataset, represented here by RRC-MLT-2019, stands out as a particularly formidable benchmark. It encompasses a broad spectrum of languages and features both horizontal and multi-oriented text instances, presenting a significant challenge for models to handle diverse scripts and languages. This dataset pushes the boundaries of scene text understanding, compelling models to generalize across linguistic boundaries and enhancing their applicability in real-world scenarios that span diverse geographical and cultural contexts.
7.1.4. Specialized Datasets
In the realm of scene text detection and recognition, while general-purpose datasets serve as invaluable resources for training and evaluating models across a broad spectrum of scenarios, specialized datasets play a pivotal role in addressing specific challenges encountered in real-world applications. These datasets are meticulously curated to reflect the nuances and complexities of particular domains, thereby enabling researchers to develop and refine models that excel in niche contexts. Below, we highlight several specialized datasets that have significantly contributed to advancing the state-of-the-art in scene text technology.
- (1)
USTB-SV1K [
115]: USTB-SV1K is a carefully curated dataset designed to assess the capability of models in managing complex, real-world text layouts encountered in multi-oriented scene text images. These images were directly sourced from Google Street View, presenting a challenging yet representative benchmark for evaluating performance in such scenarios. The dataset comprises 1000 street view images, evenly split into 500 for training and 500 for testing, collected from six major US cities. Notably, a significant portion (approximately 28%) of the text instances within the dataset are small-scale and blurry, which adds to its complexity and realism. Each text instance is annotated with word-level, multi-oriented bounding boxes, making USTB-SV1K a valuable resource for researchers focused on multi-oriented scene text detection and recognition tasks. Representative samples depicted in
Figure 20 highlight the dataset’s emphasis on visual diversity, underscoring its pivotal role as a critical benchmark for advancing scene text detection technologies towards practical applications.
- (2)
HUST-TR400 [
116]: HUST-TR400 is a carefully constructed dataset comprising 400 images. Each image features Arabic numerals and English letters presented in diverse fonts, scales, colors, and orientations. A distinguishing feature of this dataset lies in its unique text line-level annotations, which enable it to serve as a comprehensive benchmark for assessing scene text detection and recognition systems, especially those designed to handle multi-oriented texts. By incorporating a wide range of font styles and text orientations, HUST-TR400 closely approximates real-world scenarios, thereby offering a valuable resource for improving the robustness and adaptability of models in complex textual environments. Representative samples depicted in
Figure 21 highlight the dataset’s emphasis on visual diversity, further solidifying its position as a crucial benchmark for advancing scene text detection technologies towards practical applications.
- (3)
Total-Text [
14]: Another comprehensive dataset tailored for arbitrary-shaped scene text detection, Total-Text encompasses a diverse array of text configurations, including horizontal, multi-oriented, and curved text instances. This dataset is meticulously curated to evaluate models’ proficiency in managing intricate text layouts in real-world applications. With 1255 training images and 300 test images, each text instance within Total-Text is annotated at the word level using a polygon with 2N vertices, allowing for precise representation of various text shapes. Unlike CTW1500, which predominantly focuses on curved text lines, Total-Text incorporates a broader spectrum of text orientations and shapes, making it an indispensable benchmark for assessing the robustness and adaptability of scene text detection models.
- (4)
CTW1500 [
105]: CTW1500 is a specialized dataset tailored for arbitrary-shaped scene text detection, with a particular focus on curved text. This dataset provides a challenging benchmark for assessing models’ capabilities in handling complex text layouts encountered in real-world scenarios. With 1000 training images and 500 test images, each containing at least one curved text line annotated with a 14-vertex polygon, CTW1500 facilitates rigorous testing and validation of algorithms designed for detecting and recognizing text with intricate shapes. The dataset’s predominantly Chinese and English text, extracted from various sources, makes it highly representative of real-world text scenarios. Representative samples illustrated in
Figure 22 further underscore the dataset’s emphasis on visual diversity, reinforcing its role as a critical benchmark for advancing scene text detection technologies toward practical deployment.
- (5)
VinText [
98]: As the most extensive dataset for Vietnamese scene text recognition, VinText comprises 2000 annotated images containing 56,084 text instances, each annotated with quadrilateral bounding boxes and corresponding character-level transcriptions. The dataset is systematically partitioned into training (1200 images), validation (300 images), and testing (500 images) subsets to support rigorous model development and evaluation. Notably, VinText addresses a critical gap in Vietnamese text recognition research by providing large-scale, high-quality annotations that capture the linguistic characteristics and visual diversity of Vietnamese scene text. The quadrilateral annotation format enables precise localization while maintaining compatibility with mainstream detection frameworks. This dataset serves as an essential benchmark for advancing Vietnamese text recognition systems and facilitates cross-linguistic comparisons in scene text research.
- (6)
SignboardText [
104]: This dataset provides a comprehensive benchmark for signboard text analysis, featuring 2104 in-the-wild scene images with 79,814 manually annotated text instances at both line level and word level. The dataset captures the complexity of real-world signage, including diverse fonts, sizes, artistic styles, and languages against cluttered backgrounds. Notably, the average image contains 40 words, with 79.42% of words comprising 1–5 characters and 95.54% having fewer than 10 characters; longer words typically represent URLs or email addresses on signboards. By addressing these real-world text variations, SignboardText serves as a critical resource for advancing robust text detection and recognition models in applications like urban scene understanding and digital accessibility.
- (7)
Shopsign [
106,
124]: ShopSign dataset serves as a large-scale benchmark tailored for Chinese shop sign text analysis, encompassing 25,770 street-view images annotated with 196,010 text-lines. It is strategically divided into 20,738 training images (Train1) and 5032 testing images (Test1), with the test set meticulously crafted to evaluate real-world performance. Notably, the test set includes 2516 image pairs captured from diverse perspectives, enabling rigorous assessment of algorithm robustness in handling both horizontal and multi-oriented text. This dataset presents a myriad of challenges, including varying scales, orientations, lighting conditions, and five specifically categorized difficult cases: mirror, wooden, deformed, exposed, and obscure text. With comprehensive coverage of 4072 unique Chinese characters and the inclusion of night scenes, ShopSign stands as an indispensable resource for developing robust text detection and recognition systems tailored for complex urban environments.
In addition to the datasets mentioned above, specialized datasets like these are instrumental in advancing scene text detection and recognition technology. They enable researchers to fine-tune models to perform optimally in specific applications or scenarios where general-purpose datasets might fall short in providing sufficient coverage or diversity. By leveraging these datasets, researchers can push the boundaries of what is possible in scene text technology, developing solutions that are not only more accurate but also more adaptable to the unique demands of real-world applications.
7.2. Evaluation Metrics for Scene Text Detection
For scene text detection, the most commonly used metrics are Precision (P), Recall (R), and F-measure (F). Precision measures the proportion of detected text regions that are truly text, while Recall measures the proportion of actual text regions that are successfully detected. The F-measure is the harmonic mean of Precision and Recall, providing a balanced assessment of a model’s performance. Additionally, the Intersection over Union (IoU) threshold is often used to determine whether a detected bounding box matches a ground truth box.
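As a concrete reference, the standard definitions can be written in terms of the numbers of true positives ($TP$), false positives ($FP$), and false negatives ($FN$) counted at a chosen IoU threshold; these are the textbook formulas rather than any dataset-specific protocol:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}$$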
7.2.1. DetEval Evaluation Criteria
For the ICDAR2013 dataset, the DetEval evaluation standard [
125] has been adopted in the ICDAR competitions of 2013, 2015, 2017, and 2019 to accurately assess the performance of scene text detection algorithms. DetEval considers three types of matching relationships between detected bounding boxes and ground truth bounding boxes: one-to-one, one-to-many, and many-to-one, as illustrated in
Figure 23.
For any image in the ICDAR2013 test set, let $G$ denote the set of ground truth bounding boxes and $D$ denote the set of detected bounding boxes obtained by the algorithm. A recall overlap matrix $\sigma$ and a precision overlap matrix $\tau$ are constructed from $G$ and $D$, where the row index $i$ represents the $i$-th ground truth bounding box $G_i$ and the column index $j$ represents the $j$-th detected bounding box $D_j$. The overlap measures $\sigma_{ij}$ and $\tau_{ij}$ between $G_i$ and $D_j$ are defined as:
$$\sigma_{ij} = \frac{\operatorname{Area}(G_i \cap D_j)}{\operatorname{Area}(G_i)}, \qquad \tau_{ij} = \frac{\operatorname{Area}(G_i \cap D_j)}{\operatorname{Area}(D_j)}$$
If $G_i$ and $D_j$ intersect, $\sigma_{ij}$ and $\tau_{ij}$ take non-zero values; otherwise, they are set to 0. However, two bounding boxes are considered matched only if their overlap satisfies the quantitative constraints, i.e., the recall $\sigma_{ij}$ and precision $\tau_{ij}$ both exceed their respective thresholds:
$$\sigma_{ij} \geq t_r \quad \text{and} \quad \tau_{ij} \geq t_p$$
where $t_r$ is the recall threshold and $t_p$ is the precision threshold. In the evaluation system, $t_r$ and $t_p$ are set to 0.8 and 0.4, respectively. The specific definitions of the different matching types are as follows:
- (1)
One-to-one matching: A ground truth bounding box $G_i$ and a detected bounding box $D_j$ are considered a one-to-one match if the $i$-th row and the $j$-th column of the overlap matrices $\sigma$ and $\tau$ contain only one element satisfying Equation (16).
Figure 23a illustrates this type of matching.
- (2)
One-to-many matching (splitting): A ground truth bounding box $G_i$ matches a set of detected bounding boxes $\{D_j\}_{j \in S}$ if a sufficiently large proportion of $G_i$ is detected, i.e., $\sum_{j \in S} \sigma_{ij} \geq t_r$, and each detected bounding box is considered part of $G_i$ with a sufficiently large overlap area, i.e., $\tau_{ij} \geq t_p$ for every $j \in S$.
Figure 23b shows this type of matching.
- (3)
Many-to-one matching (merging): A detected bounding box $D_j$ matches a set of ground truth bounding boxes $\{G_i\}_{i \in S}$ if each ground truth bounding box in the set has a sufficiently large proportion detected, i.e., $\sigma_{ij} \geq t_r$ for every $i \in S$, and the total precision of the detected area is sufficiently large, i.e., $\sum_{i \in S} \tau_{ij} \geq t_p$.
Figure 23c demonstrates this type of matching.
Based on the analysis and explanation of these three matching types, the recall and precision can be defined as follows:
$$R_{\mathrm{DetEval}} = \frac{\sum_{i} \mathrm{Match}_{G}(G_i)}{|G|}, \qquad P_{\mathrm{DetEval}} = \frac{\sum_{j} \mathrm{Match}_{D}(D_j)}{|D|}$$
The functions $\mathrm{Match}_{G}$ and $\mathrm{Match}_{D}$ account for the three different matching scenarios and evaluate the quality of the matches:
$$\mathrm{Match}_{G}(G_i) = \begin{cases} 1 & \text{if } G_i \text{ matches a single detection} \\ 0 & \text{if } G_i \text{ matches no detection} \\ f_{sc}(k) & \text{if } G_i \text{ matches a set of } k \text{ detections} \end{cases}$$
$$\mathrm{Match}_{D}(D_j) = \begin{cases} 1 & \text{if } D_j \text{ matches a single ground truth box} \\ 0 & \text{if } D_j \text{ matches no ground truth box} \\ f_{sc}(k) & \text{if } D_j \text{ matches a set of } k \text{ ground truth boxes} \end{cases}$$
where $f_{sc}(k)$ is a penalty function applied to one-to-many and many-to-one matches. In the evaluation system, $f_{sc}(k)$ is set to a constant value of 0.8.
The DetEval metric has these key characteristics:
Advantages: Its flexible matching system handles one-to-one, one-to-many, and many-to-one correspondences, enabling precise evaluation of complex text layouts. This makes it ideal for applications requiring accurate text localization.
Disadvantages: However, this sophisticated matching approach increases implementation complexity, potentially challenging users who prefer simpler evaluation methods.
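For readers who want to experiment with the matching rules above, the sketch below computes the $\sigma$ and $\tau$ matrices for axis-aligned boxes and extracts one-to-one matches with the thresholds quoted in the text; it is a simplified illustration, not the reference DetEval implementation (which also handles the split and merge cases and the $f_{sc}$ penalty).

```python
# Simplified DetEval-style overlap matrices and one-to-one matching.
import numpy as np

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])       # box = (x1, y1, x2, y2)

def inter_area(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih

def deteval_matrices(G, D):
    """sigma[i, j] = |G_i ∩ D_j| / |G_i|, tau[i, j] = |G_i ∩ D_j| / |D_j|."""
    sigma = np.zeros((len(G), len(D)))
    tau = np.zeros((len(G), len(D)))
    for i, g in enumerate(G):
        for j, d in enumerate(D):
            inter = inter_area(g, d)
            sigma[i, j] = inter / area(g) if area(g) > 0 else 0.0
            tau[i, j] = inter / area(d) if area(d) > 0 else 0.0
    return sigma, tau

def one_to_one_matches(sigma, tau, t_r=0.8, t_p=0.4):
    """Pairs (i, j) whose row and column each contain exactly one qualifying entry."""
    ok = (sigma >= t_r) & (tau >= t_p)
    return [(i, j) for i, j in zip(*np.nonzero(ok))
            if ok[i, :].sum() == 1 and ok[:, j].sum() == 1]
```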
7.2.2. MSRA-TD500 Evaluation Criteria
In the literature [119], a minimum rectangular region is adopted as the basic unit for evaluating the performance of text detection algorithms. The authors propose an evaluation standard capable of assessing algorithms designed for detecting arbitrarily oriented long text lines. The calculation process is illustrated in Figure 24. As depicted in Figure 24c, there exists a certain angle between the detected bounding box $D$ and the ground truth bounding box $G$, making it challenging to directly compute the overlap between $D$ and $G$. To address this issue, the authors convert the problem of calculating the overlap between $D$ and $G$ into calculating the overlap between the rotated (axis-aligned) bounding boxes $D'$ and $G'$ of $D$ and $G$, respectively:
$$\operatorname{overlap}(D, G) = \frac{\operatorname{Area}(D' \cap G')}{\operatorname{Area}(D' \cup G')}$$
Due to the angle between $D$ and $G$, there may be discrepancies between $D'$ and $G'$. To mitigate these differences, the authors adopt a relatively lenient evaluation method, similar to that used in the object detection task [126]. The criterion for determining whether a scene text instance is correctly detected is based on the overlap between $D$ and $G$: specifically, if the angle between $D$ and $G$ is less than $\pi/8$ and their overlap is greater than 0.5, the text instance is considered correctly detected. In cases where multiple detected bounding boxes correctly match a single ground truth bounding box, only the detected bounding box with the highest overlap is considered a valid match.
For any scene image in the MSRA-TD500 test set, the precision ($P$), recall ($R$), and F-measure ($F$) of the detection are defined as follows:
$$P = \frac{|TP|}{|D|}, \qquad R = \frac{|TP|}{|G|}, \qquad F = \frac{2PR}{P + R}$$
where $|TP|$ represents the number of bounding boxes correctly detected by the algorithm, $|G|$ denotes the number of ground truth bounding boxes, and $|D|$ indicates the number of bounding boxes detected by the algorithm.
To handle challenging scenarios in scene images, such as low-resolution, blurred, or occluded text, the authors introduce a flexible mechanism for evaluating algorithm performance. In extreme cases, this mechanism allows the text detection algorithm to miss challenging instances without incurring penalties. This approach ensures that the algorithm’s performance is not unduly affected by the presence of difficult-to-detect text.
The MSRA-TD500 evaluation metric has these key characteristics:
Advantages: Its rotated box overlap criterion with an angular constraint effectively handles multi-oriented text, while the flexible scoring accommodates challenging cases (blurred/occluded text) without excessive penalties, better matching real-world conditions.
Disadvantages: The 0.5 overlap threshold may exclude practically useful partial detections. While suitable for straight text lines, the rectangular region approach lacks precision for curved or irregular text instances.
7.2.3. IoU Evaluation Metric
For arbitrarily oriented (ICDAR2015 [
13]) and arbitrarily shaped (CTW1500 [
84], Total-text [
14]) scene text datasets, the Intersection over Union (IoU) metric is commonly adopted to evaluate algorithm performance. For any image in the test set, let D denote the set of detected text region bounding boxes and G represent the set of ground truth bounding boxes. An overlap matrix of size $m \times n$ is constructed, where $m$ and $n$ indicate the number of ground truth boxes and detected boxes, respectively. In this matrix, each row index $i$ corresponds to the i-th ground truth box $G_i$, while each column index $j$ represents the j-th detected box $D_j$. The overlap between $G_i$ and $D_j$ is defined as:

$$\mathrm{IoU}(G_i, D_j) = \frac{\mathrm{Area}(G_i \cap D_j)}{\mathrm{Area}(G_i \cup D_j)}.$$
A text region is considered correctly detected if the overlap exceeds 0.5. When multiple detected regions match the same ground truth box, only the detection with the highest IoU value is counted as a correct match to avoid duplicate counting.
For each test image, the precision (P), recall (R), and F-measure (F) are calculated as follows:

$$P = \frac{|TP|}{|D|}, \qquad R = \frac{|TP|}{|G|}, \qquad F = \frac{2PR}{P + R},$$

where $|TP|$ denotes the number of correctly detected text regions according to the IoU criterion.
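A minimal Python sketch of this protocol for axis-aligned rectangles is given below; the greedy best-match loop and function names are illustrative simplifications, and curved-text datasets replace the rectangle geometry with polygon areas.

```python
# A minimal sketch of the IoU protocol above for axis-aligned rectangles
# (x1, y1, x2, y2); each ground truth keeps only its best detection.

def iou(g, d):
    w = min(g[2], d[2]) - max(g[0], d[0])
    h = min(g[3], d[3]) - max(g[1], d[1])
    inter = max(w, 0) * max(h, 0)
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    return inter / (area_g + area_d - inter)

def detection_prf(gts, dets, thresh=0.5):
    """Greedy matching: count a true positive when the best unused detection exceeds the threshold."""
    matched_dets = set()
    tp = 0
    for g in gts:
        scores = [(iou(g, d), j) for j, d in enumerate(dets) if j not in matched_dets]
        if not scores:
            continue
        best_iou, best_j = max(scores)
        if best_iou > thresh:
            tp += 1
            matched_dets.add(best_j)
    p = tp / len(dets) if dets else 0.0
    r = tp / len(gts) if gts else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

if __name__ == "__main__":
    gts = [(0, 0, 100, 30), (200, 0, 300, 30)]
    dets = [(5, 2, 102, 33), (400, 0, 450, 30)]
    print(detection_prf(gts, dets))  # (0.5, 0.5, 0.5)
```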
The IoU (Intersection over Union) metric has these key features:
Advantages: IoU provides a simple, efficient way to measure overlap between detected text regions and ground truth. Its intuitive design ensures broad applicability and easy interpretation for text detection tasks.
Disadvantages: However, IoU is sensitive to minor box misalignments. This can result in misleading evaluations when small positional differences don’t substantially affect detection quality.
7.2.4. TedEval Evaluation Metric
For a comprehensive evaluation of scene text detectors, the TedEval metric [
127] is proposed to address limitations in existing metrics such as granularity and character completeness. The evaluation consists of two main stages: instance-level matching and character-level scoring.
First, an instance-level matching matrix M of size $m \times n$ is constructed, where $M_{ij} = 1$ indicates a valid match between ground truth $G_i$ and detection $D_j$. The matching policy accommodates one-to-one, one-to-many, and many-to-one relationships with an area recall threshold of 0.4. Multiline cases are rejected using angle constraints, where the angle is computed between bounding boxes in potential many-matches.
For character-level scoring, Pseudo Character Centers (PCC) are generated for each ground truth instance $G_i$ of word length $l_i$ by splitting its bounding box into $l_i$ equal parts and taking the center of each part. A character matching matrix tracks whether a matched detection $D_j$ contains PCC $c_k$. The recall $R$ and precision $P$ are then computed from the number of correct character matches: recall normalizes this count by the total number of PCCs in the ground truth, while precision normalizes it by the number of characters claimed by the detections. The final metrics aggregate these instance-level scores over the dataset and combine them into an F-measure.
This approach provides fair evaluation by considering both detection localization and character-level completeness while being robust to annotation inconsistencies.
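The sketch below illustrates one plausible way to generate and test pseudo character centers, assuming horizontal word boxes split evenly along their width; the published TedEval code handles quadrilaterals and rotated boxes, so this is only a simplified approximation of the idea.

```python
# A minimal sketch of pseudo character center (PCC) generation under the
# assumption of a horizontal word box (x1, y1, x2, y2) split into `length`
# equal segments along its width.

def pseudo_character_centers(box, length):
    """Return `length` evenly spaced (x, y) centers across the word box."""
    x1, y1, x2, y2 = box
    cy = (y1 + y2) / 2.0
    step = (x2 - x1) / float(length)
    return [(x1 + step * (k + 0.5), cy) for k in range(length)]

def contains(box, point):
    """Check whether a detection box covers a PCC."""
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2

if __name__ == "__main__":
    gt_box, transcription = (0, 0, 60, 20), "TEXT"
    pccs = pseudo_character_centers(gt_box, len(transcription))
    det_box = (0, 0, 35, 20)  # covers roughly the first half of the word
    print(sum(contains(det_box, c) for c in pccs), "of", len(pccs), "PCCs covered")
```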
The TedEval evaluation framework demonstrates these key characteristics:
Advantages: TedEval offers comprehensive text detector assessment through dual instance-level and character-level evaluation. This approach combines geometric accuracy with recognition completeness, providing robustness against annotation inconsistencies and complex text layouts in real-world scenarios.
Disadvantages: The framework’s computational efficiency is impacted by its use of pseudo character centers and angle constraints for multiline matching. While its hierarchical evaluation prevents double-counting, the multi-granular annotation processing requires careful implementation to ensure consistency.
7.2.5. TIoU Evaluation Metric
The Tightness-aware Intersection-over-Union (TIoU) metric, as proposed by Liu et al. [
128], is designed to overcome significant shortcomings inherent in conventional evaluation methods for scene text detection. These shortcomings encompass assessments that are not aligned with the detection objectives, a lack of consideration for the tightness of detections, and vulnerabilities in one-to-many or many-to-one matching scenarios. TIoU offers a cohesive framework for quantifying three fundamental properties of scene text detection.
Firstly, it emphasizes completeness, ensuring that text instances are fully covered by the detected regions. Secondly, it focuses on compactness, aiming to minimize the inclusion of background elements or outliers within the detected areas. Lastly, TIoU incorporates tightness-awareness, which evaluates the quality of localization beyond simple binary thresholds, thereby providing a more nuanced assessment of detection performance.
- (1)
Mathematical Formulation
Considering a ground truth polygon $G_t$ and its corresponding detection polygon $D_i$, the framework defines two fundamental spatial discrepancy measures. The Not-Recalled Area, denoted $C_t$, quantifies the uncovered portion of the ground truth region:

$$C_t = \mathrm{Area}(G_t) - \mathrm{Area}(G_t \cap D_i).$$

Complementarily, the Outlier-GT Area $O_t$ captures non-target text inclusions within the detection, i.e., the area of other ground truth instances falling inside $D_i$:

$$O_t = \sum_{G_k \neq G_t} \mathrm{Area}(G_k \cap D_i).$$

- (2)
Core Metric Formulation
The evaluation framework defines penalization functions that quantify completeness and compactness properties:

$$f(C_t) = 1 - \frac{C_t}{\mathrm{Area}(G_t)}, \qquad g(O_t) = 1 - \frac{O_t}{\mathrm{Area}(D_i)},$$

which directly inform the primary quality scores. The TIoU-Recall metric ($\mathrm{TIoU}_R$) measures ground truth coverage:

$$\mathrm{TIoU}_R(G_t, D_i) = \mathrm{IoU}(G_t, D_i) \times f(C_t),$$

while TIoU-Precision ($\mathrm{TIoU}_P$) assesses detection purity:

$$\mathrm{TIoU}_P(G_t, D_i) = \mathrm{IoU}(G_t, D_i) \times g(O_t).$$
- (3)
Matching Protocol
The pairing mechanism operates under three fundamental constraints. Detection-GT pairs exhibiting IoU values below 0.5 are immediately discarded. Each ground truth element is matched exclusively to its highest-scoring detection counterpart, with reciprocal matching applied to detections. All resulting scores maintain continuous values within the unit interval $[0, 1]$.
- (4)
Aggregate Scores
The dataset-level performance assessment aggregates instance-level scores into comprehensive metrics. Recall is computed as the mean TIoU-Recall across all ground truth instances:

$$R_{\mathrm{TIoU}} = \frac{\sum_{t=1}^{|G|} \mathrm{TIoU}_R(G_t, D_t^{*})}{|G|},$$

while precision represents the average TIoU-Precision over all detections:

$$P_{\mathrm{TIoU}} = \frac{\sum_{i=1}^{|D|} \mathrm{TIoU}_P(G_i^{*}, D_i)}{|D|},$$

where $D_t^{*}$ and $G_i^{*}$ denote the matched counterparts and unmatched instances contribute a score of zero. These component metrics are then synthesized into the unified TIoU F-measure through harmonic averaging:

$$F_{\mathrm{TIoU}} = \frac{2 \, P_{\mathrm{TIoU}} \, R_{\mathrm{TIoU}}}{P_{\mathrm{TIoU}} + R_{\mathrm{TIoU}}}.$$
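The following sketch computes the per-pair TIoU-Recall and TIoU-Precision scores from precomputed areas, mirroring the penalization functions defined above; the polygon geometry itself is deliberately left abstract.

```python
# A minimal sketch of the TIoU per-pair scores, assuming the relevant areas
# (polygon areas and intersections) have already been computed.

def tiou_scores(area_g, area_d, inter_gd, outlier_gt_area):
    """Return (TIoU-Recall, TIoU-Precision) for one matched pair.

    area_g          -- area of the ground truth polygon G_t
    area_d          -- area of the detection polygon D_i
    inter_gd        -- Area(G_t intersect D_i)
    outlier_gt_area -- total area of *other* ground truths inside D_i (O_t)
    """
    union = area_g + area_d - inter_gd
    iou = inter_gd / union
    not_recalled = area_g - inter_gd          # C_t
    f_ct = 1.0 - not_recalled / area_g        # completeness penalty
    g_ot = 1.0 - outlier_gt_area / area_d     # compactness penalty
    return iou * f_ct, iou * g_ot

if __name__ == "__main__":
    # A loose detection: covers 90% of the target text but drags in outlier text.
    recall, precision = tiou_scores(area_g=1000, area_d=1600, inter_gd=900,
                                    outlier_gt_area=200)
    print(round(recall, 3), round(precision, 3))
```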
- (5)
Multi-Granular Annotation Handling
For datasets exhibiting annotation granularity inconsistencies such as ICDAR 2015 [
13], the TIoU protocol employs a hierarchical evaluation strategy. First, text-line annotations are programmatically generated from existing word-level ground truths. During evaluation, detected regions are matched against both annotation levels through a suppression mechanism where text-line matches automatically override word-level evaluations in overlapping areas. Unmatched regions subsequently undergo assessment at the word level.
To optimize scoring for text-line detections, the recall calculation incorporates an adjustment factor defined with respect to the text-line annotation $G_l$. This multi-tiered approach ensures fair performance assessment of detection methods operating at different granularities while effectively preventing double-counting artifacts [128].
The TIoU (Tightness-aware Intersection over Union) metric has the following key characteristics:
Advantages: TIoU improves upon traditional IoU by incorporating completeness, compactness, and tightness-awareness. This allows more nuanced evaluation, particularly in challenging scenarios where standard IoU fails to capture subtle detection quality differences.
Disadvantages: The enhanced evaluation comes at a cost. TIoU’s additional parameters may increase computational complexity, potentially affecting efficiency for large-scale datasets or real-time applications.
7.3. Evaluation Metrics for Scene Text Recognition
The evaluation of scene text recognition (STR) systems is crucial for assessing their performance and comparing different approaches. A variety of metrics have been proposed and widely used in the literature to evaluate the accuracy and efficiency of STR models. This subsection provides an overview of the most commonly used evaluation metrics for scene text recognition.
7.3.1. Word-Level Accuracy
Word-level accuracy is one of the most straightforward and widely used metrics for evaluating STR systems. It measures the proportion of correctly recognized words out of the total number of words in the test dataset. Mathematically, it can be expressed as:

$$\mathrm{Word\ Accuracy} = \frac{N_{\mathrm{correct\ words}}}{N_{\mathrm{total\ words}}}.$$
This metric is intuitive and easy to understand, but it may not fully capture the performance of STR systems in cases where partial recognition is still useful. Despite this limitation, word-level accuracy remains a standard metric in many STR benchmarks [
20,
87].
The Word-Level Accuracy metric for scene text recognition has the following strengths and limitations:
Advantages: This metric offers an intuitive performance measure by computing the ratio of correctly recognized words to the total word count. Its simplicity enables straightforward comparisons across different systems and benchmarks, making it a widely adopted standard for evaluating overall word recognition performance.
Disadvantages: However, Word-Level Accuracy has limitations. Its binary (correct/incorrect) scoring overlooks partially correct recognitions that may still be practically useful. This can lead to underestimating system performance, particularly for challenging cases like occluded, blurred, or stylized text commonly found in real-world scenarios.
7.3.2. Character-Level Accuracy
Character-level accuracy is another important metric that evaluates the recognition performance at the character level. It measures the proportion of correctly recognized characters out of the total number of characters in the test dataset. This metric is particularly useful when dealing with multi-lingual texts or when the word boundaries are not clearly defined. The formula for character-level accuracy [61,129] is:

$$\mathrm{Character\ Accuracy} = \frac{N_{\mathrm{correct\ characters}}}{N_{\mathrm{total\ characters}}}.$$
Character-level accuracy provides a more granular view of the recognition performance and is often used in conjunction with word-level accuracy.
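A minimal sketch of both accuracy variants is shown below; it assumes case-sensitive exact matching and a naive position-wise character comparison, whereas practical evaluations often align characters via edit distance before counting.

```python
# A minimal sketch of word-level and character-level accuracy as defined above.

def word_accuracy(preds, gts):
    """Fraction of predictions that exactly match their ground-truth words."""
    correct = sum(p == g for p, g in zip(preds, gts))
    return correct / len(gts)

def character_accuracy(preds, gts):
    """Fraction of ground-truth characters matched position-wise by the predictions."""
    correct = total = 0
    for p, g in zip(preds, gts):
        total += len(g)
        correct += sum(pc == gc for pc, gc in zip(p, g))
    return correct / total

if __name__ == "__main__":
    gts = ["STOP", "EXIT", "CAFE"]
    preds = ["STOP", "EXIT", "CAFF"]
    print(word_accuracy(preds, gts))        # 0.666...
    print(character_accuracy(preds, gts))   # 11/12
```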
The Character-Level Accuracy metric for scene text recognition has the following strengths and limitations:
Advantages: This metric provides fine-grained performance evaluation by measuring correctly identified characters against the total character count. Its granularity enables detailed system analysis, especially valuable for multilingual text recognition and cases with ambiguous word boundaries. The sensitivity to individual character errors makes it ideal for applications requiring precise transcription.
Disadvantages: While informative, Character-Level Accuracy may over-penalize minor errors that don’t affect semantic understanding. The character-level focus can sometimes overlook overall word recognition quality, particularly when partial matches remain useful. Additionally, its computational complexity grows with text length, making it less efficient for large-scale evaluations than word-level metrics.
7.3.3. Normalized Edit Distance (NED)
Normalized Edit Distance (NED) is a metric that takes into account the number of edits (insertions, deletions, substitutions) required to transform the recognized text into the ground truth text. It is normalized by the length of the ground truth text to provide a measure of similarity between the recognized and ground truth texts. NED is particularly useful when the recognition system may produce partial or slightly incorrect results that are still semantically meaningful. The formula for NED is:

$$\mathrm{NED}(R, G) = \frac{\mathrm{ED}(R, G)}{|G|},$$

where R is the recognized text, G is the ground truth text, $|G|$ is the length of G, and $\mathrm{ED}(R, G)$ is the minimum number of edits required to transform R into G. NED has been used in several STR benchmarks to evaluate the robustness of recognition systems [
87,
88].
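The sketch below computes NED as defined above, using a standard dynamic-programming Levenshtein distance divided by the ground-truth length.

```python
# A minimal sketch of Normalized Edit Distance: Levenshtein distance divided
# by the ground-truth length, following the definition above.

def edit_distance(r, g):
    """Levenshtein distance between recognized text r and ground truth g."""
    prev = list(range(len(g) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, gc in enumerate(g, start=1):
            cost = 0 if rc == gc else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def ned(r, g):
    return edit_distance(r, g) / len(g)

if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3
    print(round(ned("H0TEL", "HOTEL"), 2))     # 0.2
```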
The Normalized Edit Distance (NED) metric assesses text recognition performance with these key features:
Advantages: NED precisely measures recognition accuracy by calculating character-level edit operations (insertions, deletions, substitutions) normalized by text length. This effectively evaluates partial matches and enables fair comparisons across different-length texts, making it ideal for assessing semantically correct but imperfect outputs.
Disadvantages: NED’s computation becomes more intensive with longer texts, potentially slowing large-scale evaluations. The metric may disproportionately penalize certain errors (like single-character mistakes) while overlooking semantic equivalence. Its character-based approach also may not fully reflect word-level understanding needs.
7.3.4. F1-Score
F1-score is a harmonic mean of precision and recall, which are two fundamental metrics in information retrieval and classification tasks. In the context of STR, precision measures the proportion of correctly recognized words out of all words recognized by the system, while recall measures the proportion of correctly recognized words out of all words in the ground truth. F1-score combines these two metrics to provide a balanced evaluation of the recognition performance:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
F1-score is particularly useful when dealing with imbalanced datasets or when both false positives and false negatives are important considerations [
14,
82].
The F1-Score metric for scene text recognition has the following key characteristics:
Advantages: By harmonically combining precision and recall, the F1-Score provides balanced performance evaluation, especially valuable for imbalanced datasets where accuracy alone would be misleading. Its single-score output enables direct comparison between different recognition methods.
Disadvantages: The metric may overlook subtle recognition errors in complex scenarios with partial or fragmented text. Its binary classification approach can oversimplify performance assessment in varied text layouts. Additionally, the dependence on predefined matching thresholds introduces potential evaluation subjectivity.
7.3.5. Other Metrics
In addition to the above metrics, several other metrics have been proposed to evaluate specific aspects of STR systems. For example, some works have used metrics based on the Levenshtein distance to measure the similarity between recognized and ground truth texts [
104,
130]. Others have proposed metrics that take into account the confidence scores of the recognition results or the computational efficiency of the recognition system [
22,
90].
7.4. Evaluation Metrics for Scene Text Spotting
In the field of scene text spotting, which encompasses both scene text detection and recognition, accurate evaluation protocols are crucial for assessing the performance of different algorithms. This section provides a detailed overview of two prominent evaluation metrics: Normalized Score (NS) and Generalized F-measure (GF), which are widely used in the literature to evaluate scene text spotting systems [
6,
131].
7.4.1. Normalized Score (NS)
The Normalized Score (NS) is a metric designed to evaluate the accuracy of scene text spotting algorithms by focusing on character-level differences between the ground truth and the predicted text. It is particularly useful in handling one-to-many and many-to-one matching cases, which are common in scene text spotting due to the complexities of text layout and recognition challenges.
Given a set of ground truth text boxes G and a set of predicted text boxes D for an image I, the NS is calculated from the character-level edit distances of the matched pairs, where N is the number of matched pairs $(g_i, d_i)$ obtained through a pair matching algorithm followed by a set merging algorithm, $\mathrm{ED}(\cdot, \cdot)$ represents the Edit Distance function, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other, and $|\cdot|$ denotes the length of a text sequence.
The NS metric emphasizes character-level accuracy, making it particularly suitable for evaluating the performance of OCR systems in scenarios where exact character matches are critical. However, it should be noted that NS scores are not directly comparable across different datasets due to their dependency on the specific text content and layout of each dataset.
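For illustration only, the sketch below averages a per-pair score of 1 − ED/max(length) over already-matched pairs; this aggregation is an assumed form, and the exact normalization together with the pair matching and set merging steps should be taken from the original protocol.

```python
# An illustrative sketch of a normalized, edit-distance-based score over
# already-matched (ground_truth, prediction) pairs. The per-pair score
# 1 - ED / max(len) is an assumption for demonstration purposes.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def normalized_score(matched_pairs):
    """matched_pairs: list of (ground_truth_text, predicted_text) tuples."""
    if not matched_pairs:
        return 0.0
    scores = [1.0 - edit_distance(g, d) / max(len(g), len(d), 1)
              for g, d in matched_pairs]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(normalized_score([("OPEN", "OPEN"), ("CLOSED", "CL0SED")]))  # ~0.917
```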
The Normalized Score (NS) metric evaluates scene text spotting performance with the following strengths and limitations:
Advantages: NS measures character-level accuracy using edit distance, making it ideal for OCR evaluation where precise text matching matters. It handles complex text layouts (e.g., one-to-many matches) and remains robust against partial recognition errors, providing finer-grained assessment than word-level metrics.
Disadvantages: NS scores depend on dataset content, complicating cross-dataset comparisons. While sensitive to character errors, it may overlook semantic similarity (e.g., synonyms or morphological variations). Additionally, edit distance calculations can be computationally expensive for large-scale evaluations.
7.4.2. Generalized F-Measure (GF)
While the F-measure is a commonly used metric in object detection tasks, it may not be directly applicable to scene text spotting due to the complexities of text layout and the need to evaluate both detection and recognition performance simultaneously. To address this limitation, the Generalized F-measure (GF) was proposed specifically for evaluating scene text spotting algorithms [
6].
The GF metric evaluates the performance of scene text spotting by considering both the geometric relationship between predicted and ground truth text boxes and the accuracy of the recognized text. Specifically, a predicted text box $d$ is considered to accurately spot a ground truth text box $g$ if its recognized text matches the ground truth transcription and the following condition is met:

$$\frac{\mathrm{Area}(g \cap d)}{\mathrm{Area}(g)} > T,$$

where $\mathrm{Area}(\cdot)$ represents the area of a text instance, $g \cap d$ denotes the intersection area of two polygons, and T is a predefined threshold (typically set to 0.4 as in [6]).
Once the matching pairs are determined, the GF can be calculated using the standard F-measure formula, which combines Precision and Recall:

$$\mathrm{GF} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $\mathrm{Precision} = \frac{TP}{TP + FP}$ and $\mathrm{Recall} = \frac{TP}{TP + FN}$, with TP, FP, and FN representing the number of true positives, false positives, and false negatives, respectively.
The GF metric provides a more comprehensive evaluation of scene text spotting algorithms by considering both the spatial accuracy of text detection and the semantic accuracy of text recognition. It is particularly useful in scenarios where the text layout is complex or the text instances are densely packed, making it challenging to distinguish between individual text boxes.
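The sketch below shows how GF follows from the matching rule described above: a detection counts as a true positive when its overlap with an unmatched ground truth exceeds T and the recognized string equals the ground-truth transcription. Axis-aligned rectangles stand in for polygons for brevity, and the greedy matching loop is an illustrative simplification.

```python
# A minimal sketch of the GF computation once matches are decided: overlap
# with the ground-truth area above T plus an exact transcription match.

def overlap_ratio(g, d):
    """Intersection area divided by the ground-truth area (rectangles x1, y1, x2, y2)."""
    w = min(g[2], d[2]) - max(g[0], d[0])
    h = min(g[3], d[3]) - max(g[1], d[1])
    inter = max(w, 0) * max(h, 0)
    return inter / ((g[2] - g[0]) * (g[3] - g[1]))

def generalized_f_measure(gts, dets, T=0.4):
    """gts/dets: lists of (box, text). Returns the GF score."""
    matched_gt = set()
    tp = 0
    for d_box, d_text in dets:
        for i, (g_box, g_text) in enumerate(gts):
            if i in matched_gt:
                continue
            if overlap_ratio(g_box, d_box) > T and d_text == g_text:
                matched_gt.add(i)
                tp += 1
                break
    fp = len(dets) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

if __name__ == "__main__":
    gts = [((0, 0, 100, 30), "SALE"), ((200, 0, 300, 30), "OPEN")]
    dets = [((2, 1, 98, 29), "SALE"), ((205, 2, 290, 28), "OPEN"), ((400, 0, 450, 30), "EXIT")]
    print(round(generalized_f_measure(gts, dets), 2))  # 0.8
```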
The Generalized F-measure (GF) is a scene text spotting evaluation metric with the following characteristics:
Advantages: GF offers a balanced assessment by jointly evaluating geometric alignment (via IoU thresholds) and text recognition accuracy. This dual focus enables robust performance evaluation, especially in complex layouts where detection and recognition are interdependent. Its tolerance for minor positional variations and sensitivity to OCR errors make it well-suited for end-to-end text spotting systems.
Disadvantages: GF’s effectiveness depends on threshold selection, which may introduce subjectivity. It can also be affected by annotation inconsistencies, particularly for curved or rotated text. Additionally, its computational cost may hinder large-scale or real-time applications.
7.5. Metric Selection Principles for Scene Text Tasks
The diversity of evaluation metrics in scene text analysis stems from the inherent complexity of text detection and recognition tasks, where distinct performance dimensions require specialized measurement paradigms. This necessitates a multi-metric evaluation framework, where each indicator serves to complement others in capturing system capabilities. We now analyze the rationale for metric selection across three critical dimensions.
- (1)
Task-Specific Requirements
The choice of metrics is fundamentally guided by the inherent demands of the detection and recognition subtasks. For geometric accuracy in text detection, metrics such as DetEval, Intersection-over-Union (IoU), and Tightness-aware IoU (TIoU) are commonly employed. DetEval is particularly useful for handling complex matching scenarios, such as one-to-many correspondences. TIoU, on the other hand, penalizes background inclusion to enforce tight bounding boxes, which is crucial for applications like autonomous driving where precise localization is imperative. Although IoU is less sensitive to boundary tightness, its computational simplicity has led to its widespread adoption. For text recognition, textual correctness is evaluated using metrics like Normalized Edit Distance (NED) and Word Accuracy. NED quantifies character-level dissimilarity, accommodating partial matches and making it robust for multilingual or noisy text. Word Accuracy, in contrast, enforces exact matching requirements, directly reflecting end-user expectations in applications such as document digitization.
- (2)
Granularity and Robustness
Metrics can be categorized based on their evaluation scope and error sensitivity, which we refer to as granularity and robustness. In terms of spatial granularity, box-level IoU provides a coarse assessment of localization accuracy, whereas TIoU introduces tightness-awareness for finer granularity. This distinction is particularly important in domains like medical document analysis, where millimeter-level precision is essential. Regarding linguistic granularity, character-level metrics such as 1-NED reveal partial recognition capabilities, which are valuable for languages with ambiguous word boundaries or complex scripts. Word-level metrics, however, align more closely with user-centric performance metrics, as seen in commercial OCR systems.
- (3)
Application Context
The operational context of the application dictates the prioritization of specific metrics. For safety-critical systems, such as autonomous driving, geometric metrics like TIoU are essential to ensure minimal false positive regions, thereby enhancing safety. In contrast, user-facing systems, such as OCR for document processing, often prioritize character-level robustness to tolerate partial recognition errors while maintaining overall legibility. This balance ensures that the system remains functional and user-friendly, even in the presence of imperfect recognition results.
8. Performance Comparison and Analysis
In recent years, deep learning-based methods have significantly advanced the state-of-the-art in scene text detection and recognition (STR). To provide a comprehensive overview, we present comparative tables summarizing the performance of leading models on benchmark datasets for detection, recognition, and spotting tasks.
8.1. Evaluations of Scene Text Detection Performance
We present an overview of the benchmark results achieved by various deep learning-based scene text detection methods across multiple widely-used datasets, including ICDAR 2013 [
118], ICDAR 2015 [
13], MSRA-TD500 [
119], Total-Text [
14], CTW1500 [
84], and MLT 2017 [
112]. The methods are evaluated using standard metrics such as Precision, Recall, and F-measure (F1-score), which are commonly adopted in the field of object detection and scene text detection.
8.1.1. ICDAR 2013: Method Performance Overview
The ICDAR 2013 dataset has long served as a widely recognized benchmark for evaluating scene text detection and recognition methods. In this section, we present a comprehensive overview of the performance of several representative methods on this dataset, as summarized in
Table 4. The methods are categorized based on their underlying strategies, including segmentation-based, boundary points-based, hybrid and end-to-end, and regression-based approaches. Each category is assessed in terms of precision, recall, and F-measure, providing insights into their relative strengths and weaknesses.
- (1)
Segmentation-Based and Boundary Points-Based Methods
Segmentation-based methods, such as MFECN [
43], AB-LSTM [
40], and MSR [
132], have demonstrated robust performance on the ICDAR 2013 dataset. These methods leverage the pixel-level prediction capabilities of convolutional neural networks (CNNs) to accurately delineate text regions. MFECN, in particular, achieved an impressive F-measure of 93.37%, underscoring the efficacy of its multi-level feature enhanced cumulative network. By integrating features from multiple scales, MFECN is able to capture both local details and global context, leading to precise text region localization.
Boundary points-based methods, on the other hand, focus on accurately detecting the boundaries of text regions. CRAFT [
133] and Text-RPN [
134] are notable examples within this category. CRAFT, with an F-measure of 95.2%, employs character-level attention to identify individual characters and their relationships, resulting in precise text line detection. This method is particularly effective in handling curved or arbitrarily shaped text, where traditional bounding box-based methods may struggle.
- (2)
Hybrid, End-to-End, and Regression-Based Methods
Hybrid and end-to-end methods integrate multiple stages of text detection and recognition into a unified model, often benefiting from shared feature representations across tasks. SPCNET [
135], CLRS [
136], and TextSpotter [
137] are representative examples within this category. SPCNET, achieving an F-measure of 92.10%, utilizes a supervised pyramid context network to enhance text region localization while mitigating false positives. By combining the strengths of segmentation-based and regression-based approaches, hybrid methods are able to achieve state-of-the-art performance on the ICDAR 2013 dataset.
Regression-based methods, such as RRPN [
30], RRD [
138], and Textboxes [
38], directly predict the bounding boxes of text regions. RRPN, with an F-measure of 91.0%, introduces rotation-invariant features to handle text of arbitrary orientations, demonstrating computational efficiency and real-time performance capabilities. These methods are particularly suitable for applications where rapid text detection is essential, such as real-time video surveillance or augmented reality.
Additionally, two-stage detection methods like FEN [
139], which achieved an F-measure of 92.30% with multi-scale testing, utilize a feature enhancement network to refine region proposals and improve text detection accuracy. By incorporating additional contextual information and refining the initial region proposals, two-stage methods are able to achieve higher precision and recall rates compared to their single-stage counterparts.
Overall, the performance of scene text detection methods on the ICDAR 2013 dataset has witnessed substantial advancements over the years, driven by breakthroughs in deep learning techniques. Segmentation-based and hybrid methods have exhibited particularly strong performance, owing to their ability to accurately localize text regions at the pixel level. Despite this, regression-based methods remain competitive due to their computational efficiency and real-time performance, making them suitable for a wide range of applications. The choice of method ultimately depends on the specific requirements of the application, including factors such as accuracy, speed, and computational resources.
Table 4.
Quantitative performance comparison (%) of representative scene text detection methods on the ICDAR 2013 dataset, under the “DetEval” evaluation protocol. ‡ means multi-scale. All results are directly cited from their respective original publications to ensure the accuracy and reproducibility of the benchmarking. The metrics include Precision (P), Recall (R), F-measure (F), and FPS.
8.1.2. ICDAR 2015: Method Performance Overview
The ICDAR 2015 dataset stands as a widely recognized benchmark for evaluating scene text detection methods. It comprises a diverse collection of images featuring multi-oriented text instances within natural scenes, posing significant challenges for detection algorithms. This section presents a comprehensive overview of the performance of representative scene text detection methods on this dataset, as summarized in
Table 5. The methods are categorized based on their underlying approaches, including segmentation-based, regression-based, boundary points-based, and hybrid methods, each offering distinct advantages in addressing the complexities of scene text detection.
- (1)
Advancements in Segmentation-Based, Regression-Based, and Boundary Points-Based Methods
Segmentation-based methods have achieved remarkable progress by utilizing pixel-level annotations to accurately delineate text regions. Among these, DSAF [
145] demonstrated the highest precision of 89.6% and an F-measure of 86.8%, underscoring the efficacy of advanced segmentation techniques. Similarly, DBNet++ [
146] and MFECN [
43] exhibited competitive performance with F-measures of 87.3% and 86.18%, respectively. These results highlight the capability of segmentation-based methods in capturing the intricate details of text regions.
Table 5.
Quantitative performance comparison (%) of representative scene text detection methods on the ICDAR 2015 dataset. ‡ means multi-scale. All results are directly cited from their respective original publications to ensure the accuracy and reproducibility of the benchmarking. The metrics include Precision (P), Recall (R), F-measure (F), and FPS.
Regression-based methods, in contrast, directly predict the bounding boxes of text instances by regressing from anchor boxes or proposal regions. CT-Net [
149] emerged as a standout performer with an F-measure of 88.6%, demonstrating the potential of regression approaches in handling multi-oriented text. MOST [
150] and BDN [
151] also achieved competitive results, emphasizing the importance of robust anchor design and effective loss functions in enhancing the performance of regression-based methods.
Boundary points-based methods, which represent text instances by predicting a set of boundary points or key edges, excel in capturing the precise shape of text instances. Bezier [
148] achieved the highest F-measure of 88.8% within this category, showcasing its superiority in detecting curved or arbitrarily shaped text. CRAFT [
133] and Text-RPN [
134] also performed well, further validating the utility of boundary points-based methods in scene text detection.
- (2)
Hybrid Methods and Overall Performance Trends
Hybrid methods, which integrate elements from both segmentation-based and regression-based approaches, often incorporating end-to-end training strategies, have demonstrated the power of leveraging the strengths of multiple techniques. GNNets [
153] achieved an F-measure of 88.52%, showcasing the effectiveness of hybrid approaches in capturing both the global context and local details of text instances. SPCNET [
135] and CLRS [
136] also achieved competitive results, further substantiating the potential of hybrid methods in scene text detection.
Overall, the performance of scene text detection methods on the ICDAR 2015 dataset has witnessed significant improvements over the years. Segmentation-based methods excel in capturing precise shapes and boundaries, regression-based methods offer efficient localization capabilities, boundary points-based methods provide flexibility in representing arbitrarily shaped text, and hybrid methods combine these advantages to achieve state-of-the-art performance. These advancements reflect the ongoing efforts of researchers to develop more robust and accurate scene text detection algorithms capable of handling the complexities of natural scene text.
8.1.3. MSRA-TD500: Method Performance Overview
The MSRA-TD500 dataset stands as a pivotal benchmark for assessing scene text detection methods, particularly under challenging conditions characterized by multi-oriented and multi-scale text instances. As encapsulated in
Table 6, our comprehensive analysis underscores the remarkable strides facilitated by deep learning, which have substantially bolstered the accuracy and resilience of scene text detection algorithms on this demanding dataset. The performance metrics of various methods evaluated on MSRA-TD500 provide a nuanced understanding of their respective strengths and limitations in handling complex text configurations.
- (1)
Segmentation-Based and Hybrid Methods
Segmentation-based methods have emerged as frontrunners on the MSRA-TD500 dataset, capitalizing on the prowess of deep segmentation networks to accurately delineate text regions. Notably, DSAF [
145] and KAC [
155], both introduced in 2024, have achieved impressive precision and recall rates, yielding F-measures of 87.2% and 90.79%, respectively. These results underscore the efficacy of segmentation-based approaches in capturing the intricate details of text instances, even under complex layouts.
Hybrid methods, which amalgamate elements from both segmentation and regression-based strategies, have also demonstrated promising performance. LayoutFormer [
156], a hybrid method introduced in 2024, achieved an F-measure of 90.10%, highlighting the benefits of integrating multiple techniques to enhance detection accuracy. Similarly, earlier hybrid methods such as CLRS [
136] and MCN [
141] have achieved competitive results, further attesting to the versatility of hybrid methods in adapting to diverse text scenarios.
- (2)
Boundary Points-Based and Regression-Based Methods
Boundary points-based methods, which focus on detecting text boundaries through key points or contours, have also exhibited competitive performance on the MSRA-TD500 dataset. BPDO [
36], introduced in 2024, achieved the highest F-measure of 91.47% among all reviewed methods, demonstrating the effectiveness of boundary points-based approaches in accurately localizing text instances. TextBPN++ [
34] and Bezier [
148], proposed in subsequent years, also achieved high F-measures, emphasizing the potential of boundary points-based methods in scene text detection.
Regression-based methods, traditionally robust in general object detection, have been successfully adapted for scene text detection. CT-Net [
149], introduced in 2023, achieved an F-measure of 87.5%, showcasing the capability of regression-based methods in handling multi-oriented text instances. Similarly, earlier regression-based methods such as MOST [
150] and PCR [
157] have achieved competitive results, reflecting the ongoing advancements in regression-based scene text detection.
Table 6.
Quantitative performance comparison (%) of representative scene text detection methods on the MSRA-TD500 dataset. All results are directly cited from their respective original publications to ensure the accuracy and reproducibility of the benchmarking. The metrics include Precision (P), Recall (R), F-measure (F), and FPS.
8.1.4. Total-Text: Method Performance Overview
The Total-Text dataset serves as a crucial benchmark for evaluating scene text detection methods, primarily owing to its inclusion of text instances with diverse shapes, such as curved and arbitrarily oriented text. This section provides a quantitative overview of the performance of representative scene text detection methods on the Total-Text dataset, as summarized in
Table 7. The dataset’s complexity poses significant challenges, rendering it an ideal platform for assessing the robustness and accuracy of various detection algorithms.
- (1)
Segmentation-Based and Boundary Points-Based Methods
Segmentation-based methods have made remarkable progress in detecting text instances on the Total-Text dataset. For instance, KAC [
155], a recent advancement in this category, achieved an impressive F-measure of 90.98% in 2024. This performance underscores the effectiveness of kernel-based approaches in handling complex text layouts. Similarly, DSAF [
145], another segmentation-based method, attained an F-measure of 85.3% in the same year, indicating ongoing improvements in this field.
Boundary points-based methods have also demonstrated competitive performance. TextBPN++ [
34], for example, achieved an F-measure of 90.13% in 2023, highlighting the efficacy of boundary points in capturing the geometric layout of arbitrarily shaped text. DPText-DETR [
35], another boundary points-based method, achieved an F-measure of 89.0% in 2023, further validating the potential of this approach.
- (2)
Hybrid and End-to-End Methods
Hybrid and end-to-end methods have gained increasing attention due to their ability to combine the strengths of different approaches. ERRNet [
159], a hybrid method introduced in 2025, achieved an F-measure of 89.9% on the Total-Text dataset. This performance showcases the effectiveness of explicit relational reasoning in scene text detection. LayoutFormer [
156], an end-to-end method, achieved an F-measure of 87.12% in 2024, demonstrating the potential of transformer-based architectures in this field.
- (3)
Regression-Based Methods
Regression-based methods have also shown promising results on the Total-Text dataset. CT-Net [
149], a regression-based method introduced in 2023, achieved an F-measure of 87.8%, highlighting the effectiveness of contour transformer networks in handling complex text shapes. PCR [
157], another regression-based method, achieved an F-measure of 85.2% in 2021, indicating ongoing improvements in this category.
- (4)
Emerging Trends and Challenges
The performance of various methods on the Total-Text dataset highlights several emerging trends and challenges in scene text detection. Firstly, segmentation-based and boundary points-based methods continue to dominate due to their ability to handle complex text layouts. However, hybrid and end-to-end methods are gaining traction due to their capacity to combine the strengths of different approaches, leading to improved performance.
Secondly, the inclusion of curved and arbitrarily oriented text instances in the Total-Text dataset poses significant challenges for detection algorithms. Methods that can effectively capture the geometric layout of text, such as those based on boundary points or explicit relational reasoning, tend to perform better.
Finally, the fine-tuning and long-tailed problems highlighted in [
160] also impact the performance of scene text detection methods on the Total-Text dataset. Methods that can generalize well across different domains and handle rare and complex text categories are likely to perform better in real-world applications. Addressing these challenges and leveraging emerging trends will be crucial for advancing the field of scene text detection.
8.1.5. CTW1500: Method Performance Overview
The CTW1500 dataset serves as a critical benchmark for evaluating scene text detection methods, particularly due to its focus on long curved text instances that pose significant challenges for conventional detection algorithms. This section provides a quantitative overview of the performance of representative scene text detection methods on the CTW1500 dataset, as summarized in
Table 8. The dataset’s emphasis on complex text geometries makes it an ideal platform for assessing the robustness and adaptability of modern detection techniques across diverse scenarios.
- (1)
Segmentation-Based Methods Dominate Performance with Advanced Architectures
Segmentation-based methods have consistently demonstrated superior performance on the CTW1500 dataset, leveraging advancements in deep learning architectures to handle complex text layouts. KAC [
155], a recent kernel-aware clustering method, achieved an F-measure of 86.84% in 2024, highlighting the efficacy of context-aware segmentation strategies. Notably, MFECN [
43], another segmentation-based approach, attained an F-measure of 87.94% in 2021, further emphasizing the strength of segmentation techniques in capturing intricate text boundaries. DBNet++ [
146], despite being slightly older, still managed to achieve a competitive F-measure of 85.3% in 2022, underscoring the enduring effectiveness of well-engineered segmentation models.
- (2)
Boundary Points-Based Methods Show Competitive Edge with Precise Localization
Boundary points-based methods have also emerged as strong contenders on the CTW1500 dataset, particularly those that leverage precise boundary localization strategies. DPText-DETR [
35], a transformer-based method utilizing dynamic points, achieved an impressive F-measure of 88.8% in 2023, demonstrating the power of boundary-aware attention mechanisms. TextBPN++ [
34], with its adaptive boundary proposal network, secured an F-measure of 86.49% in 2023, showcasing the benefits of iterative boundary refinement. Even older methods like CRAFT [
133], which focuses on character-level boundary detection, managed to achieve an F-measure of 83.5% in 2019, indicating the longevity of boundary points-based approaches.
- (3)
Regression-Based Methods and Hybrid Approaches Offer Alternative Solutions
While segmentation and boundary points-based methods dominate the leaderboard, regression-based and hybrid approaches also provide valuable insights. CT-Net [
149], a regression-based method, achieved an F-measure of 86.1% in 2023, highlighting the potential of contour transformation for complex text shapes. ERRNet [
159], a hybrid and end-to-end method, surpassed many competitors with an F-measure of 89.4% in 2025, demonstrating the effectiveness of explicit relational reasoning networks. LayoutFormer [
156], another hybrid approach, achieved an F-measure of 86.17% in 2024, emphasizing the importance of layout-aware modeling in scene text detection.
- (4)
Performance Trends and Future Directions
Analyzing the performance trends on the CTW1500 dataset reveals several key insights. Firstly, segmentation-based methods continue to push the boundaries of accuracy, leveraging advanced architectures and context-aware strategies. Secondly, boundary points-based methods are gaining traction, particularly those that incorporate transformer-based attention mechanisms for precise localization. Thirdly, regression-based and hybrid approaches offer promising alternatives, especially for handling complex text geometries and layouts. Looking ahead, future research may focus on integrating these diverse methodologies to create more robust and versatile scene text detection systems.
8.1.6. MLT 2017: Method Performance Overview
The MLT 2017 (Multi-Lingual Text) dataset serves as a pivotal benchmark for evaluating scene text detection methods in multilingual environments. It encompasses text instances across nine languages, featuring diverse scripts, orientations, and complexities. The dataset presents unique challenges due to its linguistic diversity and variable text layouts, making it an indispensable platform for assessing the global applicability of modern detection algorithms.
Our analysis of representative deep learning-based methods on the MLT 2017 dataset, summarized in
Table 9, highlights significant progress in addressing these challenges through specialized architectural innovations.
- (1)
Boundary Points-Based Methods Excel in Complex Scenarios
Boundary points-based methods have demonstrated exceptional performance on the MLT 2017 dataset, showcasing remarkable adaptability to multilingual text variations. TextBPN++ [
34], introduced in 2023, achieved an F-measure of 77.48% by leveraging adaptive boundary points to accurately capture text instances with arbitrary shapes and orientations. Similarly, RBOX [
148], which attained a 79.6% F-measure in 2022, employs a rotated bounding box representation to effectively handle multi-oriented text across different scripts.
Segmentation-based methods have also shown strong performance. MFECN [
43], with an F-measure of 78.60% in 2021, integrates multi-level feature enhancement to ensure robust cross-lingual detection. DBNet(ResNet-50) [
44], achieving a 74.7% F-measure in 2020, improves segmentation quality through differentiable binarization. Hybrid approaches, such as GNNets [
153] (74.54% F-measure, 2019), combine geometry-aware normalization with attention mechanisms, while SPCNET [
135] (74.1% F-measure, 2018) enhances feature representations through supervised pyramid contexts.
- (2)
Regression-Based Methods and Cross-Lingual Adaptability Trends
Although regression-based methods are less prevalent in the top rankings, they have demonstrated competitive performance through specialized design choices. BDN [
151], with a 76.3% F-measure in 2019, adopts omnidirectional detection with sequential-free box discretization, while MOST [
150], achieving a 76.7% F-measure in 2021, refines localization through multi-oriented frameworks. These results underscore the potential of regression strategies in cross-lingual settings.
Overall, boundary points-based methods currently lead the performance on the MLT 2017 dataset, closely followed by segmentation and hybrid approaches. This reflects the field’s progress in handling linguistic diversity and complex text layouts. As research continues to prioritize global applicability, future advancements are expected to further bridge performance gaps across scripts and enhance robustness in multilingual environments.
8.1.7. Text-RRC-ArT: Method Performance Overview
The Text-RRC-ArT (Robust Reading Competition on Arbitrary-Shaped Text) dataset has emerged as a cornerstone benchmark for evaluating scene text detection methods, particularly due to its emphasis on arbitrary-shaped, multi-oriented, and low-resolution text instances embedded in complex real-world scenarios. This section presents a quantitative analysis of representative deep learning-based methods on the Text-RRC-ArT dataset, as summarized in
Table 10, to elucidate the current state-of-the-art and persistent challenges in arbitrary-shaped text detection.
- (1)
Segmentation-Based Methods: Robustness in Complex Layouts
Segmentation-based approaches have demonstrated consistent robustness on the Text-RRC-ArT dataset, leveraging advanced architectures to handle irregular text geometries. KAC [
155], proposed in 2024, achieved an F-measure of 80.0% by integrating kernel-aware clustering for precise text region segmentation. This method addresses the challenges of irregular text shapes through a kernel-guided refinement mechanism, balancing localization accuracy with computational efficiency. The results underscore the enduring relevance of segmentation frameworks in capturing fine-grained text boundaries, particularly when combined with adaptive post-processing strategies. While slightly trailing boundary points-based methods in performance, segmentation-based approaches remain competitive due to their robustness in dense text regions and ability to handle overlapping text instances.
- (2)
Boundary Points-Based Methods: Dominance in Geometric Modeling
Boundary points-based methods have emerged as the leading performers on the Text-RRC-ArT dataset, owing to their explicit geometric modeling capabilities and adaptive refinement mechanisms. TextBPN++ [
34], introduced in 2023, attained an F-measure of 80.59% by extending its boundary points network with adaptive sampling and hierarchical refinement. This method effectively captures arbitrary-shaped text through dense boundary point prediction and geometric constraint modeling, demonstrating superior performance in handling curved, multi-oriented, and densely packed text instances. Similarly, DPText-DETR [
35], proposed in the same year, achieved an F-measure of 78.1% by integrating a transformer-based architecture with boundary point detection. This hybrid approach showcases the potential of attention mechanisms for arbitrary-shaped text localization, further emphasizing the dominance of boundary-aware methods in addressing the geometric complexity of the dataset.
- (3)
Regression-Based Methods: Challenges in Geometric Variability
Regression-based methods, while less prevalent in the top rankings, have shown notable performance gaps compared to their boundary points-based and segmentation-based counterparts. PCR [
157], proposed in 2021, achieved an F-measure of 74.0% through progressive contour regression with a multi-stage refinement pipeline. Despite its progressive refinement strategy, this method struggled with highly irregular text shapes, as evidenced by its relatively lower recall and precision. This observation highlights the inherent limitations of regression-based approaches in modeling the geometric variability of arbitrary-shaped text, particularly in low-resolution or heavily occluded scenarios. The challenges faced by regression-based methods suggest a need for architectural innovations to improve their geometric modeling capabilities and adaptability to complex text layouts.
- (4)
Performance Trends and Future Directions
Analyzing the performance trends on the Text-RRC-ArT dataset reveals several key insights. Firstly, boundary points-based methods currently dominate the leaderboard, benefiting from their explicit geometric modeling capabilities and adaptive refinement mechanisms. These methods excel in handling the arbitrary shapes and orientations prevalent in the dataset, outperforming segmentation-based and regression-based alternatives. Secondly, segmentation-based methods, while slightly trailing in performance, remain competitive due to their robustness in dense text regions and ability to capture fine-grained text boundaries. Finally, regression-based methods face challenges in modeling the complex geometries of arbitrary-shaped text, suggesting a need for architectural innovations to improve their geometric modeling capabilities.
Looking ahead, future research may focus on integrating boundary points-based and segmentation-based approaches to leverage their complementary strengths. For instance, hybrid methods that combine the geometric precision of boundary points-based approaches with the robustness of segmentation-based frameworks could offer improved performance across diverse text distributions. Additionally, exploring transformer-based architectures for enhanced global context modeling and developing domain adaptation techniques to improve generalization across real-world scenarios are promising directions. The Text-RRC-ArT dataset continues to serve as a critical benchmark for driving advancements in arbitrary-shaped scene text detection, pushing the boundaries of deep learning-based methods toward real-world applicability.
8.1.8. ICPR-MTWI: Method Performance Overview
The ICPR-MTWI (International Conference on Pattern Recognition-Multi-Type Web Images) dataset stands as a challenging benchmark for scene text detection, focusing on web images that contain text instances in diverse styles, orientations, and complex backgrounds. This dataset evaluates the robustness of detection methods in handling real-world scenarios where text exhibits significant variability in appearance and layout. In this section, we present a quantitative analysis of representative deep learning-based methods on the ICPR-MTWI dataset, as summarized in
Table 11, to highlight current trends and persistent challenges in web-based scene text detection.
- (1)
Boundary Points-Based Methods: Competitive Performance in Geometric Modeling
Boundary points-based methods have demonstrated competitive performance on the ICPR-MTWI dataset, leveraging their ability to model the geometric properties of text instances. RBOX [
148], proposed in 2022, achieved an F-measure of 75.2% by integrating rotated bounding box regression with adaptive sampling strategies. This method effectively captures multi-oriented text through its boundary-aware representation, demonstrating robustness against complex backgrounds and varying text orientations. The results underscore the potential of boundary points-based approaches in accurately localizing text instances in web images, where traditional horizontal or vertical bounding boxes may fall short. The explicit boundary modeling capabilities of RBOX enable precise localization, even in scenarios with overlapping or densely packed text.
- (2)
Regression-Based Methods: Simplicity and Efficiency with Adaptive Enhancements
Regression-based methods have also made significant contributions to performance on the ICPR-MTWI dataset. MOST [
150], introduced in 2021, attained an F-measure of 74.7% through its multi-oriented scene text detector with localization refinement. This method combines a regression-based framework with a post-processing module to enhance detection accuracy, particularly in densely packed or overlapping text regions. Similarly, BDN [
150,
151], proposed in 2019 (with performance reported in 2021), achieved an F-measure of 73.4% by adopting omnidirectional scene text detection with sequential-free box discretization. Despite its slightly lower performance compared to MOST, BDN showcases the adaptability of regression-based strategies to handle text of varying orientations and scales in web images. The simplicity and efficiency of regression-based methods, combined with adaptive enhancements such as post-processing modules, make them competitive alternatives in web-based scene text detection.
- (3)
Performance Trends and Challenges
Analyzing the performance trends on the ICPR-MTWI dataset reveals several key insights. Firstly, boundary points-based methods excel in capturing the geometric variability of text instances, particularly in multi-oriented and curved text scenarios. These methods benefit from their explicit boundary modeling capabilities, which enable precise localization even in complex backgrounds. Secondly, regression-based methods, while slightly trailing in performance, remain competitive due to their simplicity and efficiency. Methods like MOST and BDN demonstrate that regression-based frameworks can be enhanced with post-processing modules or adaptive sampling strategies to improve detection accuracy in challenging web-based scenarios.
However, the dataset also highlights several challenges. The relatively lower F-measures across all methods suggest that web images pose significant difficulties due to their diverse text styles, complex backgrounds, and varying image qualities. Future research may need to focus on developing more robust feature extraction techniques that can handle the variability in text appearance and layout. Additionally, integrating multi-modal information, such as text semantics and image context, could provide valuable cues for improving detection accuracy. Exploring domain adaptation strategies to improve generalization across different web image domains is also a promising direction.
- (4)
Future Directions and Conclusion
In conclusion, the ICPR-MTWI dataset serves as a valuable benchmark for evaluating scene text detection methods in real-world web image scenarios. While boundary points-based and regression-based methods have shown promising performance, there is still room for improvement in handling the diverse and complex nature of web-based text instances. As research continues to advance, we expect to see more innovative approaches that address these challenges and push the boundaries of scene text detection in web images. Future directions may include integrating boundary points-based and regression-based methods to leverage their complementary strengths, exploring transformer-based architectures for enhanced global context modeling, and developing domain adaptation techniques to improve generalization across diverse web image domains. The ICPR-MTWI dataset will continue to drive advancements in scene text detection, pushing the field toward more robust and versatile solutions for real-world applications.
8.1.9. RRC-MLT-2019 and RCTW: Method Performance Overview
The RRC-MLT-2019 and RCTW datasets serve as pivotal benchmarks for assessing the efficacy of scene text detection methods, owing to their distinct focuses on multi-lingual and complex text scenarios, respectively. As detailed in
Table 12, these datasets present formidable challenges due to the wide variety of text styles, languages, orientations, scales, and intricate backgrounds encountered in real-world applications. This section provides an overview of the performance of representative scene text detection methods on these two challenging datasets. All results presented are directly sourced from their respective original publications, ensuring the accuracy and reproducibility of the benchmarking process.
- (1)
Segmentation-Based Methods Excel in Multi-Lingual and Complex Text Scenarios
The RRC-MLT-2019 dataset, with its emphasis on multi-lingual scene text detection, has witnessed the superior performance of segmentation-based methods. DBNet++ [
146] and DBNet [
44,
146] are notable examples that have demonstrated competitive performance. DBNet++ achieved an F-measure of 71.4% in 2022, while DBNet attained an F-measure of 70.4% in the same year. These results underscore the effectiveness of segmentation-based methods in modeling text regions at a pixel level, enabling them to handle complex text layouts and multi-lingual scenarios with remarkable accuracy.
- (2)
Regression-Based Methods Show Robustness in Handling Challenging Text Scenarios
On the RCTW dataset, which evaluates scene text detection methods in complex real-world scenarios, regression-based methods have shown their robustness. RRD [
138], a regression-based method, achieved an F-measure of 67.0% in 2018. Although slightly lower in performance compared to some segmentation-based methods, RRD demonstrates the simplicity and efficiency of regression-based approaches in handling text instances with varying orientations and scales. This highlights the potential of regression-based methods in scenarios where text geometry is complex and diverse.
- (3)
Performance Trends and Insights Across Datasets
Analyzing the performance trends on both datasets reveals several key insights. Firstly, segmentation-based methods continue to be the dominant force in handling complex text layouts and multi-lingual scenarios, leveraging their ability to accurately model text regions. Secondly, regression-based methods, while slightly trailing in performance, offer a valuable alternative due to their simplicity and efficiency in handling challenging text scenarios.
However, the relatively lower F-measures across all methods on both datasets suggest that multi-lingual and complex text scenarios pose significant challenges. Future research may need to focus on developing more robust feature extraction techniques that can handle the variability in text styles, languages, and complex backgrounds. Additionally, integrating multi-modal information, such as text semantics and image context, could provide valuable cues for improving detection accuracy in these challenging scenarios.
- (4)
Future Directions and Potential Improvements
Looking ahead, the RRC-MLT-2019 and RCTW datasets will continue to serve as valuable benchmarks for evaluating scene text detection methods. As research continues to advance, we expect to see more innovative approaches that integrate diverse methodologies, such as combining segmentation-based and regression-based techniques, to create more robust and versatile scene text detection systems. Additionally, the exploration of new deep learning architectures, attention mechanisms, and multi-modal fusion strategies holds promise for further improving the performance of scene text detection methods in multi-lingual and complex text scenarios.
In conclusion, the RRC-MLT-2019 and RCTW datasets provide critical insights into the performance of scene text detection methods in multi-lingual and complex text scenarios. While segmentation-based and regression-based methods have shown promising performance, significant room for improvement remains in handling the diverse and complex nature of these text instances.
8.2. Evaluations of Scene Text Recognition Performance
We present a systematic evaluation of state-of-the-art deep learning approaches for scene text recognition across major benchmark datasets, including: ICDAR 2013 [
118], ICDAR 2015 [
13], Street View Text (SVT) [
29], IIIT5K-Words (IIIT5K) [
15], CUTE80 (CUTE) [
117], COCO-Text [
113], CTW1500 [
105], Total-Text [
14], RCTW [
111], Uber [
161], Text-RRC-ArT (ArT) [
108], ICDAR 2019-LSVT (LSVT) [
107,
122], ReCTS [
162], RRC-MLT-2019 (MLT19) [
82], TextOCR [
85], and HierText [
86]. Our evaluation employs standard metrics, with recognition accuracy as the primary measure complemented by edit distance-based metrics to capture partial recognition errors through character-level operations (insertions, deletions, or substitutions).
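To make the edit distance-based component of this protocol concrete, the following sketch computes the character-level Levenshtein distance and a normalized (1 - NED) score for a predicted transcription against its ground truth. It is an illustrative implementation under common conventions rather than the scoring code of any specific benchmark, and the function names `levenshtein` and `normalized_score` are our own.

```python
def levenshtein(pred: str, gt: str) -> int:
    """Character-level edit distance: the minimum number of insertions,
    deletions, and substitutions needed to turn `pred` into `gt`."""
    m, n = len(pred), len(gt)
    prev = list(range(n + 1))              # distances between "" and gt[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n               # distance between pred[:i] and ""
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion from pred
                          curr[j - 1] + 1,     # insertion into pred
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]


def normalized_score(pred: str, gt: str) -> float:
    """1 - normalized edit distance: 1.0 for an exact match, lower with more errors."""
    if not pred and not gt:
        return 1.0
    return 1.0 - levenshtein(pred, gt) / max(len(pred), len(gt))


if __name__ == "__main__":
    print(levenshtein("SH0P", "SHOP"))                 # one substitution -> 1
    print(round(normalized_score("SH0P", "SHOP"), 2))  # 0.75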
8.2.1. Methodological Performance Comparison
The comprehensive evaluation across multiple datasets (
Table 13 and
Table 14) reveals distinct performance characteristics among different methodological approaches. Our analysis focuses on three critical aspects: (1) architectural effectiveness, (2) generalization capability across datasets, and (3) temporal performance evolution.
- (1)
Transformer-Based Architectures
Current transformer-based methods achieve state-of-the-art performance on standard benchmarks, with BUSNet-B [
60] reaching 98.5% accuracy on both SVT and ICDAR 2013 (
Table 13). However, their performance significantly degrades in complex scenarios, as evidenced by LISTER [
91] dropping to 49.0% on Uber and 70.1% on ArT (
Table 14), indicating limitations in handling irregular text distributions. The performance variation across language scripts is notable, with ViTSTR-S [
58] maintaining 89.4% on MLT19 but dropping to 72.9% on RCTW. While MAERec-B [
123] achieves the highest reported accuracy on CUTE (98.6%), its absence from multilingual evaluations suggests potential limitations in cross-lingual scenarios.
Table 13.
Quantitative performance comparison (%) of representative scene text recognition methods on the ICDAR 2013 (IC13), SVT, IIIT5K, ICDAR 2015 (IC15), SVTP, CUTE, COCO-Text (CTT), CTW1500, and Total-Text (TT) datasets. All results are directly cited from their respective original publications to ensure the accuracy and reproducibility of the benchmarking. * indicates the results are tested with the officially released models.
Table 14.
Quantitative performance comparison (%) of representative scene text recognition methods on the SVT, ICDAR 2013 (IC13), ICDAR 2015 (IC15), COCO-Text (CT), RCTW, Uber, Text-RRC-ArT (ArT), ICDAR 2019-LSVT (LT), ReCTS, RRC-MLT-2019 (MT), TextOCR (TR), and HierText (HT) datasets. All results are directly cited from their respective original publications to ensure the accuracy and reproducibility of the benchmarking.
NO | Method | Method’s Category | SVT | IC13 | IC15 | CT | RCTW | Uber | ArT | LT | ReCTS | MT | TR | HT | Year | Code & Dataset URLs (Last Accessed) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1 | BUSNet-B [60] | Transformer-Based | 98.5 | 98.5 | 98.0 | 79.4 | - | 83.2 | 83.4 | - | - | - | - | - | 2024 | https://github.com/jjwei66/BUSNet (1 July 2025) |
2 | LISTER [91] | Transformer-Based | 93.8 | 97.9 | 87.5 | 65.8 | - | 49.0 | 70.1 | - | - | - | - | - | 2023 | https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LISTER (1 July 2025) |
3 | ViTSTR-S [58] (reproduced) [171] | Transformer-Based | 92.3 | 97.0 | 81.8 | 77.0 | 72.9 | 77.4 | 86.9 | 73.7 | 88.5 | 89.4 | 80.4 | 83.2 | 2023 | - |
4 | ABINet-LV [61] (reproduced) [171] | Transformer-Based | 96.6 | 97.6 | 85.1 | 79.4 | 76.7 | 80.8 | 89.2 | 76.6 | 89.4 | 90.2 | 83.1 | 86.6 | 2023 | - |
5 | OTE [130] | Sequence-to-Sequence | 98.0 | 98.0 | 98.1 | 64.5 | - | 47.8 | 69.1 | - | - | - | - | - | 2024 | https://github.com/Xu-Jianjun/OTE (1 July 2025) |
6 | ABINet-LV [61] + CLIPTER [171] | Sequence-to-Sequence | 96.0 | 98.3 | 85.4 | 79.3 | 78.6 | 82.1 | 89.3 | 77.1 | 89.7 | 90.2 | 83.4 | 86.7 | 2023 | - |
7 | PARSeq [180] (reproduced) [171] | Sequence-to-Sequence | 96.1 | 98.9 | 85.7 | 80.5 | 81.4 | 83.2 | 91.2 | 80.2 | 91.8 | 91.5 | 85.2 | 87.4 | 2023 | - |
8 | PARSeq [180] + CLIPTER [171] | Sequence-to-Sequence | 96.6 | 99.1 | 85.9 | 81.0 | 82.1 | 84.4 | 91.7 | 81.8 | 91.8 | 91.6 | 86.0 | 88.0 | 2023 | - |
9 | TRBA-PR [174] (reproduced) [171] | Sequence-to-Sequence | 94.9 | 98.5 | 84.8 | 79.2 | 81.1 | 80.5 | 89.2 | 77.9 | 90.4 | 90.7 | 82.9 | 85.1 | 2023 | - |
10 | TRBA-PR [174] + CLIPTER [171] | Sequence-to-Sequence | 95.4 | 98.8 | 85.3 | 79.3 | 81.3 | 82.0 | 90.2 | 79.4 | 91.1 | 91.1 | 83.9 | 85.8 | 2023 | - |
11 | ViTSTR-S [58] + CLIPTER [171] | Sequence-to-Sequence | 93.4 | 97.1 | 82.3 | 77.7 | 75.3 | 79.6 | 88.2 | 76.0 | 89.5 | 89.9 | 81.8 | 84.0 | 2023 | - |
- (2)
Sequence-to-Sequence Models
This category demonstrates remarkable robustness, particularly when enhanced with multimodal features. PARSeq+CLIPTER [
171,
180] outperforms transformer baselines by 2.5% on ArT and 1.4% on HierText (
Table 14), achieving 91.7% and 88.0% respectively. The CLIPTER augmentation consistently improves performance across all sequence models, exemplified by TRBA-PR+CLIPTER’s 1.5% improvement on MLT19 (
Table 14). However, standalone sequence models like OTE-A/SVTR [
130] exhibit significant performance disparities, excelling on standard benchmarks (98.1% on IC15) but struggling on more challenging sets (47.8% on Uber and 69.1% on ArT).
- (3)
CTC-Based Approaches
While demonstrating competitive performance on standard benchmarks (DCTC [
176] achieves 97.4% on ICDAR 2013), CTC methods appear less prevalent in more challenging evaluations (
Table 14). The original CRNN [
20] remains surprisingly relevant with 97.6% on IIIT5K, but its performance on curved text (79.44% on CUTE in TPS-ResNet-BiLSTM-Attn [
85]) lags 15–20% behind contemporary methods.
- (4)
Cross-Methodological Analysis
The analysis reveals three key findings regarding scene text recognition methods. First, all methods demonstrate a clear complexity-performance relationship, exhibiting 20–30% performance degradation as text irregularity increases from IC13 to ArT, with Transformer architectures showing the most pronounced declines (34.5% for LISTER). Second, the consistent 1–3% performance gains across metrics through CLIPTER integration underscore the significant role of multimodal enhancement via visual-linguistic fusion in achieving robust recognition. Third, examining historical progress shows that while the improvement from CRNN (2016) to BUSNet-B (2024) represents a modest 3% absolute gain on IC13, this gap widens to 15–20% on challenging datasets like ArT, highlighting accelerated advancements in handling complex real-world scenarios.
Our analysis indicates that, although transformer-based architectures dominate standard benchmarks, hybrid sequence-to-sequence models with multimodal enhancements deliver noticeably more balanced performance across a wide spectrum of real-world conditions. This observation underscores the potential of architectural hybridization, which combines the strengths of different model types, for achieving robustness and versatility in practical applications. It also highlights deeper linguistic grounding, integrating richer linguistic knowledge and context-aware mechanisms, as a promising avenue for further improving text recognition systems.
8.2.2. Real-World Challenges and Insights
Real-world data presents significant challenges. As demonstrated in
Table 15, when models are trained on synthetic datasets and subsequently evaluated on the Union14M-Benchmark, average accuracy degrades by approximately 48.5% compared to their performance on common benchmarks. Similarly, when trained on Union14M-L, models still experience an average accuracy drop of 33.0% on the Union14M-Benchmark. These findings underscore that text images in real-world scenarios are considerably more complex than those in the six commonly used benchmarks.
Real-world data proves to be highly effective. Models trained on Union14M-L exhibit an average accuracy improvement of 3.9% on common benchmarks and a substantial 19.6% on the Union14M-Benchmark. The significant performance enhancement on the Union14M-Benchmark suggests that synthetic training data struggles to meet the intricate demands of real-world applications. In contrast, training with real data can substantially mitigate this generalization issue. Moreover, the relatively modest performance gains on common benchmarks hint at their saturation, indicating limited room for further improvement within these datasets.
Scene text recognition remains an unsolved problem. When trained solely on Union14M-L, the highest average accuracy achieved on the Union14M-Benchmark (excluding the incomplete text subset) is merely 74.6% (by MATRN [
165] in
Table 15). This result clearly demonstrates that STR is far from being fully resolved. Although leveraging large-scale real data can yield a certain degree of performance improvement, continued efforts and innovations are still imperative to advance the field.
Vocabulary reliance is a prevalent issue. When trained on synthetic datasets, all models experience a substantial performance decline on the incomplete text subset. Notably, language models exhibit a larger performance degradation (10.2%) compared to CTC-based models (5.6%) and attention-based models (5.9%). We hypothesize that the performance drop in language models may be attributed to their error correction behavior, where models attempt to complete incomplete text, treating it as a character missing error. This problem can be significantly alleviated when models are trained on Union14M-L, likely due to the larger vocabulary size in Union14M-L, which prevents models from overfitting the training corpus. However, this issue persists and necessitates further investigation to develop more robust solutions.
8.3. Evaluations of Scene Text Spotting Performance
We conduct a comprehensive evaluation of state-of-the-art deep learning approaches for scene text spotting, which encompasses both text detection and recognition tasks, across major benchmark datasets. These datasets include ICDAR 2013 [
118], ICDAR 2015 [
13], CTW1500 [
105], and Total-Text [
14]. These datasets offer a diverse range of text layouts, fonts, sizes, languages, and scene complexities, enabling us to thoroughly assess the performance of text spotting methods under various conditions.
Our evaluation employs a combination of standard metrics to comprehensively measure the performance of text spotting systems. The primary metrics include detection accuracy, recognition accuracy, and an end-to-end text spotting metric that combines both detection and recognition results. Detection accuracy is typically measured using precision, recall, and F1-score based on the intersection-over-union (IoU) between predicted and ground-truth bounding boxes. Recognition accuracy is assessed using the same metrics as in scene text recognition evaluations, with a focus on character-level accuracy and edit distance-based metrics to capture partial recognition errors. The end-to-end metric provides a holistic view of the system’s performance by considering both the accuracy of text detection and the subsequent recognition of the detected text.
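As a concrete reference for the detection side of these metrics, the following sketch computes IoU-based precision, recall, and F1 for axis-aligned boxes using greedy one-to-one matching at a 0.5 threshold. This is a simplified stand-in for protocols such as DetEval, whose matching rules (and handling of polygons) differ in detail; all names in the code are our own.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_prf(preds: List[Box], gts: List[Box], thr: float = 0.5):
    """Greedy one-to-one matching at an IoU threshold; returns (P, R, F1)."""
    matched_gt, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= thr:
            tp += 1
            matched_gt.add(best_j)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    preds = [(10, 10, 50, 30), (60, 10, 90, 25)]
    gts = [(12, 11, 49, 29), (200, 200, 240, 220)]
    print(detection_prf(preds, gts))  # (0.5, 0.5, 0.5): one true positive of two
```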
The extensive evaluation across multiple datasets (
Table 16 and
Table 17) highlights the distinct performance characteristics of different methodological approaches in scene text spotting. Our analysis delves into three critical aspects:
- (1)
Architectural Effectiveness: We examine how different architectural designs, such as one-stage vs. two-stage methods, transformer-based vs. CNN-based models, and the integration of attention mechanisms, impact the overall performance of text spotting systems. We assess their ability to accurately detect and recognize text under various conditions, including challenging scenarios with low-resolution, occlusion, and curved text.
- (2)
Generalization Capability Across Datasets: We evaluate the generalization capability of text spotting methods by testing them on datasets with different characteristics, such as varying text layouts, fonts, sizes, languages, and scene complexities. This analysis helps us understand how well a method can adapt to new and unseen data, which is crucial for real-world applications.
- (3)
Temporal Performance Evolution: We track the performance evolution of text spotting methods over time by comparing the results of recent studies with those from earlier works. This analysis allows us to identify trends in the field, such as the impact of new architectural designs, training strategies, or datasets on performance improvements. It also helps us understand the current state-of-the-art and identify areas for future research.
8.3.1. Performance Comparison on ICDAR 2013 and ICDAR 2015 Datasets
In this section, we present a comprehensive performance comparison of various scene text spotting methods on the ICDAR 2013 and ICDAR 2015 datasets. The results are summarized in
Table 16, which includes precision (P), recall (R), and F-measure (F) for detection (using DetEval), as well as end-to-end and word spotting results under strong (S), weak (W), and generic (G) lexicon conditions.
- (1)
ICDAR 2013 Dataset
For the ICDAR 2013 dataset, the FOTS method [
21] achieved an F-measure of 88.30% for detection and 88.81%, 87.11%, and 80.81% for end-to-end recognition under strong, weak, and generic lexicon conditions, respectively. The multi-scale (MS) version of FOTS [
21] further improved the detection F-measure to 92.82% and end-to-end F-measures to 91.99%, 90.11%, and 84.77% under the same lexicon conditions.
TextNet [
183] achieved a detection F-measure of 91.35% and end-to-end F-measures of 89.77%, 88.80%, and 82.96%. Mask TextSpotter [
22] showed strong performance with a detection F-measure of 91.70% and end-to-end F-measures of 92.22%, 91.10%, and 86.50%.
Boundary [
184] achieved a detection F-measure of 90.10% and end-to-end F-measures of 88.20%, 87.70%, and 84.10%. Text Perceptron [
185] and MANGO [
77] also demonstrated competitive performance, with MANGO achieving end-to-end F-measures of 93.40%, 92.30%, and 88.70%. The most recent method, TextTriangle [
7], achieved a detection F-measure of 90.90% and end-to-end F-measures of 91.80%, 90.90%, and 86.00%. Overall, the ICDAR 2013 dataset results indicate that recent methods, such as MANGO and TextTriangle, have significantly improved the end-to-end performance, especially under strong lexicon conditions.
- (2)
ICDAR 2015 Dataset
The ICDAR 2015 dataset poses significant challenges for scene text spotting, primarily due to its complex backgrounds and multi-oriented text instances. Early approaches such as FOTS [
21] achieved notable performance on this benchmark, with reported detection F-measures of 87.99% and end-to-end F-measures of 81.09%, 75.90%, and 60.80%. The multi-scale variant of FOTS further enhanced these results, reaching 89.84% detection F-measure and end-to-end F-measures of 83.55%, 79.11%, and 65.33%, demonstrating the effectiveness of multi-scale feature integration in handling diverse text orientations and scales.
As illustrated in
Figure 25, FOTS demonstrates robust performance in detecting and recognizing text under challenging conditions. These improvements highlight the importance of multi-scale feature learning in scene text spotting tasks.
Subsequent methods continued to refine performance on this challenging dataset. TextNet [
183] achieved a detection F-measure of 87.37% and end-to-end F-measures of 78.66%, 74.90%, and 60.45%, while Mask TextSpotter [
22] and its successor Mask TextSpotter v3 [
5] exhibited robustness, with the latter attaining end-to-end F-measures of 83.30%, 78.10%, and 74.20%. Notably, MANGO [
77] pushed the boundaries further, achieving end-to-end F-measures of 85.40%, 80.10%, and 73.90%, demonstrating its effectiveness in addressing the dataset’s complexities. PAN++ [
186] and PGNet [
187] also delivered competitive results, with PAN++ achieving an end-to-end F-measure of 82.70% under strong lexicon conditions, underscoring the importance of lexicon-aware modeling for improving recognition accuracy.
More recently, advancements in scene text spotting have been driven by innovative architectures and training strategies. Methods such as SwinTextSpotter [
59], SRSTS [
188], GLASS [
189], TTS [
3], ABINet++MS [
190], DeepSolo [
191], and ESTextSpotter [
192] have significantly pushed the performance envelope. For instance, DeepSolo [
191] achieved an impressive end-to-end F-measure of 88.10% under strong lexicon conditions, marking a substantial improvement over earlier methods. These advancements reflect the ongoing progress in designing more effective and robust models for scene text spotting, particularly in handling the intricate challenges posed by real-world text detection and recognition tasks.
- (3)
Comparison and Analysis
From the results, it is evident that the performance of scene text spotting methods has steadily improved over the years. The use of multi-scale features, attention mechanisms, and more sophisticated architectures has contributed to these improvements. Additionally, the integration of detection and recognition modules into a unified framework has led to better end-to-end performance.
The ICDAR 2015 dataset, being more challenging, has seen a slower but steady improvement in performance. Recent methods, such as DeepSolo [
191] and ESTextSpotter [
192], have demonstrated robust performance on this dataset, indicating that the field is making significant progress in handling complex scenes.
In conclusion, the performance comparison on the ICDAR 2013 and ICDAR 2015 datasets highlights the rapid advancements in scene text spotting technology. Future research should focus on further improving the robustness and accuracy of these methods, especially in challenging environments.
8.3.2. Performance Comparison of End-to-End Scene Text Spotting Methods
In this section, we present a comprehensive comparison of end-to-end scene text spotting methods, as summarized in
Table 17, focusing on the Total-Text and CTW1500 datasets. The evaluation metrics encompass precision (P), recall (R), and F-measure (F) for detection, along with end-to-end performance under both lexicon-free ('None') and lexicon-based ('Full') conditions. On the Total-Text dataset, we observe a significant evolution in end-to-end performance over the years. Early methods, such as Mask TextSpotter [
22] and TextDragon [
76], achieved relatively lower F-measures in the lexicon-free setting, with values of 52.90% and 48.80%, respectively. In contrast, recent methods like DeepSolo [
191] and ESTextSpotter [
192] have demonstrated remarkable progress, attaining F-measures of 83.60% and 80.80%, respectively, significantly outperforming their predecessors.
In the lexicon-based setting ('Full'), while the performance gap between methods is less pronounced, recent approaches still exhibit superior performance. For instance, ABINet++MS [
190] achieved an F-measure of 85.40%, which ranks among the highest reported on the Total-Text dataset. Similarly, on the CTW1500 dataset, the trends in end-to-end performance mirror those observed on Total-Text. Methods such as TextDragon [
76] and ABCNet v2 [
193] recorded relatively lower F-measures in the lexicon-free setting, with values of 39.70% and 57.50%, respectively. However, recent advancements, exemplified by TPSNet [
194] and ESTextSpotter [
192], have achieved F-measures of 60.50% and 64.90%, respectively, indicating substantial improvements.
Regarding detection performance, most methods achieve high precision and recall values on both datasets. Nevertheless, end-to-end performance remains more challenging, as it necessitates accurate detection and recognition of text instances. Recent methods that integrate advanced detection and recognition techniques, such as attention mechanisms and transformer architectures, have demonstrated superior end-to-end performance. Overall, the results in
Table 17 underscore the significant progress made in end-to-end scene text spotting over the years, with recent methods achieving remarkable improvements in both detection and recognition performance, particularly in the lexicon-free setting. However, there remains ample room for further enhancement, especially in addressing complex text layouts and challenging lighting conditions.
9. Efficiency and Implementation Details
While achieving high accuracy remains the paramount objective in scene text processing research, the practical deployment of these systems in real-world applications necessitates a careful balance between accuracy, computational efficiency, inference speed, and hardware requirements. This section offers a thorough analysis of these critical aspects across scene text detection, recognition, and spotting methods, complementing the accuracy-centric discussions presented in the preceding sections.
9.1. Computational Efficiency Across Tasks
Table 18 provides a summary of key efficiency metrics and implementation details for several representative methods in scene text processing. From this analysis, several notable trends emerge:
Training and deployment requirements for scene text processing models vary significantly with model complexity. For instance, detection models like DBNet [
44] achieve real-time performance with 32 FPS on relatively modest hardware configurations (dual 1080Ti GPUs), while transformer-based approaches such as SwinTextSpotter [
59] necessitate high-end A100 GPUs for training, albeit delivering competitive inference speeds of 10.2 FPS. The parameter count also exhibits a wide range, from 8.3M for lightweight recognition models like CRNN to 223.5M for large transformer architectures, directly influencing both training time and deployment feasibility.
9.2. Inference Speed Analysis
Inference speed, quantified in frames per second (FPS), provides valuable insights into the trade-offs between accuracy and efficiency. Regression-based detection methods, exemplified by EAST, maintain high FPS (13.2) through simplified processing pipelines, whereas segmentation-based approaches like PSENet achieve superior accuracy at the expense of reduced speed (1.6 FPS). In the realm of recognition tasks, transformer architectures such as ViTSTR surprisingly outperform traditional attention-based models with respect to inference speed (21.4 vs 10.1 FPS), owing to their parallel computation capabilities despite a larger parameter count.
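Because reported FPS figures depend strongly on the measurement protocol (input resolution, warm-up, GPU synchronization), the following sketch shows one reasonable way to benchmark single-image throughput in PyTorch. The model, input shape, and iteration counts are placeholders, and faithful comparisons should match each paper's settings.

```python
import time

import torch
import torchvision


@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 640, 640),
                warmup: int = 5, iters: int = 20) -> float:
    """Rough single-image throughput: warm up, then time `iters` forward passes."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):              # warm-up: cuDNN autotuning, allocations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()         # wait for all queued kernels to finish
    return iters / (time.perf_counter() - start)


if __name__ == "__main__":
    # A ResNet-18 backbone stands in for a detector; results are hardware-dependent.
    print(f"{measure_fps(torchvision.models.resnet18()):.1f} FPS")
```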
9.3. Implementation Optimizations
To enhance the efficiency of scene text processing systems, several strategies have been proposed and successfully implemented:
The judicious selection of architecture design plays a pivotal role in balancing accuracy and computational efficiency. Lightweight backbones, such as MobileNet integrated in DBNet [
44], effectively reduce computation while preserving accuracy.
Quantization techniques, particularly 8-bit quantization, have been shown to achieve speedups ranging from 2 to 4 times with minimal accuracy degradation [
196], making them a viable option for deploying scene text processing models on resource-constrained devices.
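As an illustration of how such 8-bit quantization can be applied in practice, the sketch below uses PyTorch's post-training dynamic quantization on a toy recognizer head. `TinyRecognizerHead` is a hypothetical stand-in, and the scheme evaluated in [196] may differ (e.g., static or quantization-aware training).

```python
import torch
import torch.nn as nn


class TinyRecognizerHead(nn.Module):
    """Stand-in for the recurrent/fully connected part of a text recognizer."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=97):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        y, _ = self.rnn(x)
        return self.fc(y)


if __name__ == "__main__":
    model = TinyRecognizerHead().eval()
    # Post-training dynamic quantization: weights are stored as int8 and
    # activations are quantized on the fly for Linear and LSTM modules.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear, nn.LSTM}, dtype=torch.qint8)
    x = torch.randn(1, 25, 512)     # (batch, sequence length, feature dim)
    print(quantized(x).shape)       # torch.Size([1, 25, 97])
```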
Furthermore, multi-task learning frameworks, such as FOTS [
21], demonstrate the benefits of sharing features between detection and recognition tasks, thereby improving overall system efficiency by eliminating redundant computations and enhancing feature utilization.
These optimization strategies collectively underscore the potential for careful system design to bridge the gap between achieving high accuracy and meeting the practical deployment requirements of scene text processing systems.
10. Advancements in Visual-Language Text Detection
Recent advancements in scene text processing have witnessed a significant paradigm shift towards integrating visual and linguistic modalities, leading to more robust and context-aware text detection models. This section discusses the notable progress in visual-language-based text detection methods and outlines potential directions for leveraging Large Language Models (LLMs) to enhance text detection performance.
10.1. Current Progress in Visual-Language Text Detection
Visual-language models, particularly those leveraging transformers, have demonstrated remarkable success in capturing both visual and semantic information from scene images. These models utilize self-attention mechanisms to establish connections between visual features and language embeddings, thereby improving text detection accuracy, especially in complex and ambiguous scenarios.
Multimodal Fusion: Recent approaches have focused on multimodal fusion strategies, where visual features extracted from CNNs are combined with linguistic features derived from pre-trained language models (PLMs) such as BERT [
78] or GPT [
16]. This fusion enables models to disambiguate visually similar characters or words by leveraging contextual knowledge from the language domain. For instance, the integration of BERT embeddings with visual features has been shown to significantly enhance text recognition performance in low-resolution or occluded text scenarios [
6].
Transformer-Based Architectures: Transformers have become the de facto standard for modeling sequential data, including scene text. Vision Transformers (ViTs) [
57] and their variants, such as Swin Transformers [
67], have been adapted for text detection tasks by processing images as sequences of patches. These models excel in capturing long-range dependencies and contextual information, which is crucial for accurately localizing and recognizing text in natural scenes.
End-to-End Text Spotting: End-to-end trainable models that jointly perform text detection and recognition have gained popularity due to their ability to optimize feature sharing between the two tasks. Methods like ABCNet [
1] and SwinTextSpotter [
59] leverage transformer architectures to achieve state-of-the-art performance on benchmark datasets. These models utilize Bezier curves or polygon representations to handle arbitrarily shaped text, demonstrating superior adaptability to real-world scenarios.
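To illustrate the Bezier-curve representation used by methods such as ABCNet, the sketch below samples points along a cubic Bezier curve defined by four control points, which is how each long side of a curved text boundary can be parameterized. The control-point values are hypothetical, and this is an illustration rather than code from the cited work.

```python
import numpy as np


def cubic_bezier(ctrl: np.ndarray, n: int = 20) -> np.ndarray:
    """Sample `n` points on a cubic Bezier curve given 4 control points (shape (4, 2))."""
    t = np.linspace(0.0, 1.0, n)[:, None]                # (n, 1)
    p0, p1, p2, p3 = ctrl
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)   # (n, 2) boundary points


if __name__ == "__main__":
    # Hypothetical control points for the upper edge of a curved word.
    upper = np.array([[0, 10], [30, 0], [70, 0], [100, 10]], dtype=float)
    pts = cubic_bezier(upper)
    print(pts[:3])   # densely sampled boundary points used for feature alignment
```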
10.2. Future Directions: Leveraging LLMs for Text Detection
The integration of Large Language Models (LLMs) presents a promising avenue for advancing text detection methods. LLMs, pre-trained on massive text corpora, possess rich linguistic knowledge that can be harnessed to improve text detection in several ways:
Contextual Understanding: LLMs can provide contextual priors that help disambiguate text instances in complex scenes. By incorporating language embeddings from LLMs into the text detection pipeline, models can better understand the semantic meaning of text, leading to more accurate localization and recognition.
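One simple way to realize such contextual priors, sketched below, is to re-rank candidate transcriptions by their likelihood under a pre-trained causal language model. GPT-2 is used here only as a small stand-in for a larger LLM, and the candidate strings are hypothetical recognizer outputs; production systems would integrate such priors more tightly into decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def lm_score(text: str, model, tokenizer) -> float:
    """Average negative log-likelihood of `text` under a causal LM (lower is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    # Hypothetical recognizer hypotheses for an ambiguous sign ("O" vs. "0").
    candidates = ["OPEN 24 HOURS", "0PEN 24 H0URS"]
    best = min(candidates, key=lambda c: lm_score(c, lm, tok))
    print(best)   # the candidate assigned the lower LM loss is preferred
```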
Few-Shot and Zero-Shot Learning: LLMs can facilitate few-shot and zero-shot learning scenarios, where models are required to generalize to new text classes or styles with minimal labeled data. By leveraging the language understanding capabilities of LLMs, text detection models can quickly adapt to unseen text domains, reducing the dependency on large-scale annotated datasets.
Multilingual Support: LLMs trained on multilingual corpora can significantly enhance the cross-lingual generalization capabilities of text detection models. This is particularly beneficial for applications involving multilingual text detection, where models need to handle diverse scripts and languages seamlessly.
In conclusion, the integration of visual and linguistic modalities, coupled with the power of LLMs, holds great promise for advancing scene text detection. Future research should focus on developing efficient multimodal fusion strategies, exploring the potential of LLMs in few-shot and zero-shot learning scenarios, and enhancing the cross-lingual capabilities of text detection models.
11. Challenges and Future Directions
11.1. Current Challenges
11.1.1. Handling Diverse Text Styles, Fonts, and Languages
Scene text detection and recognition systems encounter substantial difficulties when confronted with diverse text styles, fonts, and languages. Text in natural scenes exhibits a vast array of visual characteristics, encompassing varying fonts, sizes, colors, and orientations. This variability poses a significant challenge for models to generalize effectively across different datasets and real-world scenarios. For example, a model trained predominantly on Latin scripts may exhibit poor performance when applied to images containing Chinese or Arabic characters, which possess distinct structural features.
Furthermore, the presence of artistic fonts, decorative elements, and text overlays exacerbates the recognition complexity. These variations necessitate models with robust feature extraction capabilities and resilience to visual distortions. Current research endeavors are dedicated to developing more adaptable architectures capable of learning to represent and recognize text across a wide range of styles and languages. Transformer-based models [
59,
195] have shown promise in capturing long-range dependencies and contextual information, which is crucial for handling diverse text appearances.
11.1.2. Real-Time Performance and Efficiency
Real-time performance and efficiency are paramount for numerous practical applications of scene text detection and recognition, such as autonomous driving, augmented reality, and video surveillance. However, achieving high accuracy while maintaining real-time speed remains a formidable challenge. Many state-of-the-art methods, despite their accuracy, are computationally intensive and require powerful hardware for efficient operation.
To tackle this issue, researchers are exploring various strategies, including model compression, quantization, and the development of lightweight architectures. For instance, PAN++ [
186] proposes an efficient and accurate end-to-end text spotting framework by incorporating lightweight components, such as a feature enhancement network with stacked Feature Pyramid Enhancement Modules (FPEMs) and a lightweight detection head. Similarly, DeepSolo [
191] introduces a simple DETR-like baseline that utilizes a single transformer decoder with explicit points for text detection and recognition, achieving better training efficiency and competitive performance.
11.1.3. Robustness Against Adversarial Attacks and Noise
Robustness against adversarial attacks and noise is another critical challenge in scene text detection and recognition. Adversarial examples, deliberately crafted to deceive machine learning models, can severely degrade the performance of text spotting systems. These attacks can manifest in various forms, such as adding imperceptible perturbations to images or introducing visual noise that mimics text patterns.
To enhance the robustness of text spotting models, researchers are investigating adversarial training techniques, where models are trained on adversarial examples to improve their resilience to attacks. Additionally, the development of noise-robust feature extraction methods and the integration of multi-modal information (e.g., combining visual and linguistic cues) can help mitigate the impact of noise and adversarial perturbations. For example, MATRN [
165] introduces a multi-modal text recognition network that facilitates interactions between visual and semantic features, potentially improving the model’s ability to handle noisy inputs.
11.2. Future Directions
11.2.1. Integration of Multi-Modal Information
The integration of multi-modal information represents a highly promising avenue for advancing scene text detection and recognition. Traditional methods predominantly rely on visual features from RGB images, often overlooking rich contextual cues from other modalities. Recent advancements demonstrate that incorporating linguistic, semantic, and even auditory information can significantly enhance system robustness and accuracy. For instance, Vision-Language Pre-Training (VLP) models, as discussed in [
81], leverage large-scale text-image pair pre-training to learn joint visual-linguistic representations, enabling more accurate and context-aware text spotting. Similarly, semantic information from knowledge graphs or external databases, such as word embeddings or entity recognition results, can disambiguate visually similar characters or words, particularly in complex scenes with overlapping or occluded text, as explored in [
3].
Depth information emerges as another valuable modality, especially for text localization and recognition in 3D scenes. Depth maps, generated from stereo cameras or LiDAR sensors, help distinguish text from background clutter and improve detection robustness under challenging lighting conditions or occlusions. Additionally, while less explored, auditory information—such as speech or environmental sounds—offers unique complementary context. For example, audio descriptions in video-based applications could confirm text recognition results through speech synthesis or audio-visual alignment techniques, providing an additional layer of verification.
Future research should prioritize developing sophisticated fusion strategies to effectively leverage the complementary strengths of these modalities. Challenges include addressing data heterogeneity, computational complexity, and model interpretability. Early studies, such as those combining visual features with linguistic knowledge from pre-trained language models [
165,
190], provide promising foundations. By expanding these approaches to incorporate additional modalities and refining fusion mechanisms, the field can push the state-of-the-art in scene text understanding toward more accurate, context-aware, and robust systems.
11.2.2. Development of More Efficient and Lightweight Models
The growing demand for real-time and resource-efficient scene text detection and recognition systems, driven by the proliferation of mobile and embedded devices, underscores the critical need for developing lightweight and efficient deep learning models. While deep learning has achieved remarkable success in this field, the computational complexity and memory requirements of existing models often hinder their deployment on resource-constrained platforms. To address this challenge, researchers have explored two primary strategies: designing inherently lightweight architectures and applying model compression techniques.
Lightweight architectures, such as those leveraging depthwise separable convolutions [
197] or specialized attention mechanisms, aim to reduce parameter counts and computational overhead without sacrificing performance. For instance, the SVTR model [
26] introduced a single visual model for scene text recognition, eliminating sequential modeling to enhance efficiency. Similarly, [
145] proposed a multi-scale detection method using attention feature extraction and cascade feature fusion, achieving high accuracy while maintaining a lightweight profile through improved modules like DSAF and PFFM. These approaches demonstrate the potential of architecture-level optimizations for resource-efficient deployment.
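For reference, the sketch below shows the depthwise separable convolution pattern underlying such lightweight backbones, contrasting its parameter count with a standard 3x3 convolution. This is a generic building block rather than the exact module from [197] or SVTR.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """A 3x3 convolution factored into a per-channel (depthwise) 3x3 conv
    followed by a 1x1 (pointwise) conv, cutting parameters and FLOPs."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 128)            # e.g., a text-line feature map
    standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
    separable = DepthwiseSeparableConv(64, 128)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard), count(separable))   # 73728 vs. roughly 9k parameters
```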
Model compression techniques, including pruning, quantization, and knowledge distillation, offer complementary avenues for reducing model size and computational demands. Pruning removes redundant connections, quantization lowers weight precision, and knowledge distillation transfers knowledge from larger teacher models to smaller student models, enabling comparable performance with fewer parameters. Additionally, adaptive and dynamic inference mechanisms, such as adjusting model complexity based on input difficulty or dynamically selecting relevant features, further enhance efficiency. Future research should focus on integrating these strategies—combining lightweight architectures with compression techniques and dynamic inference—to develop scene text detection and recognition systems that are both efficient and deployable on resource-constrained devices.
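The knowledge distillation component mentioned above can be summarized by the standard softened-KL objective. The sketch below, assuming per-character logits from a teacher and a student recognizer, is illustrative only; the temperature and weighting values are arbitrary choices rather than settings from a specific paper.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 4.0, alpha: float = 0.5):
    """alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(student || teacher)."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl


if __name__ == "__main__":
    # 32 character positions over a 97-symbol alphabet (hypothetical sizes).
    student = torch.randn(32, 97, requires_grad=True)
    teacher = torch.randn(32, 97)
    labels = torch.randint(0, 97, (32,))
    loss = distillation_loss(student, teacher, labels)
    loss.backward()                     # gradients flow only into the student logits
    print(loss.item())
```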
11.2.3. Exploitation of Unsupervised and Semi-Supervised Learning Techniques
The success of deep learning models in scene text detection and recognition hinges on large-scale annotated datasets, yet collecting and labeling such data is time-consuming, labor-intensive, and costly. To mitigate this dependency, unsupervised and semi-supervised learning techniques offer promising alternatives. Unsupervised methods, such as self-supervised and contrastive learning, leverage unlabeled data to learn robust representations, which can then be fine-tuned on smaller labeled datasets for competitive performance [
6]. Semi-supervised approaches combine limited labeled data with abundant unlabeled data to enhance model generalization, reducing the need for extensive annotations. These strategies are critical for advancing scene text understanding while minimizing reliance on costly labeled datasets.
Simultaneously, the deployment of scene text systems on resource-constrained devices—such as mobile phones, embedded systems, and drones—demands efficient and lightweight models. Recent advancements focus on specialized neural architectures, such as depthwise separable convolutions [
197], which reduce computational costs while maintaining accuracy. Techniques like model pruning [
68], quantization [
198], and knowledge distillation [
199] further optimize inference speed and model size, enabling complex models to run on edge devices with limited resources. Additionally, lightweight attention mechanisms [
182] and hardware-aware neural architecture search (NAS) [
200] are being explored to balance feature extraction capabilities with computational efficiency, ensuring real-time performance across diverse platforms.
Future research should integrate these dual objectives: reducing annotation dependency through unsupervised/semi-supervised learning and enhancing model efficiency via architecture optimization and hardware-aware design. For instance, combining self-supervised pre-training with NAS could yield models that generalize well with minimal labeled data while being tailored for specific hardware constraints. Such synergies will enable the deployment of advanced scene text systems on resource-constrained devices, fostering ubiquitous and real-time text recognition capabilities.
12. Conclusions
The field of scene text detection and recognition has witnessed transformative progress through deep learning innovations. This systematic review synthesizes methodological advancements across detection paradigms, recognition architectures, and emerging challenges, while proposing actionable research trajectories for next-generation systems.
Methodological Taxonomy in Text Detection: In the methodological landscape of scene text detection, three primary paradigms have evolved with distinct performance profiles and computational trade-offs. Two-stage frameworks, exemplified by Faster R-CNN derivatives, establish state-of-the-art accuracy for small and densely packed text instances through iterative region proposal refinement, though this precision comes at the cost of 2–3× greater computational overhead compared to single-stage alternatives. Conversely, YOLO/SSD-inspired single-stage detectors achieve real-time throughput (30–50 FPS) by integrating localization and classification in a unified network, but exhibit 8–12% relative performance degradation on irregularly shaped text due to their reliance on axis-aligned bounding box representations. To address these geometric limitations, parametric modeling approaches like TextSnake and ABCNet employ spline curves or pixel-level segmentation masks to capture arbitrary text contours, demonstrating 92–95% precision on curved text benchmarks—albeit requiring specialized post-processing to maintain character-level structural integrity. This taxonomy highlights an ongoing tension between detection accuracy, computational efficiency, and geometric adaptability in contemporary scene text systems.
Recognition Advances and Persistent Challenges: Transformer-based recognizers, including ViTSTR and TRBA, have achieved state-of-the-art performance on mainstream benchmarks such as ICDAR2015 and Total-Text, demonstrating 3–5% accuracy improvements over CNN-RNN hybrids particularly for curved text recognition. However, three critical challenges remain unresolved: First, the domain adaptation gap persists, with synthetic-to-real accuracy drops of 15–20% attributed to mismatches in typographic styles and illumination conditions; this necessitates physically-based rendering techniques, unsupervised feature alignment methods, and curriculum learning frameworks to enhance cross-domain generalization. Second, cross-script generalization exhibits 12–18% performance disparities on multilingual benchmarks like RRC-MLT-2019, arising from script-specific structural complexities and low-resource data scarcity; future research should prioritize graph-based character representation learning and meta-learning approaches for effective cross-lingual transfer. Third, edge deployment constraints significantly hinder practical applications, as high-accuracy models like Mask TextSpotter++ (85–120 ms inference on V100) experience 5–8× slower processing on mobile devices; hardware-aware neural architecture search, 4-bit quantization, and dynamic early exiting mechanisms represent promising optimization directions to balance accuracy and efficiency. These challenges collectively underscore the need for holistic solutions that address both algorithmic robustness and computational practicality in scene text recognition systems.
Convergent Research Directions: Three interdisciplinary research trajectories are emerging as transformative pathways for advancing scene text understanding through cross-domain synergy: First, cognitive-inspired multimodal fusion leverages pre-trained language models (BERT/GPT) to inject contextual knowledge into visual processing pipelines, effectively resolving character ambiguities (e.g., “0” vs. “O”) through cross-modal alignment mechanisms that bridge semantic gaps between visual patterns and linguistic representations. Second, edge-AI co-design integrates hardware-aware neural architecture search (NAS) with knowledge distillation to optimize models for specific processors (e.g., ARM Cortex-A78, Apple M1), where progressive shrinking strategies enable real-time inference (<50 ms/image) on commodity devices while maintaining >92% accuracy on Total-Text—though challenges persist in gradient-preserving quantization and hardware-specific pruning methodologies. Third, continual adaptation draws inspiration from neurobiological mechanisms (e.g., elastic weight consolidation, generative replay) to support incremental learning of new domains (e.g., handwritten prescriptions) without catastrophic forgetting; hybrid approaches combining parameter isolation and memory replay demonstrate 89% retention of original RRC-MLT-2019 accuracy after domain shifts, significantly outperforming standard fine-tuning (62%). These trajectories collectively highlight the potential of interdisciplinary convergence—merging cognitive science, hardware engineering, and neuroscience—to create robust, adaptable, and efficient scene text understanding systems.
Beyond algorithmic innovations, the recently introduced Union14M dataset—comprising 14 million annotated instances spanning 120 languages—provides an unprecedented training corpus to scale model generalization capabilities. Integrating insights from cognitive science reveals three complementary pathways toward achieving human-like text comprehension: First, cross-modal attention mechanisms that emulate multisensory integration in biological systems can resolve visual-linguistic ambiguities (e.g., distinguishing “B” from “8” in noisy contexts) by dynamically aligning perceptual features with semantic constraints. Second, neuro-symbolic architectures that fuse connectionist recognition networks with symbolic reasoning modules offer explainable decision pathways, enabling traceable inference chains for complex text structures like mathematical expressions or legal documents. Third, dynamic knowledge graphs updated through entity linking during inference support contextual disambiguation, allowing models to resolve homographs (e.g., “Apple” as fruit vs. company) by leveraging world knowledge from multilingual corpora. These interdisciplinary advances collectively represent a paradigm shift from purely statistical pattern matching to semantically grounded understanding, equipping systems to handle the compositional complexity and linguistic diversity inherent in real-world scene text applications.