This section outlines the experimental setup, including datasets, preprocessing, training/evaluation protocols, and metrics. We then present results and compare the proposed method with state-of-the-art (SOTA) approaches using these metrics and computational cost indicators. Ablation studies are conducted to assess the impact of training permutations on test accuracy. Finally, case studies illustrate the model’s specific performance.
4.1. Datasets
In this paper, we evaluate our EDIM method on multiple benchmark test sets, which include the following: 1. Six commonly used regular and irregular test sets: ICDAR 2013 (IC13 [
19]), Street View Text (SVT [
20]), IIIT5K-Words (IIIT5K [
21]), ICDAR 2015 ([
22]), Street View Text-Perspective (SVTP [
23]), and CUTE80 [
24]. IC13 and IC15 each have two versions—one containing IC13 857 and IC15 1811, and another containing IC13 1015 and IC15 2077. We chose the former versions for this work. 2. The latest Union14M-L [
25] benchmark (U14M) test set, which contains seven challenging subsets: Curved Text (Curve), Multi-Oriented Text (MO), Artistic Text (Artistic), Background-Free Text (Cless), Salient Text (Salient), Multi-Word Text (MW), and General Text (General).
During the training phase, we trained our model on real-world datasets due to the increase in real-world scene datasets, enabling the model to exhibit stronger recognition capability. We mainly trained it on two large-scale real-world training sets: the Real dataset [
5] and the Union14M-Filterr [
26] dataset.
4.3. Comparative Analysis with Existing Methods
Since the training sets in this paper include the Real training [
5] set and the Union14M-Filter [
26] training set, we conduct training on both datasets separately and then evaluate the relevant metrics. The results for these two training sets are presented in distinct tables to ensure clarity:
Training on the Real training set:
Table 1 compares our method with state-of-the-art approaches on six standard English benchmark test sets.
Training on the Union14M-Filter training set:
Table 2 compares the six standard test sets and seven challenging subsets from the Union14M-L benchmark.
This separate presentation allows for a comprehensive evaluation under different training data conditions and ensures transparency in our experimental design.
First, we train our model on the Real training set and compare it with state-of-the-art methods on English datasets, such as PARSeq [
5] and MGP [
6], as well as some classic methods on English datasets like ViTSTR [
7], ABINet [
16], and SRN [
27]. The results are shown in
Table 1.
Table 1 summarizes the experimental results on several widely used English benchmark datasets: IIIT5K, SVT, IC13, IC15, SVTP, and CUTE80. Since our model is trained on the Real training set, we compare it only with methods that are also trained exclusively on the same dataset. As shown in
Table 1, our method, EDIM, achieves strong performance across all six standard benchmarks, particularly on the two irregular text datasets—SVTP and CUTE80. This indicates that our MSFA module significantly enhances the encoder’s feature extraction capability, enabling superior performance on challenging, non-horizontal text instances. Moreover, EDIM achieves competitive accuracy with relatively low computational cost, as evidenced by its favorable FLOPs compared to other state-of-the-art methods.
Furthermore, we conduct additional experiments using Union14M-Filter [
26] from SVTRv2 [
26] as the second training set, a real-world dataset derived from Union14M-L. Due to potential data overlap between the Real set and Union14M-L, which may lead to data leakage and compromise the fairness of comparison, we adopt the filtered version—Union14M-Filter—to ensure a more rigorous evaluation. The corresponding results are presented in
Table 2.
Table 2.
Models trained on the Union14M-Filter training set, tested on six commonly used regular and irregular test sets and seven challenging test subsets, and compared with other existing methods.
Table 2.
Models trained on the Union14M-Filter training set, tested on six commonly used regular and irregular test sets and seven challenging test subsets, and compared with other existing methods.
| Method | Venue | Common Benchmarks | Union14M Benchmarks | Size/FPS |
|---|
|
IIIT5K
|
SVT
|
IC13
|
IC15
|
SVTP
|
CUTE80
|
Curve
|
Multi-O
|
Artistic
|
Context
|
Salient
|
Multi-W
|
General
|
Avg
|
Size (M)
|
FPS
|
|---|
| ASTER [28] | TPAMI19 | 96.1 | 93.0 | 94.9 | 86.1 | 87.9 | 92.0 | 70.9 | 82.2 | 56.7 | 62.9 | 73.9 | 58.5 | 76.3 | 68.77 | 19.0 | 67.1 |
| MORAN [29] | PR19 | 96.7 | 91.7 | 94.6 | 84.6 | 85.7 | 90.3 | 51.2 | 15.5 | 51.3 | 61.2 | 43.2 | 64.1 | 69.3 | 50.83 | 17.4 | 59.5 |
| AutoSTR [30] | ECCV20 | 96.8 | 92.4 | 95.7 | 86.6 | 88.2 | 93.4 | 72.1 | 81.7 | 56.7 | 64.8 | 75.4 | 64.0 | 75.9 | 70.09 | 6.0 | 82.6 |
| RoScanner [31] | ECCV20 | 98.5 | 95.8 | 97.7 | 88.2 | 90.1 | 97.6 | 79.4 | 68.1 | 70.5 | 79.6 | 71.6 | 82.5 | 80.8 | 76.07 | 48.0 | 64.1 |
| ABINet [16] | CVPR21 | 98.5 | 98.1 | 97.7 | 90.1 | 94.1 | 96.5 | 80.4 | 69.0 | 71.7 | 74.7 | 77.6 | 76.8 | 79.8 | 75.71 | 36.9 | 73.0 |
| PARSeq [5] | ECCV22 | 98.9 | 98.1 | 98.4 | 90.1 | 94.3 | 98.6 | 87.6 | 88.8 | 76.5 | 83.4 | 84.4 | 84.8 | 84.3 | 84.26 | 23.8 | 52.6 |
| MATRN [32] | ECCV22 | 98.8 | 98.3 | 97.9 | 90.3 | 95.2 | 97.2 | 82.2 | 73.0 | 73.4 | 76.9 | 79.4 | 77.4 | 81.0 | 77.61 | 44.3 | 46.9 |
| MGP-STR [6] | ECCV22 | 97.9 | 97.8 | 97.1 | 89.6 | 95.2 | 96.9 | 85.2 | 83.7 | 72.6 | 75.1 | 79.8 | 71.1 | 83.1 | 78.65 | 148 | 120 |
| CPPD [33] | Preprint | 99.0 | 97.8 | 98.2 | 90.4 | 94.0 | 99.0 | 88.8 | 78.7 | 76.5 | 92.8 | 85.3 | 81.9 | 83.5 | 81.93 | 27.0 | 125 |
| LPV [34] | IJCAI23 | 98.6 | 97.8 | 98.1 | 89.8 | 93.7 | 97.6 | 86.2 | 78.7 | 75.8 | 80.2 | 82.9 | 81.6 | 82.9 | 81.18 | 30.5 | 82.6 |
| MAERec [25] | ICCV23 | 99.2 | 97.8 | 98.2 | 90.4 | 94.3 | 98.3 | 89.1 | 87.1 | 79.0 | 84.2 | 86.3 | 85.9 | 84.6 | 85.17 | 35.7 | 17.1 |
| LISTER [35] | ICCV23 | 98.8 | 97.5 | 98.6 | 90.0 | 94.4 | 96.9 | 78.7 | 68.8 | 73.7 | 81.6 | 74.8 | 82.4 | 83.5 | 77.64 | 51.1 | 44.6 |
| CDistNet [15] | IJCV24 | 98.7 | 97.1 | 97.8 | 89.6 | 94.0 | 95.9 | 81.7 | 71.2 | 72.6 | 78.2 | 79.9 | 79.7 | 81.1 | 77.77 | 43.3 | 15.9 |
| BUSNet [36] | AAAI24 | 98.3 | 98.1 | 97.8 | 90.2 | 95.3 | 96.5 | 83.0 | 82.3 | 70.8 | 77.9 | 78.8 | 71.2 | 82.6 | 78.09 | 32.1 | 83.3 |
| OTE [37] | CVPR24 | 98.6 | 96.6 | 98.0 | 90.1 | 94.0 | 97.2 | 86.0 | 75.8 | 74.6 | 74.7 | 81.0 | 65.3 | 82.3 | 77.10 | 20.3 | 55.2 |
| SVTRv2 [26] | ICCV25 | 99.2 | 98.0 | 98.7 | 91.1 | 93.5 | 99.0 | 90.6 | 89.0 | 79.3 | 86.1 | 86.2 | 86.7 | 85,1 | 86.14 | 19.8 | 143 |
| EDIM (Ours) | – | 99.0 | 99.0 | 98.6 | 89.7 | 96.6 | 97.9 | 89.4 | 91.6 | 78.7 | 85.5 | 86.2 | 86.3 | 85.9 | 86.23 | 23.8 | 14.8 |
Observing the data in
Table 2, we see that our method continues to demonstrate strong performance. On the six commonly used regular and irregular test sets, SVTP [
23] surpasses existing methods by an average of approximately two percentage points, while on SVT [
20], it exceeds others by about one point on average. On IIIT5K [
21], IC13 [
19], IC15 [
22], and CUTE80 [
24], our results are comparable to other state-of-the-art approaches. Furthermore, our method also performs well on the seven challenging test subsets. On the Multi-Oriented Text (MO) subset, it outperforms other methods by about one percentage point on average, while achieving broadly comparable results to other strong approaches across the remaining six challenging subsets. Compared with the latest SVTRv2 (2025), our approach achieves higher accuracy—1% on SVT [
20], 3% on SVTP [
23], and 2.6% on the Multi-Oriented Text subset—while maintaining comparable results on the other test datasets. In addition, our model contains fewer parameters, providing better efficiency and deployment advantages.
To enhance the interpretability of the SeqEDCF mechanism, we conduct a qualitative analysis of its adaptive behavior under various challenging scenarios. In cases of low-resolution or heavily occluded text, the gating mechanism consistently assigns high weights to contextual information from the decoder state (gate values > 0.7), leveraging linguistic context to resolve visual ambiguities. In contrast, for clear and well-aligned text, the mechanism shifts focus toward visual features from the encoder (gate values < 0.3), ensuring precise character-level localization. In curved text recognition, SeqEDCF maintains consistent attention along the text trajectory throughout the decoding process, effectively modeling spatial progression along nonlinear paths. For semantically ambiguous cases—such as visually similar characters (‘c’ and ‘e’) under poor illumination—the mechanism achieves balanced gate values (approximately 0.4–0.6), allowing complementary cues from both modalities to jointly contribute to the final decision. These observations demonstrate that SeqEDCF facilitates a dynamic and context-sensitive dialogue between the encoder and decoder throughout the decoding process.
Comprehensive Metric Analysis for Imbalanced Data
To address the challenge of evaluating model performance under imbalanced data distributions, we conduct a detailed analysis using complementary metrics beyond overall accuracy. This analysis provides deeper insights into the model’s behavior across diverse data characteristics and subset sizes. We employed the Union14M-L dataset for training.
Analysis on Highly Imbalanced Union14M-L Dataset: As shown in
Table 3, the Union14M-L dataset exhibits significant class imbalance, with the General subset containing 387,287 samples while other challenging subsets range from only 779 to 2426 samples. Despite this imbalance, EDIM demonstrates remarkable consistency across all metrics:
The Artistic subset (898 samples), while achieving 78.73% accuracy, maintains a high 1 - NED of 93.95%, indicating that most recognition errors are minor character-level deviations rather than complete failures.
On the Multi-Oriented subset (1369 samples), EDIM achieves excellent performance across all metrics (accuracy: 91.60%, 1 - NED: 97.22%, confidence: 92.19%), demonstrating robust handling of oriented text despite limited training data.
The Multi-Word subset shows the highest 1 - NED (98.02%), suggesting exceptional character-level accuracy in recognizing longer text sequences.
Confidence scores remain stable across all subsets (83.26–92.19%), indicating consistent model calibration regardless of subset size or difficulty level.
Analysis on Standard Benchmarks:
Table 4 presents results on balanced benchmark datasets. The close alignment between accuracy (96.22%) and 1 - NED (98.76%) demonstrates that EDIM’s predictions are character-wise accurate even when not perfectly matching the ground truth. The high confidence scores (95.77%) further validate the model’s well-calibrated uncertainty estimation.
Cross-Dataset Consistency: The complementary metrics reveal EDIM’s consistent performance patterns across both imbalanced and balanced datasets. The minimal gaps between accuracy and 1 - NED scores across all evaluations confirm that the model’s advantages are genuine and not artifacts of specific data distributions. This comprehensive analysis substantiates EDIM’s robustness and reliability for real-world deployment scenarios involving naturally imbalanced data.
4.4. Analysis of Computational Efficiency
Beyond recognition accuracy, we also evaluated the computational efficiency of our EDIM. As summarized in
Table 2, our model contains 25.8 million parameters and achieves an inference speed of 14.8 FPS. EDIM strikes a favorable balance between performance and efficiency compared to other state-of-the-art methods. For instance, while achieving superior or comparable accuracy on irregular text benchmarks (SVTP [
23] and CUTE80 [
24]), our model is significantly more efficient than methods that rely on extensive pre-trained vision–language models (e.g., CLIP4STR [
8], TrOCR [
11]), which often have parameter counts an order of magnitude higher. The lightweight design of our core components—the MSFA module for efficient multi-scale feature extraction and the SeqEDCF mechanism for dynamic feature fusion without recurrent connections—contributes to this manageable computational burden. This makes EDIM accurate and more practical for real-world deployment scenarios with limited computational resources.
To directly address the efficiency in processing a batch of images, we estimate the total processing time for 100 images. Our EDIM, with an inference speed of 14.8 FPS, would require approximately 6.76 s (100 images/14.8 FPS ≈ 6.76 s). This demonstrates a highly competitive throughput. In contrast, methods with lower FPS (e.g., around 10 FPS) would require about 10 s for the same task. In comparison, those with higher FPS might trade off recognition accuracy, especially on challenging irregular text datasets, as shown in
Table 2. The performance of our model on such a batch would be fast and accurate, maintaining the high recognition rates reported in
Table 1 and
Table 2 across both regular and irregular text. This balance is crucial for practical applications where speed and accuracy are paramount.
The comprehensive metrics presented in
Table 3 and
Table 4 provide additional validation of EDIM’s robustness. The strong performance across all three metrics–particularly the high 1 - NED scores–confirms that our model’s advantages extend beyond simple sequence matching to include superior character-level recognition capability. This is especially evident in challenging scenarios like curved and multi-oriented text, where EDIM maintains high 1 - NED scores despite the inherent difficulties of these categories.
Furthermore, the well-calibrated confidence scores across all subsets indicate that EDIM provides reliable uncertainty estimation, making it suitable for real-world applications where understanding prediction certainty is crucial for downstream decision-making.
4.5. Comprehensive Ablation Studies
To thoroughly validate the effectiveness of our proposed EDIM framework, we conducted extensive ablation studies focusing on three key aspects: (1) the individual and combined contributions of MSFA and SeqEDCF modules, (2) analysis of different dilation rate configurations in MSFA, and (3) direct comparison between SeqEDCF and standard attention mechanisms.
4.5.1. Component-Wise Analysis
The component-wise analysis reveals several key insights in
Table 5:
Individual Contributions: MSFA and SeqEDCF significantly improve performance when used independently. MSFA contributes primarily through enhanced multi-scale feature representation, while SeqEDCF improves semantic modeling and contextual consistency.
Synergistic Effect: The combination of MSFA and SeqEDCF yields the highest performance gains (+0.30%), demonstrating that the enhanced visual features from MSFA are particularly well-utilized by the sophisticated fusion mechanism in SeqEDCF.
Challenging Scenarios: The performance improvements are more pronounced on the seven challenging datasets, indicating that our modules are particularly effective for complex text recognition scenarios involving irregular shapes, distortions, and complex backgrounds.
4.5.2. MSFA Dilation Rate Analysis
The performance advantage of the MSFA module can be attributed to the statistical characteristics of text scale distribution: its multi-scale design directly addresses the extreme scale variations of text instances in natural scenes. Meanwhile, the improvement of the SeqEDCF mechanism stems from its dynamic gating fusion capability, which enables the model to adaptively adjust the contribution weights of visual and semantic information based on input complexity—a behavior validated through our qualitative analysis in
Table 6.
Note: Dilation Set A: [1, 2, 3, 4]; Set B: [1, 3, 5, 7]; Set C: [1, 6, 12, 18]; Set D: [6, 12, 18, 24]; Set E: [1, 12, 24, 36].
The dilation rate analysis provides crucial insights into multi-scale feature design:
Small Rates (Set A): Effective for local detail preservation but insufficient for capturing broader contextual information, particularly limiting performance on datasets with significant scale variations.
Moderate Rates (Set C): Our selected configuration achieves optimal balance, with progressive scaling from character-level (rate 1) to word-level (rate 6) and scene-level (rates 12–18) context capture.
Large Rates (Set E): Capture extensive contextual information but sacrifice local detail precision, particularly affecting performance on small or distorted text instances.
The chosen configuration [1, 6, 12, 18] explicitly addresses the multi-scale nature of scene text, where individual characters, words, and text blocks coexist within the same image while maintaining computational efficiency.
4.5.3. SeqEDCF vs. Standard Attention Mechanisms
The attention mechanism comparison reveals distinct advantages of our SeqEDCF approach in
Table 7:
Standard Cross-Attention: It provides basic encoder–decoder interaction but lacks adaptive capabilities, particularly struggling with irregular text (SVTP: 92.3%, CUTE80: 96.1%).
Enhanced Variants: Multi-head and residual variants offer incremental improvements but maintain the fundamental limitation of static context transfer.
SeqEDCF Advantages: The gating mechanism enables dynamic, context-aware feature fusion throughout decoding, achieving substantial improvements on challenging datasets (SVTP: +3.6%, CUTE80: +2.0% over standard cross-attention).
4.5.4. Computational Efficiency and Practical Considerations
The computational analysis demonstrates that our EDIM framework achieves an excellent balance between performance and efficiency in
Table 8:
Reasonable Overhead: The complete EDIM introduces approximately 20% additional parameters while maintaining competitive inference speed (14.8 FPS).
Performance-9Efficiency Trade-off: The 2.21% accuracy gain justifies the computational cost, particularly for applications requiring high recognition accuracy on challenging text.
Practical Viability: The maintained FPS rate ensures real-time applicability in practical deployment scenarios such as autonomous driving and document digitization.
In summary, our comprehensive ablation studies validate both proposed modules’ effectiveness and synergistic combination. The MSFA module significantly enhances multi-scale feature representation through optimized dilation rate selection, while SeqEDCF provides sophisticated dynamic fusion capabilities that substantially outperform standard attention mechanisms. The combined EDIM framework achieves state-of-the-art performance while maintaining practical computational efficiency.
4.6. Case Study
By conducting recognition experiments on irregular, handwritten, curved, occluded, distorted, and perspective text images across four models (CRNN [
2], ViT [
7], CDistNet [
15], and our method), we evaluated the accuracy differences between our model and other approaches, as demonstrated in
Table 9.
Table 9 reveals that while the conventional CRNN method struggles to recognize text in complex scenarios accurately, ViT significantly improves recognition performance through its superior encoder–decoder architecture. Our model achieves outstanding accuracy comparable to state-of-the-art CDistNet [
15] when processing irregularly shaped, handwritten, curved, occluded, and distorted images. This indicates that our MSFA (Multi-Scale Feature Aggregation) module effectively assists the model in extracting corresponding features from images, enabling precise recognition of such challenging text. Furthermore, the SEQEDCF (Semantic-Enhanced Query-Based Dynamic Context Fusion) module associates global semantic information, allowing the model to fully leverage semantic cues for more accurate recognition of occluded images.
4.7. Cross-Lingual Applicability Analysis
To evaluate the generalization capability of EDIM in multi-lingual scenarios, we conducted text recognition experiments on Uyghur-language images in addition to English. The study employed a self-constructed Uyghur dataset collected from various real-world environments in Urumqi, Xinjiang, China. Outdoor scenarios included bus stops, road signs, storefront signage, banners, product packaging, and hospital boards, while indoor environments encompassed printed notices, exhibition labels, and shopping mall directories. The dataset was rigorously partitioned into training and validation sets at an 8:2 ratio, with the validation set simultaneously serving as the test set.
To address data scarcity limitations, we implemented extensive augmentation techniques—including geometric transformations, imaging perturbations, color adjustments, and grid distortion—generating 1.76 million synthetic images, of which 1.31 million were retained for training. This comprehensive dataset authentically captures characteristic challenges such as typographic diversity, character ligatures, complex backgrounds, and low-resolution conditions, thereby establishing a valuable benchmark for low-resource scene text recognition.
Furthermore, to accommodate the distinctive character set of the Uyghur script, we replaced the original 94-character English vocabulary (comprising 0–9, a–z, A–Z, and 32 special characters) with 32 Uyghur alphabetic letters and appended an additional special token <end>. The input image size was standardized to 32 × 128 pixels, with a maximum sequence length set to 25 tokens. All experiments were conducted on a single Tesla V100 GPU, utilizing mixed-precision training through PyTorch Lightning. The model was optimized using AdamW with a OneCycleLR learning rate scheduler.
As demonstrated in
Table 10, our model achieves competitive performance in the Uyghur comparative study, with an accuracy margin of approximately one percentage point compared to the most recent baseline models. When evaluated against 2024 benchmarks, our method shows improvements of 2–3 percentage points, while significantly outperforming earlier approaches by considerable margins. These results substantiate that our approach maintains consistent effectiveness not only for English text recognition but also for the morphologically distinct Uyghur script, demonstrating its cross-lingual applicability. This robust performance across linguistically diverse writing systems suggests strong potential for future extension to other languages.