Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution
Abstract
1. Introduction
- This study proposes a dual-branch feature aggregation strategy that integrates independently cropped single-character image features with their corresponding character probability sequences. This keeps the high-level prior focused on individual character structures, effectively mitigating interference from complex backgrounds as well as from densely distributed neighboring characters.
- To leverage the complementary receptive fields of convolutional kernels at different scales, an improved inception module is introduced in the shallow layers for dynamic multi-scale feature extraction. By dynamically weighting the multi-scale convolutional kernels for each input, global overview features and fine-grained features are adaptively balanced, enriching the feature representation of the salient visual content.
- Leveraging adaptive normalization to learn cross-domain mappings, a color correction operation adaptively adjusts the mean and standard deviation of the target image's pixels. This enhances super-resolution quality without altering the original image content. Experiments on the public TextZoom dataset show the superiority of the proposed model over existing baselines: average recognition accuracy on the test sets with the CRNN, MORAN, and ASTER recognizers improves by 1%, 1.5%, and 0.9%, respectively.
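The dual-branch aggregation in the first point can be sketched as standard scaled dot-product cross-attention, with the cropped single-character image features acting as queries over the character-probability embeddings. All shapes, names, and the random inputs below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (Nq, Nk) similarity matrix
    return softmax(scores, axis=-1) @ values    # (Nq, d_v) fused features

# Hypothetical setup: 8 cropped single-character feature vectors attend to
# an 8-step character-probability embedding sequence of the same width.
rng = np.random.default_rng(0)
char_img_feats = rng.standard_normal((8, 32))   # queries: visual branch
prob_embeds = rng.standard_normal((8, 32))      # keys/values: prior branch
fused = cross_attention(char_img_feats, prob_embeds, prob_embeds)
assert fused.shape == (8, 32)
```

Because each query row attends with weights that sum to one, every fused character feature is a convex combination of the prior embeddings, which is what localizes the prior to individual characters.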
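The dynamic inception idea in the second point can be illustrated as parallel branches with different receptive fields whose outputs are fused by input-dependent softmax weights. The naive single-channel convolution, the mean-filter kernels, and the toy gating below are simplifying assumptions for illustration only:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D convolution for a single-channel map."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def dynamic_inception(x, kernels, gate_logits):
    """Fuse multi-scale branches with softmax weights derived per input
    (the 'dynamic' part of dynamic multi-scale feature extraction)."""
    w = np.exp(gate_logits - gate_logits.max())
    w = w / w.sum()
    branches = [conv2d_same(x, k) for k in kernels]
    return sum(wi * b for wi, b in zip(w, branches))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))                        # toy feature map
kernels = [np.ones((s, s)) / (s * s) for s in (1, 5, 9)]  # stand-in filters
gate = x.mean() * np.array([1.0, 0.5, -0.5])              # toy pooled gating
y = dynamic_inception(x, kernels, gate)
assert y.shape == x.shape
```

In a trained network the gating would come from a learned projection of globally pooled features, so each input re-weights coarse versus fine receptive fields.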
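The color correction in the third point, adjusting the mean and standard deviation of the target image's pixels, can be sketched as an adaptive-instance-normalization-style statistic transfer. The per-channel formulation and the random images are assumptions for illustration:

```python
import numpy as np

def adaptive_color_correction(sr, ref, eps=1e-6):
    """Match per-channel mean/std of the SR output to a reference image,
    leaving spatial content unchanged (only first/second-order statistics)."""
    sr = sr.astype(float)
    ref = ref.astype(float)
    mu_s, std_s = sr.mean(axis=(0, 1)), sr.std(axis=(0, 1))
    mu_r, std_r = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))
    return (sr - mu_s) / (std_s + eps) * std_r + mu_r

rng = np.random.default_rng(0)
sr = rng.uniform(0, 1, (32, 128, 3))      # toy super-resolved image
ref = rng.uniform(0.2, 0.8, (32, 128, 3))  # toy color reference
out = adaptive_color_correction(sr, ref)
# After correction, each channel's mean matches the reference exactly.
assert np.allclose(out.mean(axis=(0, 1)), ref.mean(axis=(0, 1)))
```

Since the transform is affine per channel, edges and strokes are preserved while the overall color distribution shifts toward the reference domain.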
2. Related Works
2.1. Image Super-Resolution
2.2. Scene Text Recognition for STISR
2.3. Scene Text Super-Resolution
3. The Proposed Network Architecture
3.1. Image Preprocessing
3.1.1. Dynamic Inception Feature Extraction
3.1.2. Single-Character Boundary Detection
3.1.3. Text Recognizer
3.2. Dual-Branch Feature Aggregation
3.3. Reconstruction Module
3.4. Loss Function
4. Experimental Results and Discussion
4.1. Dataset and Experimental Details
4.2. Ablation Experiment
4.2.1. The Role of Dual-Branch Feature Aggregation
4.2.2. The Role of Dynamic Inception Feature Extraction
4.2.3. Validity of the CCB Module
4.2.4. Effectiveness and Efficiency of Different Components
4.3. Comparison with State-of-the-Art Results
4.3.1. TextZoom Quantitative Research
4.3.2. TextZoom Qualitative Research
4.3.3. Quantitative Research of Text Recognition Datasets
4.3.4. Research on Densely Connected Datasets
4.3.5. Robustness Test
4.3.6. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Liu, B.; Chen, K.; Peng, S.-L.; Zhao, M. Adaptive Aggregate Stereo Matching Network with Depth Map Super-Resolution. Sensors 2022, 22, 4548.
- Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304.
- Luo, C.; Jin, L.; Sun, Z. MORAN: A multi-object rectified attention network for scene text recognition. Pattern Recognit. 2019, 90, 109–118.
- Sheng, F.; Chen, Z.; Mei, T.; Xu, B. A single-shot oriented scene text detector with learnable anchors. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1516–1521.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- Wang, W.; Xie, E.; Sun, P.; Wang, W.; Tian, L.; Shen, C.; Luo, P. TextSR: Content-aware text super-resolution guided by recognition. arXiv 2019, arXiv:1909.07113.
- Wang, W.; Xie, E.; Liu, X.; Wang, W.; Liang, D.; Shen, C.; Bai, X. Scene text image super-resolution in the wild. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X; pp. 650–666.
- Chen, J.; Li, B.; Xue, X. Scene text telescope: Text-focused scene image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 12026–12035.
- Ma, J.; Guo, S.; Zhang, L. Text prior guided scene text image super-resolution. IEEE Trans. Image Process. 2023, 32, 1341–1353.
- Ma, J.; Liang, Z.; Zhang, L. A text attention network for spatial deformation robust scene text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5911–5920.
- Chen, J.; Yu, H.; Ma, J.; Li, B.; Xue, X. Text Gestalt: Stroke-aware scene text image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 285–293.
- Guo, H.; Dai, T.; Meng, G.; Xia, S.-T. Towards robust scene text image super-resolution via explicit location enhancement. arXiv 2023, arXiv:2307.09749.
- Guo, K.; Zhu, X.; Schaefer, G.; Ding, R.; Fang, H. Self-supervised memory learning for scene text image super-resolution. Expert Syst. Appl. 2024, 258, 125247.
- Shi, Q.; Zhu, Y.; Liu, Y.; Ye, J.; Yang, D. Perceiving Multiple Representations for scene text image super-resolution guided by text recognizer. Eng. Appl. Artif. Intell. 2023, 124, 106551.
- TomyEnrique, L.; Du, X.; Liu, K.; Yuan, H.; Zhou, Z.; Jin, C. Efficient scene text image super-resolution with semantic guidance. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 3160–3164.
- Zhang, X.-G. A new kind of super-resolution reconstruction algorithm based on the ICM and the bilinear interpolation. In Proceedings of the 2008 International Seminar on Future BioMedical Information Engineering, Wuhan, China, 18–20 December 2008; pp. 183–186.
- Akhtar, P.; Azhar, F. A single image interpolation scheme for enhanced super resolution in bio-medical imaging. In Proceedings of the 2010 4th International Conference on Bioinformatics and Biomedical Engineering, Chengdu, China, 18–20 June 2010; pp. 1–5.
- Badran, Y.K.; Salama, G.I.; Mahmoud, T.A.; Mousa, A.; Moussa, A. Single Image Super Resolution Using Discrete Cosine Transform Driven Regression Tree. In Proceedings of the 2020 37th National Radio Science Conference (NRSC), Cairo, Egypt, 8–10 September 2020; pp. 128–136.
- Park, S.C.; Park, M.K.; Kang, M.G. Super-resolution image reconstruction: A technical overview. IEEE Signal Process. Mag. 2003, 20, 21–36.
- Faramarzi, A.; Ahmadyfard, A.; Khosravi, H. Adaptive image super-resolution algorithm based on fractional Fourier transform. Image Anal. Stereol. 2022, 41, 133–144.
- Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010, 19, 2861–2873.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
- Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
- Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645.
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
- Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59.
- Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV; pp. 184–199.
- Mou, Y.; Tan, L.; Yang, H.; Chen, J.; Liu, L.; Yan, R.; Huang, Y. PlugNet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV; pp. 158–174.
- Zhao, M.; Wang, M.; Bai, F.; Li, B.; Wang, J.; Zhou, S. C3-STISR: Scene text image super-resolution with triple clues. arXiv 2022, arXiv:2204.14044.
- Qin, R.; Wang, B. Scene text image super-resolution via content perceptual loss and criss-cross transformer blocks. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–10.
- Liu, B.; Yang, Z.; Wang, P.; Zhou, J.; Liu, Z.; Song, Z.; Liu, Y.; Xiong, Y. TextDiff: Mask-guided residual diffusion models for scene text image super-resolution. arXiv 2023, arXiv:2308.06743.
- Zhou, Y.; Gao, L.; Tang, Z.; Wei, B. Recognition-guided diffusion model for scene text image super-resolution. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 2940–2944.
- Noguchi, C.; Fukuda, S.; Yamanaka, M. Scene text image super-resolution based on text-conditional diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1485–1495.
- Zhao, Z.; Xue, H.; Fang, P.; Zhu, S. PEAN: A diffusion-based prior-enhanced attention network for scene text image super-resolution. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 9769–9778.
- Zhao, M.; Xu, Y.; Li, B.; Wang, J.; Guan, J.; Zhou, S. HiREN: Towards higher supervision quality for better scene text image super-resolution. Neurocomputing 2025, 623, 129309.
- Zhao, C.; Feng, S.; Zhao, B.N.; Ding, Z.; Wu, J.; Shen, F.; Shen, H.T. Scene text image super-resolution via parallelly contextual attention network. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 2908–2917.
- Zhu, S.; Zhao, Z.; Fang, P.; Xue, H. Improving scene text image super-resolution via dual prior modulation network. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3843–3851.
- Fu, M.; Man, X.; Xu, Y.; Shao, J. ESTISR: Adapting efficient scene text image super-resolution for real-scenes. arXiv 2023, arXiv:2306.02443.
- Zhang, W.; Deng, X.; Jia, B.; Yu, X.; Chen, Y.; Ma, J.; Ding, Q.; Zhang, X. Pixel adapter: A graph-based post-processing approach for scene text image super-resolution. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2168–2179.
- Ma, J.; Liang, Z.; Xiang, W.; Yang, X.; Zhang, L. A benchmark for Chinese-English scene text image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19452–19461.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Li, X.; Zuo, W.; Loy, C.C. Learning generative structure prior for blind text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10103–10113.
- Stamatopoulos, N.; Gatos, B.; Louloudis, G.; Pal, U.; Alaei, A. ICDAR 2013 handwriting segmentation contest. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1402–1406.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1156–1160.
- Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464.
- Phan, T.Q.; Shivakumara, P.; Tian, S.; Tan, C.L. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 569–576.
- Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 2014, 41, 8027–8048.
- Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1905–1914.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 606–615.
| Fusion Strategy | Easy | Medium | Hard | avgAcc ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| w/o DBFA | 51.2% | 41.9% | 31.7% | 41.6% | 21.02 | 0.7690 |
|  | 61.8% | 52.1% | 37.9% | 50.6% | 21.10 | 0.7819 |
|  | 60.3% | 50.4% | 36.9% | 49.2% | 20.87 | 0.7783 |
| TPI | 62.9% | 53.5% | 39.8% | 52.8% | 21.52 | 0.7930 |
| LTFA | 63.1% | 53.8% | 39.8% | 53.1% | 21.43 | 0.7954 |
| DBFA | 63.5% | 55.3% | 39.9% | 53.6% | 21.84 | 0.7997 |
| No. | DIFE Parameter | Easy | Medium | Hard | avgAcc |
|---|---|---|---|---|---|
| 1 | 9 × 9 | 62.8% | 53.6% | 38.7% | 52.6% |
| 2 | 1 × 1, 1 × 1 + 5 × 5 | 62.4% | 52.1% | 38.6% | 52.5% |
| 3 | 1 × 1, 1 × 1 + 7 × 7 | 63.2% | 53.7% | 38.9% | 52.7% |
| 4 | 1 × 1, 1 × 1 + 9 × 9 | 63.4% | 53.9% | 39.1% | 52.9% |
| 5 | 1 × 1, 1 × 1 + 3 × 3, 7 × 7 + 1 × 1 | 62.9% | 53.5% | 39.5% | 53.4% |
| 6 | 1 × 1, 1 × 1 + 9 × 9, 5 × 5 + 1 × 1 | 63.6% | 54.6% | 39.7% | 53.2% |
| 7 | 1 × 1, 1 × 1 + 5 × 5, 1 × 1 + 7 × 7, 1 × 1 + 9 × 9, 3 × 3 + 1 × 1 | 63.8% | 54.8% | 39.8% | 53.4% |
| 8 | 1 × 1, 1 × 1 + 5 × 5, 1 × 1 + 7 × 7, 1 × 1 + 9 × 9, 3 × 3 + 1 × 1 (dynamic) | 63.5% | 55.3% | 39.9% | 53.6% |
| Approach | CCB | Easy | Medium | Hard | avgAcc | PSNR | SSIM |
|---|---|---|---|---|---|---|---|
| TPGSR | × | 61.0% | 49.9% | 36.7% | 49.8% | 21.02 | 0.7690 |
| TPGSR | √ | 62.1% | 51.6% | 36.7% | 50.4% | 21.32 | 0.7705 |
| TATT | × | 62.6% | 53.4% | 39.8% | 52.6% | 21.52 | 0.7930 |
| TATT | √ | 62.4% | 54.4% | 39.6% | 52.7% | 20.95 | 0.7951 |
| C3-STISR | × | 65.2% | 53.6% | 39.8% | 53.7% | 21.51 | 0.7721 |
| C3-STISR | √ | 65.1% | 54.0% | 39.6% | 53.8% | 21.37 | 0.7853 |
| MNTSR | × | 64.3% | 54.5% | 38.7% | 53.3% | 21.53 | 0.7946 |
| MNTSR | √ | 64.0% | 54.8% | 38.9% | 53.2% | 21.67 | 0.7964 |
| SCE-STISR | × | 63.3% | 53.9% | 39.8% | 53.0% | 21.43 | 0.7982 |
| SCE-STISR | √ | 63.5% | 55.3% | 39.9% | 53.6% | 21.84 | 0.7997 |
| LEMMA | × | 67.1% | 58.8% | 40.6% | 56.3% | 21.43 | 0.7543 |
| LEMMA | √ | 67.2% | 58.6% | 40.8% | 56.4% | 21.59 | 0.7623 |
| PEAN | × | 68.9% | 60.2% | 45.9% | 59.0% | 21.57 | 0.7946 |
| PEAN | √ | 68.8% | 60.3% | 46.0% | 59.1% | 21.78 | 0.8017 |
| DBFA | DIFE | CCB | Easy | Medium | Hard | avgAcc | FPS | Parameters |
|---|---|---|---|---|---|---|---|---|
| - | - | - | 62.26% | 52.73% | 39.09% | 52.1% | 44 | 30.5 M |
| √ | - | - | 62.94% | 52.73% | 39.46% | 52.4% | 40 | 50.5 M |
| √ | √ | - | 63.25% | 53.93% | 39.76% | 53.0% | 39 | 50.9 M |
| √ | √ | √ | 63.53% | 55.31% | 39.95% | 53.6% | 39 | 51.2 M |
| Method | CRNN Easy (%) | CRNN Medium (%) | CRNN Hard (%) | CRNN Avg (%) | MORAN Easy (%) | MORAN Medium (%) | MORAN Hard (%) | MORAN Avg (%) | ASTER Easy (%) | ASTER Medium (%) | ASTER Hard (%) | ASTER Avg (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bicubic | 36.4 | 21.1 | 21.1 | 26.8 | 60.6 | 37.9 | 30.8 | 44.1 | 67.4 | 42.4 | 31.2 | 48.2 |
| SRCNN | 41.1 | 22.3 | 22.0 | 29.2 | 63.9 | 40.0 | 29.4 | 45.6 | 70.6 | 44.0 | 31.5 | 50.0 |
| SRResNet | 45.2 | 32.6 | 25.5 | 35.1 | 66.0 | 47.1 | 33.4 | 49.9 | 69.4 | 50.5 | 35.7 | 53.0 |
| EDSR | 42.7 | 29.3 | 24.1 | 32.7 | 63.6 | 45.4 | 32.2 | 48.1 | 72.3 | 48.6 | 34.3 | 53.0 |
| RCAN | 46.8 | 27.9 | 26.5 | 34.5 | 63.1 | 42.9 | 33.6 | 47.5 | 67.3 | 46.6 | 35.1 | 50.7 |
| CARN | 40.7 | 27.4 | 24.3 | 31.4 | 58.8 | 42.3 | 31.1 | 45.0 | 62.3 | 44.7 | 31.5 | 47.1 |
| HAN | 51.6 | 35.8 | 29.0 | 39.6 | 67.4 | 48.5 | 35.4 | 51.5 | 71.1 | 52.8 | 39.0 | 55.3 |
| TSRN | 52.5 | 38.3 | 31.4 | 41.4 | 70.1 | 55.3 | 37.9 | 55.4 | 75.1 | 56.3 | 40.1 | 58.3 |
| PCAN | 59.6 | 45.4 | 34.8 | 47.4 | 73.7 | 57.6 | 41.0 | 58.5 | 77.5 | 60.7 | 43.1 | 61.5 |
| TBSRN | 59.6 | 47.1 | 35.3 | 48.1 | 74.1 | 57.0 | 40.8 | 58.4 | 75.7 | 59.9 | 41.6 | 60.1 |
| Gestalt | 61.2 | 47.6 | 35.5 | 48.9 | 75.8 | 57.8 | 41.4 | 59.4 | 77.9 | 60.2 | 42.4 | 61.3 |
| TPGSR | 63.1 | 52.0 | 38.6 | 51.8 | 74.9 | 60.5 | 44.1 | 60.5 | 78.9 | 62.7 | 44.5 | 62.8 |
| TATT | 62.6 | 53.4 | 39.8 | 52.6 | 72.5 | 60.2 | 43.1 | 59.5 | 78.9 | 63.4 | 45.4 | 63.6 |
| C3-STISR | 65.2 | 53.6 | 39.8 | 53.7 | 74.2 | 61.0 | 43.2 | 59.5 | 79.1 | 63.3 | 46.8 | 64.1 |
| PerMR | 65.1 | 50.4 | 37.8 | 52.0 | 76.7 | 58.9 | 42.9 | 60.6 | 80.8 | 62.9 | 45.5 | 64.2 |
| MNTSR | 64.3 | 54.5 | 38.7 | 53.3 | 76.7 | 61.2 | 44.9 | 61.9 | 79.5 | 64.6 | 45.8 | 64.4 |
| TEAN | 63.7 | 52.5 | 38.1 | 52.2 | 76.8 | 60.8 | 43.4 | 61.4 | 80.4 | 64.5 | 45.6 | 64.6 |
| DPMN | 64.3 | 54.1 | 39.2 | 53.3 | 73.2 | 61.4 | 43.8 | 60.4 | 79.2 | 64.0 | 45.0 | 63.8 |
| TCDM | 67.3 | 57.3 | 42.7 | 55.7 | 77.6 | 62.9 | 45.9 | 62.2 | 81.3 | 65.1 | 50.1 | 65.5 |
| PEAN | 68.9 | 60.2 | 45.9 | 59.0 | 79.4 | 67.0 | 49.1 | 66.1 | 84.5 | 71.4 | 52.9 | 70.6 |
| LEMMA | 67.1 | 58.8 | 40.6 | 56.3 | 77.7 | 64.4 | 44.6 | 63.2 | 81.1 | 66.3 | 47.4 | 66.0 |
| SCE-STISR | 63.5 | 55.3 | 39.9 | 53.6 | 73.9 | 59.5 | 44.7 | 60.9 | 80.9 | 63.4 | 45.8 | 64.5 |
| Method | PSNR Easy | PSNR Medium | PSNR Hard | PSNR Avg | SSIM Easy | SSIM Medium | SSIM Hard | SSIM Avg |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 22.35 | 18.98 | 19.39 | 20.35 | 0.7884 | 0.6254 | 0.6592 | 0.6961 |
| SRCNN | 23.48 | 19.06 | 19.34 | 20.78 | 0.8379 | 0.6323 | 0.6791 | 0.7227 |
| SRResNet | 24.36 | 18.88 | 19.29 | 21.03 | 0.8681 | 0.6406 | 0.6911 | 0.7403 |
| EDSR | 24.26 | 18.63 | 19.14 | 20.68 | 0.8633 | 0.6440 | 0.7108 | 0.7394 |
| RCAN | 22.15 | 18.81 | 19.83 | 20.26 | 0.8525 | 0.6465 | 0.7227 | 0.7406 |
| CARN | 22.70 | 19.15 | 20.02 | 20.62 | 0.8384 | 0.6412 | 0.7172 | 0.7323 |
| HAN | 23.30 | 19.02 | 20.16 | 20.95 | 0.8691 | 0.6537 | 0.7387 | 0.7596 |
| TSRN | 25.07 | 18.86 | 19.71 | 21.42 | 0.8897 | 0.6676 | 0.7302 | 0.7690 |
| PCAN | 24.57 | 19.14 | 20.26 | 21.49 | 0.8830 | 0.6781 | 0.7475 | 0.7752 |
| TBSRN | 23.46 | 19.17 | 19.68 | 20.91 | 0.8729 | 0.6455 | 0.7452 | 0.7603 |
| Gestalt | 23.95 | 18.58 | 19.74 | 20.76 | 0.8611 | 0.6621 | 0.7520 | 0.7584 |
| TPGSR | 23.73 | 18.68 | 20.06 | 20.97 | 0.8805 | 0.6738 | 0.7440 | 0.7719 |
| TATT | 24.72 | 19.02 | 20.31 | 21.52 | 0.9006 | 0.6911 | 0.7703 | 0.7930 |
| C3-STISR | 24.71 | 19.03 | 20.09 | 21.51 | 0.8545 | 0.6674 | 0.7639 | 0.7721 |
| PerMR | 24.89 | 18.98 | 20.42 | 21.43 | 0.9102 | 0.6921 | 0.7658 | 0.7894 |
| MNTSR | 24.93 | 19.28 | 20.38 | 21.50 | 0.9173 | 0.6860 | 0.7806 | 0.7946 |
| TEAN | - | - | - | 21.70 | - | - | - | 0.7850 |
| DPMN | 24.84 | 19.08 | 20.51 | 21.49 | 0.9013 | 0.6902 | 0.7695 | 0.7925 |
| LEMMA | 24.67 | 19.21 | 20.37 | 21.43 | 0.8734 | 0.6783 | 0.5601 | 0.7543 |
| PEAN | 24.89 | 19.46 | 20.41 | 21.75 | 0.9157 | 0.6901 | 0.7837 | 0.7946 |
| SCE-STISR | 24.99 | 19.13 | 20.78 | 21.84 | 0.9038 | 0.6955 | 0.7859 | 0.7951 |
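For reference, PSNR figures such as those above follow the standard definition 10·log₁₀(MAX²/MSE) between a reference image and the reconstruction. A minimal NumPy sketch with a toy pair of images:

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((16, 64), dtype=np.uint8)  # toy 16x64 grayscale image
test = ref.copy()
test[0, 0] = 16                           # single corrupted pixel
print(round(psnr(ref, test), 2))          # → 54.15
```

Higher is better; identical images give infinite PSNR, which is why it is paired with SSIM for perceptual structure.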
| Metric | Statistic | SCE-STISR | TPGSR | TATT | C3-STISR | MNTSR |
|---|---|---|---|---|---|---|
| PSNR | mean | 21.84 | 20.95 | 21.51 | 21.51 | 21.52 |
| PSNR | p-value | - | 3.7 × 10⁻⁶ | 4.5 × 10⁻⁵ | 9.1 × 10⁻⁵ | 2.8 × 10⁻³ |
| SSIM | mean | 0.7951 | 0.7719 | 0.7940 | 0.7716 | 0.7941 |
| SSIM | p-value | - | 1.4 × 10⁻⁵ | 1.4 × 10⁻¹ | 2.3 × 10⁻⁵ | 6.7 × 10⁻¹ |
| avgAcc | mean | 53.6 | 51.9 | 52.6 | 53.7 | 53.4 |
| avgAcc | p-value | - | 1.5 × 10⁻⁴ | 2.8 × 10⁻⁴ | 1.6 × 10⁻² | 1.8 × 10⁻² |
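Significance figures of this kind typically come from a paired test over repeated runs. As a sketch of the mechanism only, the following hand-rolled paired t statistic uses hypothetical per-run PSNR values (not the paper's actual runs), with significance judged against the two-sided 5% critical value for 4 degrees of freedom:

```python
import math

def paired_t_statistic(a, b):
    """t statistic of a paired two-sample t-test (n - 1 degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical PSNR values for two models over five repeated runs.
ours = [21.80, 21.86, 21.83, 21.85, 21.82]
baseline = [21.50, 21.55, 21.49, 21.53, 21.52]
t = paired_t_statistic(ours, baseline)
print(t > 2.776)  # True when the difference is significant at the 5% level
```

In practice one would report the exact p-value from the t distribution (e.g. via `scipy.stats.ttest_rel`) rather than a critical-value comparison.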
| Method | IC13 | IC15 | CUTE80 | SVT | SVTP |
|---|---|---|---|---|---|
| Bicubic | 9.6% | 10.1% | 35.8% | 3.3% | 10.2% |
| SRResNet | 11.4% | 13.4% | 50.5% | 9.3% | 13.8% |
| TSRN | 15.6% | 18.6% | 66.9% | 10.0% | 16.4% |
| TBSRN | 17.7% | 21.3% | 75.0% | 12.2% | 17.4% |
| TPGSR | 22.7% | 24.2% | 72.6% | 13.7% | 16.5% |
| TATT | 27.6% | 28.6% | 74.7% | 14.2% | 25.9% |
| C3-STISR | 24.7% | 22.7% | 71.5% | 10.2% | 17.7% |
| SCE-STISR | 28.9% | 30.7% | 74.9% | 15.1% | 26.5% |
| Method | CRNN | MORAN | ASTER |
|---|---|---|---|
| MNTSR | 38.9% | 49.3% | 52.0% |
| DPMN | 35.4% | 46.2% | 49.6% |
| LEMMA | 40.8% | 53.3% | 55.7% |
| PEAN | 39.5% | 52.4% | 54.8% |
| SCE-STISR | 42.7% | 54.1% | 57.3% |
| Kernel Width r | Method | Easy | Medium | Hard |
|---|---|---|---|---|
| r = 1 | TATT | 58.1% | 51.4% | 47.3% |
| r = 1 | Ours | 58.8% | 52.9% | 49.2% |
| r = 3 | TATT | 47.4% | 42.8% | 37.7% |
| r = 3 | Ours | 48.5% | 44.7% | 40.1% |
| r = 5 | TATT | 39.8% | 35.5% | 31.6% |
| r = 5 | Ours | 41.3% | 37.5% | 34.7% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, M.; Li, Q.; Liu, H. Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution. Sensors 2025, 25, 2228. https://doi.org/10.3390/s25072228