Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search
Abstract
1. Introduction
2. Related Work
2.1. End-to-End OCR Models
2.2. Visual Text Generation and Data Synthesis
2.3. Multilingual Text Retrieval
3. Methodology
3.1. MultiLang-OCR-30K Dataset Construction
3.1.1. Theoretical Framework for Data Synthesis Based on Diffusion Models
3.1.2. Multi-Modal Conditional Control Mechanism
3.1.3. Cross-Lingual Text Embedding Representation
3.1.4. Data Distribution and Diversity Assurance
3.2. Multilingual Extension of the GOT Model
3.2.1. Parameter-Efficient Model Adaptation Strategy
3.2.2. Cross-Lingual Text Representation Learning
3.3. Hybrid Retrieval Framework
3.3.1. Dual-Path Feature Extraction
3.3.2. Feature Optimization and Fusion
4. Experiment
4.1. Experimental Setup
4.2. Comprehensive Comparison of Multilingual OCR Performance
4.2.1. Limitations of Validation Using Real-World Datasets
4.2.2. Synthetic-to-Real Domain Gap Analysis
4.2.3. Error Analysis and Performance Limitations
4.3. Ablation Experiments
4.4. Detailed Comparison with State-of-the-Art OCR Methods
4.5. Comparison of Retrieval Performance
5. Conclusions
6. Discussion
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A

References
- Wei, H.; Liu, C.; Chen, J.; Wang, J.; Kong, L.; Xu, Y.; Ge, Z.; Zhao, L.; Sun, J.; Peng, Y. General OCR Theory: Towards OCR-2.0 via a Unified End-to-End Model. arXiv 2024, arXiv:2409.01704. [Google Scholar] [CrossRef]
- Tuo, Y.; Xiang, W.; He, J.Y.; Geng, Y.; Xie, X. AnyText: Multilingual Visual Text Generation and Editing. arXiv 2023, arXiv:2311.03054. [Google Scholar] [CrossRef]
- Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition, Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar] [CrossRef]
- Du, Y.; Li, C.; Guo, R.; Cui, C.; Liu, W.; Zhou, J.; Lu, B.; Yang, Y.; Liu, Q.; Hu, X.; et al. PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System. arXiv 2021, arXiv:2109.03144. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 31 January 2021).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 12 June 2017).
- Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. Proc. AAAI Conf. Artif. Intell. 2023, 37, 13094–13102. [Google Scholar] [CrossRef]
- Salehudin, M.A.M.; Basah, S.N.; Yazid, H.; Basaruddin, K.S.; Safar, M.J.A.; Som, M.H.M.; Sidek, K.A. Analysis of Optical Character Recognition Using EasyOCR under Image Degradation. Proc. J. Phys. Conf. Ser. 2023, 2641, 012001. [Google Scholar] [CrossRef]
- Ma, J.; Zhao, M.; Chen, C.; Wang, R.; Niu, D.; Lu, H.; Lin, X. GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently. arXiv 2023, arXiv:2303.17870. [Google Scholar] [CrossRef]
- Chen, J.; Huang, Y.; Lv, T.; Cui, L.; Chen, Q.; Wei, F. TextDiffuser: Diffusion Models as Text Painters. Adv. Neural Inf. Process. Syst. 2023, 36, 9353–9387. [Google Scholar] [CrossRef]
- Yang, Y.; Gui, D.; Yuan, Y.; Liang, W.; Ding, H.; Hu, H.; Chen, K. GlyphControl: Glyph Conditional Control for Visual Text Generation. Adv. Neural Inf. Process. Syst. 2023, 36, 44050–44066. [Google Scholar] [CrossRef]
- Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
- Ramos, J. Using TF-IDF to Determine Word Relevance in Document Queries. Proc. First Instr. Conf. Mach. Learn. 2003, 242, 29–48. Available online: https://www.semanticscholar.org/paper/Using-TF-IDF-to-Determine-Word-Relevance-in-Queries-Ramos/b3bf6373ff41a115197cb5b30e57830c16130c2c (accessed on 1 January 2003).
- Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.S.H.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
- Conneau, A.; Lample, G. Cross-lingual Language Model Pretraining. Adv. Neural Inf. Process. Syst. 2019, 32, 7059–7069. Available online: https://proceedings.neurips.cc/paper_files/paper/2019/hash/c04c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html (accessed on 1 November 2019).
- Chi, Z.; Huang, S.; Dong, L.; Ma, S.; Zheng, B.; Singhal, S.; Bajaj, P.; Song, X.; Mao, X.L.; Huang, H.Y.; et al. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6170–6182. Available online: https://aclanthology.org/2022.acl-long.427/ (accessed on 23 May 2022).
- Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 483–498. [Google Scholar] [CrossRef]
- Yang, Y.; Cer, D.; Ahmad, A.; Guo, M.; Law, J.; Constant, N.; Abrego, G.H.; Yuan, S.; Tar, C.; Sung, Y.H.; et al. Multilingual Universal Sentence Encoder for Semantic Retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 87–94. [Google Scholar] [CrossRef]
- Koilada, D.K. Hybrid Semantic Retrieval: Augmenting Weighted TF-IDF with BERT for Enhanced Question Answering. Eng. Arch. 2023. Available online: https://www.atlantis-press.com/proceedings/icsiaiml-25/126021198 (accessed on 6 January 2026).
- Johnson, J.; Douze, M.; Jégou, H. Billion-scale Similarity Search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming Catastrophic Forgetting in Neural Networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
- Zhuang, L.; Wayne, L.; Ya, S.; Jun, Z. A Robustly Optimized BERT Pre-training Approach with Post-Training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, Huhhot, China, 13–15 August 2021; Chinese Information Processing Society of China: Beijing, China, 2021; pp. 1218–1227. Available online: https://aclanthology.org/2021.ccl-1.108 (accessed on 15 August 2021).
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. Proc. Track 2010, 9, 249–256. Available online: http://proceedings.mlr.press/v9/glorot10a.html (accessed on 5 September 2010).
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar] [CrossRef]
- Ye, J.; Hu, A.; Xu, H.; Ye, Q.; Yan, M.; Xu, G.; Li, C.; Tian, J.; Qian, Q.; Zhang, J.; et al. UReader: Universal OCR-Free Visually-Situated Language Understanding with Multimodal Large Language Model. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 2841–2858. Available online: https://arxiv.org/abs/2310.05126 (accessed on 10 December 2023).
- Wei, H.; Kong, L.; Chen, J.; Zhao, L.; Ge, Z.; Yang, J.; Sun, J.; Han, C.; Zhang, X. Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Model. In Proceedings of the Computer Vision—ECCV 2024 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 408–424. [Google Scholar] [CrossRef]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
- Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring Plain Vision Transformer Backbones for Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 280–296. [Google Scholar] [CrossRef]
- Kim, W.; Kruglik, S.; Kiah, H.M. Verifiable Coded Computation of Multiple Functions. IEEE Trans. Inf. Theory 2024, 70, 528–549. [Google Scholar] [CrossRef]
- Zhou, X.; Shroff, N. Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits. Entropy 2025, 27, 541. [Google Scholar] [CrossRef]

| Method | Parameter | Edit-Distance | F1-Score | Precision | Recall | BLEU | METEOR |
|---|---|---|---|---|---|---|---|
| PaddleOCRv4 [5] | – | 0.245 | 0.720 | 0.780 | 0.670 | 0.650 | 0.690 |
| GOT [1] | 580 M | 0.648 | 0.352 | 0.487 | 0.275 | 0.189 | 0.312 |
| TextDiffuser [9] | – | 0.315 | 0.685 | 0.752 | 0.628 | 0.598 | 0.642 |
| GlyphControl [10] | – | 0.298 | 0.702 | 0.768 | 0.645 | 0.623 | 0.668 |
| Ours | 580 M | 0.173 | 0.823 | 0.856 | 0.785 | 0.781 | 0.812 |
| Method | Parameter | Edit-Distance | F1-Score | Precision | Recall | BLEU | METEOR |
|---|---|---|---|---|---|---|---|
| PaddleOCRv4 [5] | – | 0.231 | 0.735 | 0.823 | 0.665 | 0.698 | 0.723 |
| GOT [1] | 580 M | – | – | – | – | – | – |
| TextDiffuser [9] | – | 0.349 | 0.651 | 0.789 | 0.554 | 0.567 | 0.601 |
| GlyphControl [10] | – | 0.327 | 0.673 | 0.801 | 0.581 | 0.589 | 0.634 |
| Ours | 580 M | 0.195 | 0.765 | 0.836 | 0.705 | 0.723 | 0.758 |
| Method | Parameter | Edit-Distance | F1-Score | Precision | Recall | BLEU | METEOR |
|---|---|---|---|---|---|---|---|
| PaddleOCRv4 [5] | – | 0.218 | 0.742 | 0.831 | 0.673 | 0.685 | 0.711 |
| GOT [1] | 580 M | – | – | – | – | – | – |
| TextDiffuser [9] | – | 0.332 | 0.668 | 0.795 | 0.573 | 0.582 | 0.617 |
| GlyphControl [10] | – | 0.314 | 0.685 | 0.809 | 0.592 | 0.604 | 0.649 |
| Ours | 580 M | 0.187 | 0.781 | 0.842 | 0.728 | 0.735 | 0.769 |
| Configuration | Edit-Distance | F1-Score | Precision | Recall | BLEU | METEOR |
|---|---|---|---|---|---|---|
| Complete model (Ours) | 0.173 | 0.823 | 0.856 | 0.793 | 0.781 | 0.812 |
| Remove text-perception loss | 0.225 | 0.775 | 0.812 | 0.742 | 0.732 | 0.765 |
| CLIP embedding (non-OCR) | 0.292 | 0.708 | 0.752 | 0.671 | 0.654 | 0.698 |
| Unfrozen encoder | 0.209 | 0.791 | 0.823 | 0.763 | 0.751 | 0.783 |
| No synthetic data | 0.388 | 0.612 | 0.689 | 0.549 | 0.523 | 0.578 |
| Method | Parameter | Edit-Distance | F1-Score | Precision | Recall | BLEU | METEOR |
|---|---|---|---|---|---|---|---|
| UReader [31] | 7 B | 0.718 | 0.344 | 0.296 | 0.469 | 0.103 | 0.287 |
| LLaVA-NeXT [22] | 34 B | 0.430 | 0.647 | 0.573 | 0.881 | 0.478 | 0.582 |
| InternVL-ChatV1.5 [13] | 26 B | 0.265 | 0.816 | 0.784 | 0.866 | 0.622 | 0.717 |
| Nougat [5] | 250 M | 0.255 | 0.745 | 0.720 | 0.809 | 0.665 | 0.761 |
| TextMonkey [11] | 7 B | 0.265 | 0.821 | 0.778 | 0.906 | 0.671 | 0.762 |
| DocOwl1.5 [21] | 7 B | 0.258 | 0.862 | 0.835 | 0.962 | 0.788 | 0.858 |
| Vary [32] | 7 B | 0.113 | 0.952 | 0.951 | 0.944 | 0.754 | 0.873 |
| Qwen-VL-Max [33] | >72 B | 0.091 | 0.931 | 0.917 | 0.946 | 0.756 | 0.885 |
| Fox [25] | 1.8 B | 0.061 | 0.954 | 0.964 | 0.946 | 0.842 | 0.908 |
| Ours | 580 M | 0.042 | 0.968 | 0.972 | 0.964 | 0.865 | 0.925 |
| Method | Parameter | Edit-Distance | F1-Score | Precision | Recall | BLEU | METEOR |
|---|---|---|---|---|---|---|---|
| UReader [31] | 7 B | 0.718 | 0.344 | 0.296 | 0.469 | 0.103 | 0.287 |
| LLaVA-NeXT [22] | 34 B | 0.430 | 0.647 | 0.573 | 0.881 | 0.478 | 0.582 |
| InternVL-ChatV1.5 [13] | 26 B | 0.393 | 0.751 | 0.698 | 0.917 | 0.568 | 0.663 |
| Nougat [5] | 250 M | 0.255 | 0.745 | 0.720 | 0.890 | 0.665 | 0.761 |
| TextMonkey [11] | 7 B | 0.265 | 0.821 | 0.778 | 0.906 | 0.671 | 0.762 |
| DocOwl1.5 [21] | 7 B | 0.258 | 0.862 | 0.835 | 0.962 | 0.788 | 0.858 |
| Vary [32] | 7 B | 0.092 | 0.918 | 0.906 | 0.956 | 0.885 | 0.926 |
| Qwen-VL-Max [33] | >72 B | 0.057 | 0.964 | 0.955 | 0.977 | 0.942 | 0.971 |
| Fox [25] | 1.8 B | 0.046 | 0.952 | 0.957 | 0.948 | 0.930 | 0.954 |
| Ours | 580 M | 0.038 | 0.972 | 0.971 | 0.973 | 0.947 | 0.958 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Yang, S.; Liu, Z.; Li, K.; Song, R.; Li, Y.; Qi, X. Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search. Appl. Sci. 2026, 16, 1771. https://doi.org/10.3390/app16041771