Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search

Yang, Shuo; Liu, Zhandong; Li, Ke; Song, Ruixia; Li, Yong; Qi, Xiangwei

doi:10.3390/app16041771

Open AccessArticle

Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search

by

Shuo Yang

^1,2,

Zhandong Liu

^1,2,*

,

Ke Li

^1,2,

Ruixia Song

³,

Yong Li

^1,2

and

Xiangwei Qi

^1,2

¹

School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830017, China

²

Xinjiang Engineering Research Center for Smart Education and Applications, Urumqi 830054, China

³

Library of Xinjiang Normal University, Xinjiang Normal University, Urumqi 830017, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 1771; https://doi.org/10.3390/app16041771

Submission received: 18 January 2026 / Revised: 4 February 2026 / Accepted: 7 February 2026 / Published: 11 February 2026

Download

Browse Figures

Versions Notes

Abstract

This study addresses the limitations of current Optical Character Recognition (OCR) systems in supporting minority languages and integrating intelligent retrieval functions. We propose an integrated system that combines an advanced end-to-end OCR model with a novel hybrid search approach. First, we developed the MultiLang-OCR-30K dataset containing 30,000 annotated samples of handwritten Chinese, Tibetan, and Uyghur texts. Second, we extended the GOT model using a freeze encoder–fine-tune decoder strategy to enhance multilingual capabilities. Finally, we designed a character-level hybrid retrieval framework integrating TF-IDF efficiency with Sentence-BERT semantic strength. Experimental results show our extended GOT model achieves sentence accuracies of 82.3%, 76.5%, and 78.1% for handwritten Chinese, Tibetan, and Uyghur, respectively. The hybrid search improves F1 score by 28.7% over TF-IDF alone while maintaining 23 ms average response time. This system provides a practical solution for multilingual document digitization and management, thereby bridging the technological gap for minority languages.

Keywords:

unified end-to-end OCR; GOT model; text retrieval; hybrid search; minority languages; intelligent document management

1. Introduction

Optical Character Recognition (OCR) technology plays a pivotal role in document digitization. However, contemporary OCR systems including state-of-the-art end-to-end models based on Transformers such as GOT [1] are predominantly tailored to pre-training data and model architectures optimized for major languages like Chinese and English. Consequently, these systems exhibit markedly low recognition accuracy for minority languages, such as Tibetan and Uyghur, particularly when processing handwritten scripts. This limitation severely hinders the digitization and preservation of ethnic cultural heritage and undermines the practicality and inclusivity of multilingual information management systems.

The digitization of documents in minority languages holds significant cultural and scholarly value. Many precious historical, religious, and literary works exist only in manuscript form, facing risks of physical deterioration and loss. Effective OCR technology is fundamental for their permanent preservation, extensive research, and digital accessibility. Furthermore, in multilingual regions such as Western China, efficient OCR for minority languages is crucial for applications in public administration, educational publishing, and cross-lingual information retrieval. It serves as a technical prerequisite for promoting linguistic equality and ensuring information accessibility.

In recent years, OCR technology has evolved from traditional multi-module pipeline systems to unified end-to-end deep learning models. The “OCR 2.0” paradigm, represented by GOT, reduces system complexity and error propagation by directly mapping visual encodings to textual sequences via a decoder. However, generalizing such models to other languages, especially low-resource ones, remains a significant challenge. On one hand, the scarcity of large-scale, high-quality annotated data is a core bottleneck. On the other hand, the text representation modules commonly used in these models (e.g., the CLIP text encoder) are not optimized for fine-grained, character-level visual feature extraction, making it difficult to capture the unique structural patterns of minority scripts, such as the stacked composition of Tibetan characters or the right-to-left cursive style of Uyghur.

Specifically, our analysis reveals that the original ViTDet encoder in GOT struggles with Uyghur cursive due to attention mechanism limitations when processing complex ligatures and connected characters. The self-attention mechanism, while powerful for global context modeling, fails to adequately capture the fine-grained stroke connections and directional dependencies inherent in cursive scripts, leading to recognition errors particularly in handwritten documents where character connectivity varies significantly.

Meanwhile, recent advances in visual text generation offer new solutions to the data scarcity problem. Diffusion Models can synthesize high-fidelity and diverse text images. Works like AnyText [2] achieve precise control over multilingual text content, glyphs, and layout by introducing a dedicated text embedding module and a text-aware loss function, providing a systematic framework for generating large-scale, high-quality annotated data. In information retrieval, models have progressed from traditional lexical matching to deep learning-based semantic matching, while hybrid retrieval strategies attempt to combine the strengths of both for a balance of efficiency and accuracy.

However, a unified, end-to-end system capable of achieving both high-accuracy OCR for minority languages and efficient cross-lingual retrieval remains absent. To this end, this paper proposes and implements a multilingual intelligent retrieval system. The core idea is the deep integration of AnyText’s powerful data synthesis and feature enhancement capabilities with the robust end-to-end recognition architecture of GOT, coupled with the design of a character-level hybrid retrieval framework, to systematically address the key challenges in the digitization and retrieval of minority language documents. The main contributions of this paper are as follows:

(1) Construction of the MultiLang-OCR-30K multilingual handwritten text synthesis dataset. Following the technical pipeline of AnyText [2], we synthesized a high-quality dataset of handwritten text images containing 30,000 annotated samples across three languages—Chinese, Tibetan, and Uyghur—effectively alleviating the data scarcity issue for minority language OCR tasks.

(2) Proposal of a multilingual extension scheme for the GOT model. We designed a parameter-efficient adaptation strategy termed “freeze encoder–fine-tune decoder,” integrated with a dedicated OCR text embedding module and a text-aware loss function. This significantly improves the model’s recognition performance for Tibetan and Uyghur while maintaining its original high accuracy for Chinese and English.

(3) Design of a character-level hybrid retrieval framework. To address the lack of explicit word segmentation in Tibetan and Uyghur, we constructed a two-stage retrieval pipeline that combines fast character-level TF-IDF recall with deep semantic embedding re-ranking, substantially improving query efficiency while ensuring retrieval accuracy.

Through systematic comparative experiments and ablation studies, we validate the significant advantages of the proposed method in terms of OCR accuracy and retrieval performance. Experiments show that the extended GOT model achieves sentence accuracy rates of 82.3%, 76.5%, and 78.1% for handwritten Chinese, Tibetan, and Uyghur, respectively. The designed hybrid retrieval framework improves the average F1@10 metric by 28.7% compared to the pure TF-IDF method, while maintaining an average response time of only 23 ms. This work provides a practical and efficient end-to-end solution for the digital management and intelligent retrieval of minority language documents.

2. Related Work

2.1. End-to-End OCR Models

Optical Character Recognition (OCR) has evolved from traditional modular systems (OCR-1.0) to contemporary end-to-end deep learning models. Conventional frameworks decompose recognition into sequential stages: text detection, region cropping, and character recognition [3,4], suffering from error propagation and limited scalability. Document-level models like Nougat [5] demonstrated whole-page processing but lacked general-purpose versatility. The advent of Transformer architectures [6] catalyzed end-to-end OCR models. The GOT model [1] proposed the “OCR-2.0” paradigm with a streamlined encoder–decoder structure, achieving impressive performance on major languages.

However, generalizing these models to low-resource and minority languages remains challenging. A key bottleneck is the text representation module. Many models rely on vision-language models like CLIP [5] for token alignment. While CLIP excels at semantic alignment, it is not optimized for character-level visual feature extraction, especially for scripts with complex glyph structures. Other notable models include TrOCR [7] and EasyOCR [8]. Their performance on minority languages remains unsatisfactory without substantial adaptation, highlighting the need for inherently multilingual architectures.

2.2. Visual Text Generation and Data Synthesis

Data scarcity is a primary impediment for minority language OCR. Visual text generation offers a powerful solution through diffusion models, which synthesize realistic images conditioned on inputs. The AnyText model [2] represents a significant advancement with its text embedding module integrating semantic information with glyph features from an auxiliary OCR model. This generates text images that are both semantically correct and visually accurate. Its text-aware loss function uses OCR feedback during training to ensure recognizability.

This offers a systematic framework for creating synthetic data. Controlling text content, font, layout, background, and degradation effects generates diverse datasets mirroring real conditions. Related works include GlyphDraw [9], TextDiffuser [10], and GlyphControl [11], which, combined with traditional augmentation [12], offer a comprehensive toolkit. We adapt AnyText’s pipeline to construct the MultiLang-OCR-30K dataset, fundamental for extending GOT’s capabilities.

2.3. Multilingual Text Retrieval

Efficient retrieval from digitized documents follows OCR. Retrieval systems evolved from traditional statistical methods to neural approaches. Traditional models like TF-IDF [13] and BM25 [14] measure lexical overlap, offering efficiency and interpretability but suffering from vocabulary mismatch. This is acute in agglutinative languages like Uyghur or non-segmented scripts like Tibetan. For highly agglutinative languages like Uyghur, where words are formed by adding multiple suffixes to roots, and for Tibetan which lacks explicit word boundaries, classic TF-IDF approaches struggle due to the absence of clear segmentation points and the combinatorial explosion of possible morphological forms.

Neural retrieval models like Sentence-BERT [15] and Dense Passage Retrieval [16] provide semantic understanding through dense embeddings but incur higher computational costs. Cross-lingual models like XLM [17], XLM-E [18], mT5 [19], and multilingual sentence encoders [20] map texts into shared semantic spaces.

Hybrid retrieval strategies [21] balance efficiency and effectiveness through multi-stage pipelines: fast lexical recall followed by neural re-ranking. Libraries like FAISS [22] enable efficient similarity search. Our system adapts this hybrid approach to character-level OCR output and minority language requirements.

3. Methodology

3.1. MultiLang-OCR-30K Dataset Construction

The scarcity of high-quality training data is a critical bottleneck for improving OCR performance on minority languages. Traditional data collection methods for low-resource languages like Tibetan and Uyghur face significant challenges, including high annotation costs, limited sample sizes, and inconsistent data quality. To address these issues, we adopt the data synthesis methodology proposed by AnyText to construct the MultiLang-OCR-30K dataset, comprising 30,000 annotated samples.

3.1.1. Theoretical Framework for Data Synthesis Based on Diffusion Models

Our data synthesis pipeline is built on Denoising Diffusion Probabilistic Models (DDPMs) [23]. The process is illustrated in Figure 1.

Diffusion models learn data distributions through a forward process that adds noise and a reverse process that removes it [24]. The forward diffusion process incrementally introduces Gaussian noise via a Markov chain:

q (x_{t} ∣ x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I),

(1)

where

β_{t}

is the noise scheduling parameter at step t, following

0 < β_{t} < 1

. This transforms the clean image

x_{0}

into a noisy image

x_{t}

. The reverse generation process for text-controlled image synthesis is:

p_{θ} (x_{t - 1} ∣ x_{t}, c) = N (x_{t - 1}; μ_{θ} (x_{t}, t, c), σ_{θ} (x_{t}, t, c) I),

(2)

where c is the text condition vector encoding glyph characteristics, positional information, and semantic guidance. The functions

μ_{θ}

and

σ_{θ}

, parameterized by deep learning models, estimate the mean and variance.

3.1.2. Multi-Modal Conditional Control Mechanism

We develop a multi-modal conditional control mechanism integrating various inputs through an auxiliary latent module:

z_{a} = f (G (l_{g}) \oplus P (l_{p}) \oplus ε (l_{m})),

(3)

where G encodes glyph information (

l_{g}

) using a font rendering engine, P processes positional information (

l_{p}

) for diverse layouts, and

ε (l_{m})

is a VAE encoder for masked regions in text editing. Here, ⊕ denotes feature concatenation, and f is a fusion function that integrates these features into a unified latent representation

z_{a}

. This process is illustrated in Figure 2.

3.1.3. Cross-Lingual Text Embedding Representation

The text embedding module provides multilingual support by converting textual data into numerical representations:

c t e = τ_{θ} (Φ (y), ξ (γ_{θ} (e_{g}))),

(4)

where

Φ (y)

processes input text y using a placeholder mechanism. The component

γ_{θ} (e_{g})

is an OCR-driven feature extractor based on PP-OCRv4:

γ_{θ} (e_{g}) = PP - OCRv 4 (e_{g}) = CNN - Backbone (e_{g}) \to BiLSTM \to CTC - Head,

(5)

capturing stroke-level features for scripts like Tibetan (complex composite characters) and Uyghur (cursive script). The linear transformation layer

ξ

ensures dimensional compatibility, and

τ_{θ}

integrates all embeddings into a unified text condition representation. This process is illustrated in Figure 3.

3.1.4. Data Distribution and Diversity Assurance

To ensure dataset diversity and representativeness, we employ systematic strategies across multiple dimensions. Textual content is drawn from diverse domains such as news, the literature, academic writing, and everyday language. For visual variation, we simulate different writing instruments, ink effects, and paper textures. Degradation effects—including blur, noise, and perspective distortion—are intentionally incorporated to enhance model robustness. This comprehensive synthesis approach meets both the volume demands of deep learning and the quality standards necessary for practical application, establishing a solid foundation for model training.

3.2. Multilingual Extension of the GOT Model

Based on the constructed dataset, we conduct an extensive multilingual extension of the GOT model. While the original GOT excels at Chinese and English OCR, its architecture is tailored to mainstream languages, resulting in performance limitations for minority languages like Tibetan and Uyghur. To overcome this, we propose a systematic enhancement scheme shown in Figure 4.

3.2.1. Parameter-Efficient Model Adaptation Strategy

Freezing the encoder is a core strategy in our model adaptation. Based on transfer learning principles, we argue that the pretrained encoder has already acquired generic visual features valuable for cross-lingual character recognition [25]. Accordingly, we maintain the encoder parameters as:

θ_{E}^{new} = θ_{E}^{pretrained}, \frac{\partial L}{\partial θ_{E}} = 0,

(6)

This strategy leverages the cross-lingual universality of low- and mid-level visual features (edges, textures, shapes) learned during large-scale image pretraining. By freezing the encoder, we significantly reduce trainable parameters, improve training efficiency, and mitigate catastrophic forgetting when adapting to new languages [25]. Empirically, this approach enables rapid adaptation to novel language characteristics while maintaining Chinese and English performance, achieving approximately 40% reduction in training time compared to full-parameter fine-tuning.

3.2.2. Cross-Lingual Text Representation Learning

The enhanced text embedding module is the technical core of our multilingual extension. Although CLIP excels at multi-modal understanding, its capacity for character-level visual feature extraction is limited, especially for underrepresented writing systems. We redesign the text embedding generation as follows [26,27]:

c_{t e} = LayerNorm (W_{1} \cdot γ_{θ} (e_{g}) + W_{2} \cdot Φ (\dot{y}) + W_{3} \cdot e_{lang}),

(7)

in this architecture,

W_{1}, W_{2}, and W_{3} \in R^{d \times d}

are learnable weight matrices initialized using Xavier initialization [28] to ensure stable gradient flow during training. This initialization scheme proved crucial for avoiding convergence difficulties when fusing the visual OCR features with semantic embeddings from the CLIP encoder. Here,

e_{lang} \in R^{d}

is a language label embedding that explicitly informs the model of language identity, essential for distinguishing visually similar characters across languages. The OCR feature extractor

γ_{θ} (e_{g})

is implemented as a multi-stage pipeline based on PaddleOCRv4:

γ_{θ} (e_{g}) = PP - OCRv 4 (e_{g}) = \underset{Visual feature extraction}{\underset{︸}{CNN - Backbone (e_{g})}} \to \underset{sequence modeling}{\underset{︸}{BiLSTM}} \to \underset{sequence alignment}{\underset{︸}{CTC - Head}},

(8)

This design enables the model to capture unique structural characteristics: Tibetan stacking patterns and Uyghur cursive connections. LayerNorm ensures numerical stability during feature fusion.

It is important to clarify that our use of PP-OCRv4 for feature extraction creates a dependency on an external system, but this is limited to the training phase. During inference, the extended GOT model operates purely end-to-end without requiring PP-OCRv4. The OCR-driven features are distilled into the model during training, maintaining the “pure” end-to-end characteristic of the original GOT framework during deployment.

OCR-based embeddings provide crucial stroke-level features for scripts with unique visual structures (Tibetan stacking, Uyghur cursive), which CLIP’s semantic embeddings miss. Encoder freezing preserves transferable visual features while fine-tuning the decoder adapts to new linguistic patterns, preventing catastrophic forgetting for low-resource languages.

3.3. Hybrid Retrieval Framework

We introduce a two-stage hybrid retrieval framework that systematically integrates lexical matching with semantic understanding, enhancing efficiency while maintaining high accuracy [21]. Figure 5 illustrates the construction process.

As shown in Figure 5, the framework involves: Data Preparation and Preprocessing: cleaning and standardizing OCR-recognized multilingual text; Construction and Annotation of Question–Answer Pairs: manually annotating query-document pairs for relevance evaluation.

3.3.1. Dual-Path Feature Extraction

To address the unique characteristics of Tibetan and Uyghur texts—lacking explicit word boundaries and exhibiting complex morphology—we design a dual-path feature extraction framework combining lexical matching with semantic understanding.

The lexical pathway employs character-level TF-IDF vectorization with n-gram models (n = 1, 2) to capture character combination features and local contextual patterns [13]. The choice of n = 1 and n = 2 represents a practical trade-off: while longer n-grams could potentially capture more morphological information for highly agglutinative languages like Uyghur, they also exponentially increase the feature space dimensionality and computational cost. Our experiments with n = 3 showed diminishing returns with significantly increased resource requirements.This approach recognizes frequent character sequences and common morphological constructions.

The semantic pathway utilizes the DistilUSE multilingual model to generate 768-dimensional semantic embeddings [20], enabling deep understanding of semantic associations between queries and documents. The combination ensures handling of both exact keyword matching and conceptual similarity searches.

3.3.2. Feature Optimization and Fusion

We implement a feature optimization and fusion pipeline. Dimensionality reduction is applied to TF-IDF features to improve computational efficiency while preserving discriminative patterns. Optimized lexical features and semantic embeddings are fused into a unified representation space. For efficient similarity search, we use FAISS to construct optimized indices [22]. The retrieval process follows a two-stage strategy: lightweight character-level TF-IDF screening identifies top-K candidates, followed by deep semantic re-ranking.

The final retrieval score is a weighted combination:

S (q, d) = α \cdot S_{lex} (q, d) + (1 - α) \cdot S_{sem} (q, d),

(9)

where

S_{lex}

is lexical matching score,

S_{sem}

is semantic similarity score, and

α = 0.4

balances lexical precision and semantic understanding.The value

α = 0.4

was determined through systematic grid search over the range [0, 1] with step 0.1, where this setting consistently yielded the best F1@10 scores across all three languages. A sensitivity analysis reveals that performance remains stable for

α

values between 0.3 and 0.5, with F1@10 varying by less than 2%. Values outside this range lead to more significant performance degradation, either from over-reliance on lexical matching (

α > 0.6

) or semantic matching (

α < 0.2

).

4. Experiment

4.1. Experimental Setup

To ensure a fair comparison, we follow the experimental setup outlined in the GOT paper. All experiments are conducted using

2 \times

L20 GPUs within the PyTorch 2.8.0 framework. The expanded GOT model is optimized using the AdamW [29] optimizer, with an initial learning rate of

2 \times 10^{- 5}

and a cosine annealing schedule [30]. The model is trained for 20 epochs on a combined dataset consisting of MultiLang-OCR-30K MultiLang-OCR-30K (Appendix A) and AnyWord-3 M with a batch size of 32. The text-aware loss weight

λ

is set to 0.01.The retrieval corpus consists of 30,000 OCR-recognized text lines from our dataset. We constructed 150 query-document pairs per language (450 total) by manually selecting queries and annotating relevant documents. Queries were designed to cover diverse topics and lengths to ensure robustness. For a comprehensive evaluation, we compare our approach with the following representative methods:

(1) Paddle-OCR v4 [5]: a widely used multi-module OCR system in industry.

(2) The original GOT [1]: the baseline model without the multilingual extension.

(3) TextDiffuser [9]: a method for text generation and recognition based on diffusion models.

(4) GlyphControl: a text generation model that emphasizes glyph control.

In order to clearly identify the optimal performance of each method, the bold data in the table corresponds to the best results under each indicator.

4.2. Comprehensive Comparison of Multilingual OCR Performance

We conducted a rigorous evaluation using our custom-constructed test set, with the results detailed in Table 1, Table 2 and Table 3. Our approach demonstrated optimal performance in recognizing handwritten texts across three ethnic languages.

Results on the handwritten Chinese dataset show that our extended GOT model significantly outperforms existing methods across all evaluation metrics. Compared to the industry-standard multi-module OCR system PaddleOCRv4, our approach reduces the Edit Distance by 29.4% and improves the F1-score by 10.3%. The particularly poor performance of the original GOT model highlights its inherent limitations in recognizing handwritten minority language text. Compared to diffusion model-based methods, our approach reduces Edit Distance by 45.1% and 42.0% against TextDiffuser and GlyphControl, respectively, demonstrating the effectiveness of our proposed “freeze encoder–fine-tune decoder” strategy combined with the OCR text embedding module.The superior performance across Precision, Recall, BLEU, and METEOR metrics further confirms the comprehensive advantages of our method in capturing both character-level accuracy and semantic coherence in handwritten Chinese text recognition.

The experimental results on the Tibetan handwritten dataset validate the generalization of our method. The original GOT model fails completely on this task, underscoring the challenges of data scarcity and adaptation for minority languages. Compared to PaddleOCRv4, our method reduces Edit Distance by 15.6% and improves F1-score by 3.0 percentage points. Against generative methods, it achieves Edit Distance reductions of 44.1% and 40.4% versus TextDiffuser and GlyphControl, respectively. Tibetan’s complex composite structure demands robust feature extraction. Our OCR-driven feature extractor effectively captures character-level strokes, while language label embeddings help distinguish visually similar cross-lingual characters, demonstrating the efficacy of our multilingual extension for Tibetan.

Results on the Uyghur handwritten dataset further verify our method’s robustness. The original GOT fails completely, revealing its limitations with non-mainstream writing directions and low-resource languages. Compared to PaddleOCRv4, our approach reduces Edit Distance by 14.2% and improves F1-score by 3.9%. It achieves substantial Edit Distance reductions of 43.7% and 40.4% against TextDiffuser and GlyphControl, respectively. The cursive, right-to-left Uyghur script challenges sequence modeling. Our BiLSTM-based OCR extractor captures continuous stroke patterns, while language label embedding disambiguates cross-lingual characters, validating our adaptation strategy for scripts with distinct directional and structural traits.

4.2.1. Limitations of Validation Using Real-World Datasets

It is important to acknowledge that our current evaluation primarily relies on synthetic datasets generated by the AnyText framework. While these datasets are carefully constructed to mimic real-world conditions, there remains a gap between synthetic data and genuine handwritten documents. Real-world documents often exhibit greater variability in writing styles, paper quality, degradation patterns, and contextual noise. To partially address this limitation, we incorporated various degradation effects during data synthesis, including blurriness, noise, and perspective distortions. However, future work should include validation on authentic historical documents and contemporary handwritten materials to further verify the practical applicability of our approach. Such real-world validation would provide more comprehensive insights into the system’s robustness and generalization capabilities.

4.2.2. Synthetic-to-Real Domain Gap Analysis

To quantitatively assess the domain gap between synthetic and real data, we evaluated our model on a small collection of 500 real handwritten samples per language. Performance decreased compared to synthetic test results: Chinese F1-score decreased by 8.2%, Tibetan by 12.7%, and Uyghur by 11.4%. Detailed error analysis revealed distinct patterns: Approximately 65% of Chinese errors involve structurally similar characters or dense stroke patterns; 70% of Tibetan errors involve misrecognizing stacked character components; and 60% of Uyghur errors involve distinguishing connected letter forms in cursive writing.

4.2.3. Error Analysis and Performance Limitations

A detailed error analysis reveals several patterns in the model’s performance limitations. For Chinese text recognition, errors most frequently occur with highly cursive handwriting styles where character strokes are connected or partially obscured. In Tibetan recognition, the primary challenges involve stacked character components that may be mis-segmented or incorrectly recognized when written with inconsistent spacing. Uyghur text presents unique difficulties due to its right-to-left cursive nature, with errors often arising from diacritic mark placement and letter connectivity variations.

Across all three languages, we observe that performance degrades significantly under the following conditions: (1) severe image degradation (blur levels exceeding Gaussian blur with

σ

= 3.0), (2) extreme perspective distortions (angles beyond 30 degrees), (3) complex background textures that interfere with text-background separation, and (4) unusually small character sizes (below 12 pixels in height). These limitations highlight areas for future improvement, particularly in enhancing the model’s robustness to challenging visual conditions and expanding the diversity of writing styles in training data.

4.3. Ablation Experiments

We validated the necessity of each component through an ablation study, with the results presented in Table 4.

Excluding the text perception loss (

L_{tp}

) led to a marked deterioration in performance across all metrics, with the Edit Distance rising by 30.1% and the F1-score falling by 5.8%. This underscores the critical importance of

L_{tp}

in enhancing the accuracy of visual text recognition. In addition, substituting OCR text embeddings with CLIP embeddings resulted in a substantial performance decline, as indicated by a 68.8% increase in the Edit Distance, which highlights the necessity of incorporating character-level stroke features for accurately recognizing minority languages. Moreover, freezing the encoder not only preserved performance but also mitigated catastrophic forgetting; when the encoder was not frozen, the Edit Distance increased by 20.8%. Finally, the reliance on synthetic data for model training was evident, as the use of limited real data led to a severe performance drop, with the Edit Distance escalating by 124.3%.

4.4. Detailed Comparison with State-of-the-Art OCR Methods

To further validate the advanced nature of our method, we compare it with the current state-of-the-art approaches in document-level OCR tasks. The results are presented in Table 5 and Table 6.

Despite having a significantly smaller parameter size than other large-scale models, our extended GOT method exhibits competitive and often superior performance in both Chinese and English document recognition tasks [34,35]. In particular, on Chinese documents, our approach achieves state-of-the-art results, with an Edit Distance of 0.042 and an F1-score of 0.968.

Our method surpasses or rivals state-of-the-art approaches across most metrics in the English documentation, thereby demonstrating that our extended solution enhances minority language capabilities without compromising overall performance.

4.5. Comparison of Retrieval Performance

We systematically evaluated the performance of various retrieval methods using the optical character recognition (OCR) outputs generated by our system. The results of this evaluation are summarized in Table 7.

Our hybrid retrieval framework achieves accuracy comparable to pure semantic search (average F1@10 of 0.789) while reducing response times by approximately 73%. Consequently, it attains an optimal balance between accuracy and efficiency. It should be clarified that the reported average response time of 23 ms refers specifically to the retrieval step after OCR completion, measured using FAISS indexing and excluding the OCR processing time. The total end-to-end system latency, including OCR, averages 850 ms per page on our hardware setup. While our comparison includes traditional methods (TF-IDF) and neural approaches (Sentence-BERT), we acknowledge that more recent cross-lingual retrieval methods such as mDPR or LaBSE exist. However, given our focus on resource-efficient deployment and the specific characteristics of minority languages, we selected Sentence-BERT for its balance of performance and efficiency. A comprehensive comparison with state-of-the-art cross-lingual retrieval methods remains an area for future work.

5. Conclusions

In this paper, we have systematically proposed and implemented an intelligent text retrieval system tailored for minority languages. Our primary contribution lies in the deep integration of AnyText’s advanced data synthesis capability with GOT’s unified end-to-end recognition architecture, which successfully overcomes critical OCR bottlenecks in low-resource languages. First, the high-quality MultiLang-OCR-30 K dataset we constructed provides a robust solution to data scarcity. Second, our extended “freeze the encoder–fine-tune the decoder” strategy, combined with the OCR text embedding module and the text-aware loss function, significantly enhances the GOT model’s recognition capabilities for Tibetan and Uyghur while maintaining its high performance for Chinese and English. Finally, the proposed character-level hybrid retrieval framework achieves an effective balance between retrieval accuracy and efficiency. Comprehensive experiments show that our approach substantially outperforms existing methods in both OCR accuracy and retrieval performance. This work offers a practical and efficient end-to-end solution for multilingual and cross-modal document intelligence processing, thereby contributing to the digital preservation and utilization of ethnic cultural heritage.

6. Discussion

Despite promising results, our work has several limitations. The current study focuses primarily on Chinese, Tibetan, and Uyghur, leaving many other minority languages unaddressed and necessitating further evaluation of the model’s generalizability across a wider linguistic spectrum. Future work should expand to include additional writing systems, particularly those with different structural properties such as Arabic, Devanagari, or Georgian scripts, to test the true cross-lingual adaptability of our approach. Furthermore, while the GOT-2.0 framework theoretically supports shape recognition, our model has not been optimized for processing documents containing mixed text and diagrams, which is a valuable direction for future research. The model’s performance also degrades under challenging real-world conditions like severe blur or occlusion, indicating a need for more robust preprocessing or training techniques. Although we have used carefully constructed synthetic datasets, validation on authentic historical and contemporary materials would provide more reliable performance assessments.

In terms of computational requirements, our extended model increases training complexity through the text-aware loss function and multi-stage training pipeline. Compared to the original GOT which required ∼20 h on 2 × L20 GPUs, our training time increased by approximately 30% to ∼26 h. During inference, the enhanced text embedding module adds ∼15 ms per image, bringing total inference time to ∼85 ms for a

1024 \times 1024

image. The ’freeze encoder–fine-tune decoder’ strategy helps mitigate some computational costs, but the overall training still requires substantial resources. For deployment in resource-constrained environments, optimization strategies such as model compression (e.g., quantization, pruning) and advanced coded distributed computing approaches—such as those enabling verifiable computation [37] and optimizing resource allocation [38]—offer promising avenues to distribute loads and potentially reduce training time by 25–40% while maintaining accuracy. Finally, while efficient, the hybrid retrieval framework could benefit from more sophisticated fusion strategies beyond linear weighting to further improve accuracy across diverse queries.

Author Contributions

Writing—original draft preparation, S.Y.; validation, S.Y.; Conceptualization, Z.L.; methodology, Z.L.; supervision, Z.L.; funding acquisition, Z.L.; visualization, K.L.; writing—review and editing, R.S.; formal analysis, Y.L.; investigation, X.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (Grant No. 62162061), the Doctoral Research Foundation of Xinjiang Normal University (Grant No. XJNUBS2115), and the Xinjiang Normal University Youth Top Talents Project (Grant No. XJNUQB2022-21). Additional funding came from the General Program of the Natural Science Foundation of Xinjiang Uygur Autonomous Region (Grant No. 2024D01A94), the General Program of the Science and Technology Program of Xinjiang Uygur Autonomous Region (Grant No. 2022D01A228), and the Xinjiang Key Research and Development Program (Grant No. 2022B01007-1).

Data Availability Statement

The original contributions presented in this study include the MultiLang-OCR-30K dataset. Representative samples of the dataset are provided in the Appendix A as shown in Figure A1. The complete dataset is available from the corresponding author(s) upon reasonable request.

Acknowledgments

We gratefully acknowledge the exemplary open-source contributions from the AnyText and GOT teams, whose work provided a solid foundation for our research. We further thank the institutions that supplied computational resources and the reviewers for their constructive suggestions, as well as our project team members for their close collaboration and unwavering dedication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To demonstrate the quality and diversity of the MultiLang-OCR-30K dataset constructed by our research institute, this appendix presents representative examples from the dataset. The dataset contains handwritten text images in three languages—Chinese, Tibetan, and Uyghur—totaling 30,000 samples. The following are some examples for each language, with each example including the original image and its corresponding annotated text. For some datasets, please click https://pan.baidu.com/s/1MimvVhpmW5o2KsfswMa8dQ (accessed on 6 February 2026).

Figure A1. Handwritten text samples generated on a common background.

References

Wei, H.; Liu, C.; Chen, J.; Wang, J.; Kong, L.; Xu, Y.; Ge, Z.; Zhao, L.; Sun, J.; Peng, Y. General OCR Theory: Towards OCR-2.0 via a Unified End-to-End Model. arXiv 2024, arXiv:2409.01704. [Google Scholar] [CrossRef]
Tuo, Y.; Xiang, W.; He, J.Y.; Geng, Y.; Xie, X. AnyText: Multilingual Visual Text Generation and Editing. arXiv 2023, arXiv:2311.03054. [Google Scholar] [CrossRef]
Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition, Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar] [CrossRef]
Du, Y.; Li, C.; Guo, R.; Cui, C.; Liu, W.; Zhou, J.; Lu, B.; Yang, Y.; Liu, Q.; Hu, X.; et al. PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System. arXiv 2021, arXiv:2109.03144. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 31 January 2021).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 12 June 2017).
Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. Proc. AAAI Conf. Artif. Intell. 2023, 37, 13094–13102. [Google Scholar] [CrossRef]
Salehudin, M.A.M.; Basah, S.N.; Yazid, H.; Basaruddin, K.S.; Safar, M.J.A.; Som, M.H.M.; Sidek, K.A. Analysis of Optical Character Recognition Using EasyOCR under Image Degradation. Proc. J. Phys. Conf. Ser. 2023, 2641, 012001. [Google Scholar] [CrossRef]
Ma, J.; Zhao, M.; Chen, C.; Wang, R.; Niu, D.; Lu, H.; Lin, X. GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently. arXiv 2023, arXiv:2303.17870. [Google Scholar] [CrossRef]
Chen, J.; Huang, Y.; Lv, T.; Cui, L.; Chen, Q.; Wei, F. TextDiffuser: Diffusion Models as Text Painters. Adv. Neural Inf. Process. Syst. 2023, 36, 9353–9387. [Google Scholar] [CrossRef]
Yang, Y.; Gui, D.; Yuan, Y.; Liang, W.; Ding, H.; Hu, H.; Chen, K. GlyphControl: Glyph Conditional Control for Visual Text Generation. Adv. Neural Inf. Process. Syst. 2023, 36, 44050–44066. [Google Scholar] [CrossRef]
Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
Ramos, J. Using TF-IDF to Determine Word Relevance in Document Queries. Proc. First Instr. Conf. Mach. Learn. 2003, 242, 29–48. Available online: https://www.semanticscholar.org/paper/Using-TF-IDF-to-Determine-Word-Relevance-in-Queries-Ramos/b3bf6373ff41a115197cb5b30e57830c16130c2c (accessed on 1 January 2003).
Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found Trends Inf. Retr. 2009, 3, 333–389. Available online: https://www.researchgate.net/publication/220613776_The_Probabilistic_Relevance_Framework_BM25_and_Beyond (accessed on 20 November 2020). [CrossRef]
Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.S.H.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
Conneau, A.; Lample, G. Cross-lingual Language Model Pretraining. Adv. Neural Inf. Process. Syst. 2019, 32, 7059–7069. Available online: https://proceedings.neurips.cc/paper_files/paper/2019/hash/c04c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html (accessed on 1 November 2019).
Chi, Z.; Huang, S.; Dong, L.; Ma, S.; Zheng, B.; Singhal, S.; Bajaj, P.; Song, X.; Mao, X.L.; Huang, H.Y.; et al. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6170–6182. Available online: https://aclanthology.org/2022.acl-long.427/ (accessed on 23 May 2022).
Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 483–498. [Google Scholar] [CrossRef]
Yang, Y.; Cer, D.; Ahmad, A.; Guo, M.; Law, J.; Constant, N.; Abrego, G.H.; Yuan, S.; Tar, C.; Sung, Y.H.; et al. Multilingual Universal Sentence Encoder for Semantic Retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 87–94. [Google Scholar] [CrossRef]
Koilada, D.K. Hybrid Semantic Retrieval: Augmenting Weighted TF-IDF with BERT for Enhanced Question Answering. Eng. Arch. 2023. Available online: https://www.atlantis-press.com/proceedings/icsiaiml-25/126021198 (accessed on 6 January 2026).
Johnson, J.; Douze, M.; Jégou, H. Billion-scale Similarity Search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
Kingma, D.P.; Welling, M. Auto-encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming Catastrophic Forgetting in Neural Networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
Zhuang, L.; Wayne, L.; Ya, S.; Jun, Z. A Robustly Optimized BERT Pre-training Approach with Post-Training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, Huhhot, China, 13–15 August 2021; Chinese Information Processing Society of China: Beijing, China, 2021; pp. 1218–1227. Available online: https://aclanthology.org/2021.ccl-1.108 (accessed on 15 August 2021).
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 6 June 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. Proc. Track 2010, 9, 249–256. Available online: http://proceedings.mlr.press/v9/glorot10a.html (accessed on 5 September 2010).
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar] [CrossRef]
Ye, J.; Hu, A.; Xu, H.; Ye, Q.; Yan, M.; Xu, G.; Li, C.; Tian, J.; Qian, Q.; Zhang, J.; et al. UReader: Universal OCR-Free Visually-Situated Language Understanding with Multimodal Large Language Model. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 2841–2858. Available online: https://arxiv.org/abs/2310.05126 (accessed on 10 December 2023).
Wei, H.; Kong, L.; Chen, J.; Zhao, L.; Ge, Z.; Yang, J.; Sun, J.; Han, C.; Zhang, X. Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Model. In Proceedings of the Computer Vision—ECCV 2024 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 408–424. [Google Scholar] [CrossRef]
Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring Plain Vision Transformer Backbones for Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 280–296. [Google Scholar] [CrossRef]
Kim, W.; Kruglik, S.; Kiah, H.M. Verifiable Coded Computation of Multiple Functions. IEEE Trans. Inf. Theory 2024, 70, 528–549. [Google Scholar] [CrossRef]
Zhou, X.; Shroff, N. Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits. Entropy 2025, 27, 541. [Google Scholar] [CrossRef]

Figure 1. The framework has two core components: a text-controlled diffusion pipeline and a text-aware loss. The pipeline encodes text via CLIP, conducts iterative denoising with U-Net cross-attention, and uses regional masking for precise text rendering. The text-aware loss relies on OCR to ensure generated text clarity and accuracy, balancing global image quality and local text precision.

Figure 2. Depicts the auxiliary generation framework comprising a CLIP text encoder, a U-Net denoiser, and a VAE decoder. The CLIP encoder provides semantic guidance, the U-Net performs iterative denoising in latent space, and the VAE decoder reconstructs high-quality images. This pipeline, augmented by a ControlNet for conditional injection, enables controllable text-to-image synthesis.

Figure 3. Illustrates the Text Embedding Module for Tibetan text: Tokenizer segments Tibetan text into tokens, and Text Encoder converts them to embeddings. Token Replace and linear projection optimize these embeddings, which are then fused with glyph features from the OCR Encoder to produce refined embeddings for accurate text rendering.

Figure 4. Outline of the three-stage GOT framework: (1) Pre-training a visual encoder with OPT-125 M; (2) Joint-training by integrating it with Qwen-0.5 B for extended OCR; (3) Post-training the language decoder for new character recognition.

Figure 5. Construction process of the hybrid retrieval framework.

Table 1. Performance comparison on handwritten Chinese dataset.

Method	Parameter	EditDistance	F1-Score	Precision	Recall	BLEU	METEOR
PaddleOCRv4 [5]	–	0.245	0.720	0.780	0.670	0.650	0.690
GOT [1]	580 M	0.648	0.352	0.487	0.275	0.189	0.312
TextDiffuser [9]	–	0.315	0.685	0.752	0.628	0.598	0.642
GlyphControl [10]	–	0.298	0.702	0.768	0.645	0.623	0.668
Ours	580 M	0.173	0.823	0.856	0.785	0.781	0.812

Table 2. Performance comparison on Tibetan datasets.

Method	Parameter	EditDistance	F1-Score	Precision	Recall	BLEU	METEOR
PaddleOCRv4 [5]	–	0.231	0.735	0.823	0.665	0.698	0.723
GOT [1]	580 M	–	–	–	–	–	–
TextDiffuser [9]	–	0.349	0.651	0.789	0.554	0.567	0.601
GlyphControl [10]	–	0.327	0.673	0.801	0.581	0.589	0.634
Ours	580 M	0.195	0.765	0.836	0.705	0.723	0.758

Table 3. Performance comparison on the Uyghur script dataset.

Method	Parameter	EditDistance	F1-Score	Precision	Recall	BLEU	METEOR
PaddleOCRv4 [5]	–	0.218	0.742	0.831	0.673	0.685	0.711
GOT [1]	580 M	–	–	–	–	–	–
TextDiffuser [9]	–	0.332	0.668	0.795	0.573	0.582	0.617
GlyphControl [10]	–	0.314	0.685	0.809	0.592	0.604	0.649
Ours	580 M	0.187	0.781	0.842	0.728	0.735	0.769

Table 4. Ablation experiments on the Chinese handwritten data set.

Configuration	Edit-Distance	F1-Score	Precision	Recall	BLEU	METEOR
complete model (Ours)	0.173	0.823	0.856	0.793	0.781	0.812
Remove text perception loss	0.225	0.775	0.812	0.742	0.732	0.765
CLIP-embedding (non-OCR)	0.292	0.708	0.752	0.671	0.654	0.698
Unfrozen-encoder	0.209	0.791	0.823	0.763	0.751	0.783
no synthetic data	0.388	0.612	0.689	0.549	0.523	0.578

Table 5. Comparison with state-of-the-art methods for Chinese document-level OCR tasks.

Method	Parameter	Edit-Distance	F1-Score	Precision	Recall	BLEU	METEOR
UReader [31]	7 B	0.718	0.344	0.296	0.469	0.103	0.287
LLaVA-NeXT [22]	34 B	0.430	0.647	0.573	0.881	0.478	0.582
InternVL-ChatV1.5 [13]	26 B	0.265	0.816	0.784	0.866	0.622	0.717
Nougat [5]	250 M	0.255	0.745	0.720	0.809	0.665	0.761
TextMonkey [11]	7 B	0.265	0.821	0.778	0.906	0.671	0.762
DocOw1.5 [21]	7 B	0.258	0.862	0.835	0.962	0.788	0.858
Vary [32]	7 B	0.113	0.952	0.951	0.944	0.754	0.873
Qwen-VL-Max [33]	>72 B	0.091	0.931	0.917	0.946	0.756	0.885
Fox [25]	1.8 B	0.061	0.954	0.964	0.946	0.842	0.908
Ours	580 M	0.042	0.968	0.972	0.964	0.865	0.925

Table 6. Comparison with state-of-the-art methods for English document-level OCR tasks.

Method	Parameter	Edit-Distance	F1-Score	Precision	Recall	BLEU	METEOR
UReader [31]	7 B	0.718	0.344	0.296	0.469	0.103	0.287
LLaVA-NeXT [22]	34 B	0.430	0.647	0.573	0.881	0.478	0.582
InternVL-ChatV1.5 [13]	26 B	0.393	0.751	0.698	0.917	0.568	0.663
Nougat [5]	250 M	0.255	0.745	0.720	0.890	0.665	0.761
TextMonkey [11]	7 B	0.265	0.821	0.778	0.906	0.671	0.762
DocOw1.5 [21]	7 B	0.258	0.862	0.835	0.962	0.788	0.858
Vary [32]	7 B	0.092	0.918	0.906	0.956	0.885	0.926
Qwen-VL-Max [33]	>72 B	0.057	0.964	0.955	0.977	0.942	0.971
Fox [25]	1.8 B	0.046	0.952	0.957	0.948	0.930	0.954
Ours	580 M	0.038	0.972	0.971	0.973	0.947	0.958

Table 7. Comparison of Retrieval Performance (F1@10/Response Time (ms)).

Search Method	Chinese Handwriting	Tibetan	Uyghur	Average F1@10	Ave Response Time
Simple search	0.621	0.583	0.592	0.578	8
TF-IDF [3]	0.735	0.698	0.705	0.695	12
Sentence-BERT [36]	0.782	0.751	0.763	0.756	85
Ours	0.813	0.785	0.792	0.789	23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, S.; Liu, Z.; Li, K.; Song, R.; Li, Y.; Qi, X. Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search. Appl. Sci. 2026, 16, 1771. https://doi.org/10.3390/app16041771

AMA Style

Yang S, Liu Z, Li K, Song R, Li Y, Qi X. Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search. Applied Sciences. 2026; 16(4):1771. https://doi.org/10.3390/app16041771

Chicago/Turabian Style

Yang, Shuo, Zhandong Liu, Ke Li, Ruixia Song, Yong Li, and Xiangwei Qi. 2026. "Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search" Applied Sciences 16, no. 4: 1771. https://doi.org/10.3390/app16041771

APA Style

Yang, S., Liu, Z., Li, K., Song, R., Li, Y., & Qi, X. (2026). Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search. Applied Sciences, 16(4), 1771. https://doi.org/10.3390/app16041771

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search

Abstract

1. Introduction

2. Related Work

2.1. End-to-End OCR Models

2.2. Visual Text Generation and Data Synthesis

2.3. Multilingual Text Retrieval

3. Methodology

3.1. MultiLang-OCR-30K Dataset Construction

3.1.1. Theoretical Framework for Data Synthesis Based on Diffusion Models

3.1.2. Multi-Modal Conditional Control Mechanism

3.1.3. Cross-Lingual Text Embedding Representation

3.1.4. Data Distribution and Diversity Assurance

3.2. Multilingual Extension of the GOT Model

3.2.1. Parameter-Efficient Model Adaptation Strategy

3.2.2. Cross-Lingual Text Representation Learning

3.3. Hybrid Retrieval Framework

3.3.1. Dual-Path Feature Extraction

3.3.2. Feature Optimization and Fusion

4. Experiment

4.1. Experimental Setup

4.2. Comprehensive Comparison of Multilingual OCR Performance

4.2.1. Limitations of Validation Using Real-World Datasets

4.2.2. Synthetic-to-Real Domain Gap Analysis

4.2.3. Error Analysis and Performance Limitations

4.3. Ablation Experiments

4.4. Detailed Comparison with State-of-the-Art OCR Methods

4.5. Comparison of Retrieval Performance

5. Conclusions

6. Discussion

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI