FLIP: A Novel Feedback Learning-Based Intelligent Plugin Towards Accuracy Enhancement of Chinese OCR

Xinyue Tao; Yueyue Han; Yakai Jin; Yunzhi Wu

doi:10.3390/math13152372

¹ School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China

² Anhui Beidou Precision Agriculture Information Engineering Research Center, Anhui Agricultural University, Hefei 230036, China

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(15), 2372; https://doi.org/10.3390/math13152372

This article belongs to the Special Issue Crowdsourcing Learning: Theories, Algorithms, and Applications

Abstract

Chinese Optical Character Recognition (OCR) technology is essential for digital transformation in Chinese regions, enabling automated document processing across various applications. However, Chinese OCR systems struggle with visually similar characters, where subtle stroke differences lead to systematic recognition errors that limit practical deployment accuracy. This study develops FLIP (Feedback Learning-based Intelligent Plugin), a lightweight post-processing plugin designed to improve Chinese OCR accuracy across different systems without external dependencies. The plugin operates through three core components as follows: UTF-8 encoding-based output parsing that converts OCR results into mathematical representations, error correction using information entropy and weighted similarity measures to identify and fix character-level errors, and adaptive feedback learning that optimizes parameters through user interactions. The approach functions entirely through mathematical calculations at the character encoding level, ensuring universal compatibility with existing OCR systems while effectively handling complex Chinese character similarities. The plugin’s modular design enables seamless integration without requiring modifications to existing OCR algorithms, while its feedback mechanism adapts to domain-specific terminology and user preferences. Experimental evaluation on 10,000 Chinese document images using four state-of-the-art OCR models demonstrates consistent improvements across all tested systems, with precision gains ranging from 1.17% to 10.37% and overall Chinese character recognition accuracy exceeding 98%. The best performing model achieved 99.42% precision, with ablation studies confirming that feedback learning contributes additional improvements from 0.45% to 4.66% across different OCR architectures.

Keywords: optical character recognition; post-processing; text recognition; machine learning

MSC: 68T05

1. Introduction

Modern Optical Character Recognition (OCR) systems operate through a sophisticated pipeline comprising the following three interconnected components: text detection, text recognition, and post-processing optimization. Text detection algorithms identify and localize text regions within images, with recent advances including edge-based approaches such as Edge Approximation Text Detector [1] and growth-based methods like Text Growing on Leaf [2], which demonstrate superior performance in complex scene text scenarios. Text recognition modules then convert detected text regions into machine-readable characters, with transformer-based architectures such as Text Spotting Transformers [3] and SwinTextSpotter [4] achieving state-of-the-art performance through improved synergy between detection and recognition components [5]. Recent advances in post-OCR processing have also explored competition-based evaluation frameworks [6] and multi-lingual optimization approaches [7].

Chinese OCR technology has emerged as a fundamental enabler of digital transformation across Chinese regions, serving as the technological backbone for automated document processing in diverse domains including legal contract analysis, financial record management, and administrative workflow automation [8,9]. The widespread adoption of Chinese OCR systems has revolutionized how organizations handle document digitization, data extraction, and information retrieval, directly impacting business efficiency and legal compliance in sectors where accurate Chinese character recognition is mission-critical [5].

However, Chinese OCR systems face unprecedented challenges that significantly exceed those encountered in alphabetic writing systems [10,11]. The fundamental complexity of Chinese characters presents systematic recognition difficulties through thousands of logographic characters, where visually similar characters frequently share common radicals, stroke patterns, and structural components while differing only in subtle positioning, stroke count, or component arrangements [11,12]. These inherent characteristics of Chinese writing systems create persistent recognition ambiguities that conventional OCR approaches struggle to resolve effectively, with radical-based similarities being a primary source of recognition errors [10,12].

Current approaches to addressing Chinese OCR limitations encompass traditional image preprocessing and algorithm optimization, sophisticated deep learning architectures, and large language model integration [13,14]. While these advances have yielded substantial improvements in overall recognition performance, they consistently fall short when confronted with the nuanced challenges of distinguishing visually similar Chinese characters [10,15]. Moreover, existing solutions suffer from critical limitations including dependency on external Chinese linguistic resources, limited adaptability to domain-specific Chinese terminology, substantial computational overhead, and inability to maintain consistent performance across different Chinese OCR systems and deployment environments [16,17].

Traditional Chinese OCR post-processing methods, primarily relying on Chinese dictionary-based corrections and rule-based systems, struggle with specialized terminology in legal contracts, technical documents, and domain-specific applications [18]. Statistical and machine learning approaches, while more sophisticated, often require extensive Chinese training corpora and computational resources that limit their practical applicability in real-time Chinese document processing scenarios [19]. Natural language processing-based post-processing approaches have demonstrated effectiveness in improving OCR accuracy through linguistic analysis and correction [20,21], though these methods typically require language-specific knowledge bases and may not generalize effectively across different domains or writing systems. Language-specific OCR post-processing solutions, such as those developed for Myanmar [22], have shown the importance of tailored approaches for complex scripts, yet they highlight the challenge of developing universally applicable correction mechanisms that can adapt to diverse linguistic requirements without extensive language-specific customization.

Recent developments have explored large language model integration for post-OCR correction [14,23], natural language processing pipelines [24], and reinforcement learning optimization [25], yet these approaches face deployment challenges due to computational requirements and external dependencies.

To address these fundamental challenges, this study introduces FLIP (Feedback Learning-based Intelligent Plugin), a novel lightweight and adaptive post-processing plugin specifically designed to improve Chinese OCR accuracy across different systems without requiring external dependencies. FLIP represents a paradigm shift in Chinese OCR post-processing by operating entirely through mathematical calculations at the character encoding level, ensuring universal compatibility with existing Chinese OCR systems while effectively addressing complex character similarities inherent in Chinese writing systems. The plugin’s innovative approach eliminates dependencies on external linguistic resources while providing mathematically grounded similarity analysis that is particularly effective for Chinese character relationships and adaptive learning capabilities.

2. Methodology

At its core, FLIP solves a binary classification problem: deciding whether two characters are visually similar or dissimilar. The central challenge lies in quantifying uncertainty and establishing reliable decision boundaries. Information entropy is a well-established measure for classification problems. The ID3 and C4.5 decision-tree algorithms, which use information entropy to identify the most discriminative attributes in complex feature spaces, illustrate its effectiveness in classification tasks. Based on this theoretical foundation, we selected information entropy as the core algorithmic basis for character similarity computation.

2.1. ID3 and C4.5 Algorithms

ID3 Algorithm Foundation: The ID3 (Iterative Dichotomiser 3) algorithm employs information entropy as the fundamental criterion for attribute selection and tree construction. The algorithm’s effectiveness stems from its ability to systematically reduce uncertainty through optimal attribute partitioning, making it particularly well-suited for classification problems involving high-dimensional feature spaces with subtle inter-class differences.

For a given dataset S with c distinct classes, the information entropy quantifies the impurity or uncertainty as follows:

$H(S) = -\sum_{i=1}^{c} p_{i} \log_{2} p_{i},$

(1)

where $p_{i}$ represents the proportion of examples belonging to class i in dataset S.

ID3’s attribute selection mechanism employs information gain to measure the effectiveness of each attribute in reducing entropy as follows:

$IG(S, A) = H(S) - \sum_{j=1}^{v} \frac{|S_{j}|}{|S|} H(S_{j}),$

(2)

where attribute A partitions dataset S into subsets $\{S_{1}, S_{2}, \dots, S_{v}\}$.

Threshold-Based Decision Making in ID3: The algorithm employs threshold comparison for decision node creation. At each internal node, ID3 selects the attribute with maximum information gain as follows:

$A^{*} = \arg\max_{A \in \mathcal{A}} IG(S, A),$

(3)

where $\mathcal{A}$ represents the set of available attributes. The algorithm terminates when information gain falls below a predefined threshold $τ_{IG}$, expressed as follows:

$\text{Stop if } \max_{A \in \mathcal{A}} IG(S, A) < τ_{IG}.$

(4)
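The entropy, information gain, attribute selection, and stopping rule of Equations (1) to (4) can be sketched in a few lines of Python. This is an illustrative sketch only: the toy labels, the attribute names `shares_radical` and `same_stroke_count`, and the default value of `tau_ig` are assumptions made for the example and do not come from the experiments in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(S) in bits, as in Equation (1)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(labels, values):
    """Information gain IG(S, A), as in Equation (2)."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def select_attribute(labels, attributes, tau_ig=0.01):
    """Pick the attribute with maximum gain (Equation (3)).

    Returns None when the best gain is below tau_ig (Equation (4)).
    tau_ig = 0.01 is a hypothetical default, not a value from the paper.
    """
    gains = {name: information_gain(labels, vals)
             for name, vals in attributes.items()}
    best = max(gains, key=gains.get)
    return None if gains[best] < tau_ig else best


# Toy data (hypothetical): four character pairs with two candidate attributes.
labels = ["similar", "similar", "dissimilar", "dissimilar"]
attributes = {
    "shares_radical": [True, True, False, False],     # separates the classes
    "same_stroke_count": [True, False, True, False],  # carries no information
}
best = select_attribute(labels, attributes)
print(best)  # shares_radical
```

In this toy case the first attribute removes all uncertainty (gain of 1 bit) while the second leaves the entropy unchanged (gain of 0), so the selection rule picks the first and the stopping rule would fire if only the second were available.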

C4.5 Algorithm Enhancement: Building upon ID3’s foundation, the C4.5 algorithm introduces significant improvements that address key limitations while maintaining the core entropy-based approach. C4.5 enhances the original framework through gain ratio normalization, continuous attribute handling, and improved pruning mechanisms, establishing it as one of the most successful entropy-based classification algorithms.

Gain Ratio Normalization: C4.5 addresses ID3’s bias toward attributes with many values by introducing the following gain ratio measure:

$GR(S, A) = \frac{IG(S, A)}{IV(S, A)},$

(5)

where $IV(S, A)$ represents the intrinsic value of attribute A, expressed as follows:

$IV(S, A) = -\sum_{j=1}^{v} \frac{|S_{j}|}{|S|} \log_{2} \frac{|S_{j}|}{|S|}.$

(6)

This normalization ensures that the algorithm selects attributes based on their true discriminative power rather than simply the number of possible values.
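A small sketch can make the effect of this normalization concrete. The example below compares a two-valued attribute with an identifier-like attribute that takes a different value for every example; both attributes and the labels are hypothetical toy data, and returning 0 when the intrinsic value is 0 is a convention chosen for this sketch.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy in bits of the empirical distribution of items."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(labels, values):
    """Gain ratio GR(S, A) = IG(S, A) / IV(S, A), Equations (5) and (6)."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, values):
        subsets.setdefault(value, []).append(label)
    info_gain = entropy(labels) - sum(
        len(s) / total * entropy(s) for s in subsets.values())
    # Equation (6): the intrinsic value is the entropy of the partition sizes.
    intrinsic_value = entropy(values)
    # Convention for this sketch: an attribute with one value has ratio 0.
    return info_gain / intrinsic_value if intrinsic_value > 0 else 0.0


labels = ["similar", "similar", "dissimilar", "dissimilar"]
binary_attr = ["a", "a", "b", "b"]  # two values, separates the classes
unique_id = [1, 2, 3, 4]            # one value per example

print(gain_ratio(labels, binary_attr))  # 1.0
print(gain_ratio(labels, unique_id))    # 0.5
```

Both attributes reach the same information gain of 1 bit, but the identifier-like attribute has an intrinsic value of 2 bits, so its gain ratio is halved.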

2.2. Inspiration for FLIP Design

ID3 and C4.5 inspire FLIP's design in three ways. First, both algorithms use information entropy to identify the most discriminative attributes, which motivates FLIP's use of entropy measures to quantify the discriminative power of character differences. Second, their threshold-based decision-making mechanisms provide a reference for our similarity threshold approach. Third, C4.5's gain ratio mechanism, which normalizes information gain by intrinsic information to prevent bias toward multi-valued attributes, provides valuable insights for our similarity measurement computation.

The broader applicability of entropy-based approaches has been validated across diverse domains in text processing and OCR optimization. Information-theoretic principles have proven effective in post-OCR statistical analysis [15], cross-language text processing [26], and concept extraction for structured text [27], demonstrating the broad applicability of entropy measures in handling similarity-based classification problems. Building upon these foundations, recent advances have explored more sophisticated applications, including large language model integration for post-OCR correction [23], natural language processing pipelines for enhanced text recognition [24], and reinforcement learning optimization for complex OCR scenes [25]. The reinforcement learning approaches offer a closely related precedent: they show how adaptive learning mechanisms can be integrated with entropy-based similarity analysis to improve OCR performance, which motivates FLIP's feedback learning component that continuously optimizes similarity computation through user interactions.

3. FLIP

FLIP implements the theoretical framework described above through the following three algorithmic components: Output Parsing, Error Correction, and Feedback Learning. Each component operates through mathematical calculations at the character encoding level, enabling efficient error recognition and correction without external dependencies (Figure 1).

Figure 1. FLIP (Feedback Learning-based Intelligent Plugin) architecture showing the three core components: Output Parsing for UTF-8 character encoding, Error Correction for similarity-based character analysis, and Feedback Learning for adaptive parameter optimization. The plugin is illustrated with an example of visually similar Chinese character pairs.

3.1. Output Parsing

The Output Parsing module converts OCR text output into mathematical representations for similarity computation. It processes character sequences through UTF-8 encoding analysis. This enables universal compatibility across different OCR architectures.

The module automatically captures OCR recognition output and performs character-level decomposition. Each character is individually extracted and prepared for encoding analysis. Complete traceability of character positions is maintained for precise error localization.

1: Given OCR output text, we first convert each character into its UTF-8 byte representation to enable mathematical analysis of character similarities. For any OCR-recognized character pair $(char_{1}, char_{2})$, the encoding process is as follows:

$\begin{matrix} B_{1} = \{b_{1,1}, b_{1,2}, \dots, b_{1,n}\}, \\ B_{2} = \{b_{2,1}, b_{2,2}, \dots, b_{2,m}\}, \end{matrix}$

(7)

where $b_{i,j}$ represents the j-th byte of the i-th character.
2: To quantify the encoding-level differences between characters, we perform XOR operations on corresponding bytes, enabling the precise measurement of character similarity at the binary level.

$D(b_{1,j}, b_{2,j}) = \mathrm{count\_bits}(b_{1,j} \oplus b_{2,j}),$

(8)

where the function $\mathrm{count\_bits}(\cdot)$ returns the number of 1 bits in the XOR result, and $\oplus$ denotes the bitwise XOR of the two bytes.
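The two parsing steps above can be sketched as follows. The function names are hypothetical, the character pair 己/已 is chosen here purely for illustration, and zero-padding the shorter byte sequence when the two encodings differ in length is an assumption of this sketch rather than something the text specifies.

```python
def utf8_bytes(char):
    """Return the UTF-8 byte values of a single character (Equation (7))."""
    return list(char.encode("utf-8"))


def bit_differences(char1, char2):
    """Per-byte count of differing bits, D(b_1j, b_2j) in Equation (8)."""
    b1, b2 = utf8_bytes(char1), utf8_bytes(char2)
    n = max(len(b1), len(b2))
    # Assumption: pad the shorter encoding with zero bytes so that
    # every byte position has a partner.
    b1 += [0] * (n - len(b1))
    b2 += [0] * (n - len(b2))
    # XOR each byte pair and count the 1 bits in the result.
    return [bin(x ^ y).count("1") for x, y in zip(b1, b2)]


print(utf8_bytes("己"))             # [229, 183, 177]
print(utf8_bytes("已"))             # [229, 183, 178]
print(bit_differences("己", "已"))  # [0, 0, 2]
```

The two characters are adjacent code points (U+5DF1 and U+5DF2), so their encodings differ only in the last byte, and only by two bits.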

3.2. Error Correction

The Error Correction module implements the core similarity computation algorithms. It combines information-theoretic measures with encoding-level analysis to identify and correct character recognition errors.

The module computes encoding-level similarities using byte-level differences from the output parsing component. Information-theoretic analysis enhances these computations by capturing the statistical properties of character relationships.

The algorithmic implementation encompasses several key computational steps, as follows:

1: We employ information entropy to quantify the uncertainty and similarity between character pairs, providing a principled approach to measuring character relationships. The entropy calculation captures the distribution of bit differences across byte positions, offering insights into the structural similarities between characters, as follows:

$H(x) = -\sum_{i=1}^{n} p_{i}(x) \log_{2} p_{i}(x),$

(9)

where n represents the number of bytes in the UTF-8 encoding (typically n = 3 for Chinese characters), and $p_{i}(x)$ represents the probability associated with the i-th byte position, that is, the fraction of its 8 bits that differ, calculated as follows:

$p_{i}(x) = \frac{D(b_{1,i}, b_{2,i})}{8}.$

(10)
2: For any given character pair requiring similarity analysis, let $t_{i}$ denote the target character (the character output by the OCR system) and $s_{i}$ denote the source character (a candidate correct character from a reference set or user correction). To balance different aspects of character similarity between these two characters, we implement the following weighted combination approach:

$sim(t_{i}, s_{i}) = w_{1} \cdot S_{1} + w_{2} \cdot S_{2},$

(11)

where $w_{1}$ and $w_{2}$ are weighting coefficients, and $S_{1}$ and $S_{2}$ represent two different similarity measure components, expressed as follows:

$S_{1} = 1 - \frac{errorbits}{n \times 8},$

(12)

$S_{2} = e^{-\frac{H(x)}{scale}},$

(13)

where $errorbits$ represents the total number of differing bits between the character pair, calculated as follows:

$errorbits = \sum_{i=1}^{n} D(b_{1,i}, b_{2,i}).$

(14)

Here n is the number of bytes in the UTF-8 encoding, determined by max(len(B₁), len(B₂)); $H(x)$ is the information entropy calculated from Equation (9); and scale is a normalization parameter for entropy-based similarity. The factor 8 in Equation (12) represents the number of bits per byte, ensuring the proper normalization of the bit-level similarity measure.
3: Character pairs are classified as visually similar based on the following threshold:

$isSimilar(t_{i}, s_{i}) = \begin{cases} 1, & \text{if } sim(t_{i}, s_{i}) \geq θ, \\ 0, & \text{if } sim(t_{i}, s_{i}) < θ. \end{cases}$

(15)

When $isSimilar(t_{i}, s_{i}) = 1$, the character pair is flagged for potential correction. The threshold $θ$ is configured according to the characteristics of different OCR models and specific application requirements.
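The three steps above can be put together in a short sketch. It uses the parameter values reported later in Section 4.4 as defaults. Several details are assumptions of this sketch and not statements from the text: the function names, the zero-padding of encodings of unequal length, and the convention that byte positions with no differing bits contribute nothing to the entropy (0 · log 0 = 0). The character pair 己/已 is again only an illustration.

```python
import math


def bit_differences(char1, char2):
    """Per-byte differing-bit counts D from Equation (8)."""
    b1 = list(char1.encode("utf-8"))
    b2 = list(char2.encode("utf-8"))
    n = max(len(b1), len(b2))
    b1 += [0] * (n - len(b1))  # assumption: zero-pad the shorter encoding
    b2 += [0] * (n - len(b2))
    return [bin(x ^ y).count("1") for x, y in zip(b1, b2)]


def similarity(target, source, w1=0.6, w2=0.4, scale=2.0):
    """Weighted similarity of Equation (11)."""
    diffs = bit_differences(target, source)
    n = len(diffs)
    probs = [d / 8 for d in diffs]                      # Equation (10)
    # Equation (9), skipping p = 0 terms (assumed convention 0 * log 0 = 0).
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    s1 = 1 - sum(diffs) / (n * 8)                       # Equations (12), (14)
    s2 = math.exp(-h / scale)                           # Equation (13)
    return w1 * s1 + w2 * s2


def is_similar(target, source, theta=0.65):
    """Threshold decision of Equation (15)."""
    return similarity(target, source) >= theta


print(round(similarity("己", "已"), 4))  # 0.8615
print(is_similar("己", "已"))            # True
```

For this pair the per-byte differences are [0, 0, 2], so errorbits = 2, S₁ = 1 − 2/24, H = 0.5 and S₂ = e^(−0.25), which gives a score well above θ = 0.65.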

3.3. Feedback Learning

The Feedback Learning module implements gradient-based parameter optimization algorithms that adapt similarity computation based on user corrections. It maintains statistical records of user interactions and employs mean squared error minimization to optimize similarity function parameters.

Unlike static post-processing approaches, this mechanism enables FLIP to continuously refine its capabilities through user interaction, creating a personalized post-processing system.

The algorithmic foundation encompasses several critical computational components, as follows:

1: The system maintains comprehensive statistics for each character pair encountered, as follows:

${Num}_{{correct}_{i}} = {Num}_{{correct}_{i}} + 1,$

(16)

${Num}_{{wrong}_{i}} = {Num}_{{wrong}_{i}} + 1,$

(17)

${Num}_{{total}_{i}} = {Num}_{{correct}_{i}} + {Num}_{{wrong}_{i}} .$

(18)

These counters track correct identifications, incorrect identifications, and total feedback instances for each character pair.
2: The system employs a mean squared error loss function to measure the discrepancy between predicted similarity scores and actual user feedback as follows:

$L(t_{i}, s_{i}) = \frac{1}{N_{total_{i}}} \sum_{k=1}^{N_{total_{i}}} (y_{k} - sim(t_{i}, s_{i}))^{2},$

(19)

where $N_{total_{i}}$ is the total feedback count ${Num}_{{total}_{i}}$ from Equation (18), and $y_{k}$ represents the ground truth similarity label (1 for similar pairs, 0 for dissimilar pairs) based on user feedback.
3: The similarity function parameters are updated using gradient descent optimization to minimize the loss function as follows:

${sim}_{new}(t_{i}, s_{i}) = sim(t_{i}, s_{i}) - α \frac{\partial L}{\partial sim(t_{i}, s_{i})},$

(20)

where $α$ is the learning rate ($0 < α \leq 1$), and the gradient is computed as follows:

$\frac{\partial L}{\partial sim(t_{i}, s_{i})} = -\frac{2}{N_{total_{i}}} \sum_{k=1}^{N_{total_{i}}} (y_{k} - sim(t_{i}, s_{i})).$

(21)

Because the update steps against the gradient, positive feedback ($y_{k} = 1$) raises the similarity score and negative feedback ($y_{k} = 0$) lowers it.
4: Updated parameters undergo validation against a feedback-specific threshold to ensure robust performance, expressed as follows:

$isValid(t_{i}, s_{i}) = \begin{cases} 1, & \text{if } {sim}_{new}(t_{i}, s_{i}) \geq θ_{feedback}, \\ 0, & \text{if } {sim}_{new}(t_{i}, s_{i}) < θ_{feedback}, \end{cases}$

(22)

where $θ_{feedback} \geq θ$ to maintain conservative correction behavior.
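The feedback loop described in steps 1 to 4 can be sketched as a small class. This is one possible reading, not the authors' implementation. The class name `PairFeedback`, the mapping of a positive label (y = 1) to the "correct" counter, and the choice to apply one update per feedback event are assumptions. The sketch subtracts α times the gradient of Equation (21), which is the descent direction that reduces the loss, and the starting score of 0.65 is a hypothetical value.

```python
class PairFeedback:
    """Feedback state for one character pair (hypothetical helper class)."""

    def __init__(self, sim, alpha=0.025, theta_feedback=0.75):
        self.sim = sim
        self.alpha = alpha
        self.theta_feedback = theta_feedback
        self.num_correct = 0  # Equation (16)
        self.num_wrong = 0    # Equation (17)
        self.labels = []      # user labels y_k: 1 = similar, 0 = dissimilar

    @property
    def num_total(self):
        """Total feedback count, Equation (18)."""
        return self.num_correct + self.num_wrong

    def loss(self):
        """Mean squared error of Equation (19)."""
        return sum((y - self.sim) ** 2 for y in self.labels) / self.num_total

    def gradient(self):
        """Gradient of the loss with respect to sim, Equation (21)."""
        return -2.0 / self.num_total * sum(y - self.sim for y in self.labels)

    def record(self, y):
        """Store one feedback label and take one gradient descent step."""
        if y == 1:
            self.num_correct += 1  # assumption: y = 1 counts as "correct"
        else:
            self.num_wrong += 1
        self.labels.append(y)
        self.sim -= self.alpha * self.gradient()
        return self.sim

    def is_valid(self):
        """Validation against the feedback threshold, Equation (22)."""
        return self.sim >= self.theta_feedback


pair = PairFeedback(sim=0.65)                 # hypothetical starting score
history = [pair.record(1) for _ in range(10)]  # ten positive feedback events
print(round(pair.sim, 4))  # 0.7904
print(pair.is_valid())     # True
```

Under these assumptions each positive event shrinks the gap to 1 by a factor of 1 − 2α = 0.95, so a pair starting at 0.65 crosses θ_feedback = 0.75 on the seventh event and ends near 0.79 after ten.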

4. Experiment

To evaluate the effectiveness of the proposed FLIP plugin, we conduct comprehensive experiments on multiple OCR models. Our experimental setup aims to demonstrate the improvement in OCR accuracy achieved through the three-component architecture of output parsing, error correction, and feedback learning.

4.1. Datasets and Metric

Dataset: Our experiments are conducted on the Chinese-OCR-10K dataset, which consists of 10,000 Chinese document images systematically selected from the publicly available Chinese OCR dataset by YCG09 (available at https://github.com/YCG09/chinese_ocr, accessed on 15 June 2025). The source dataset contains approximately 3.64 million synthetic Chinese text images with uniform 280 × 32 pixel resolution, covering 5990 characters including Chinese characters, English letters, digits, and punctuation marks. Each image contains exactly 10 characters with realistic distortions such as font variations, blur effects, and perspective transformations to simulate real-world document conditions. Our Chinese-OCR-10K subset was systematically sampled from the validation set to ensure comprehensive coverage of character combinations and distortion types while maintaining computational feasibility for extensive comparative analysis across multiple OCR models.

We utilize this dataset to establish baseline performance for each OCR model, followed by the systematic evaluation of our FLIP plugin’s improvement capabilities. The dataset provides a robust foundation for assessing the effectiveness of our post-processing optimization approach across diverse Chinese text recognition scenarios.

Evaluation Metrics: We employ the following metrics to evaluate performance improvements:

Precision: The ratio of correctly recognized characters to all characters output by the system, expressed as follows:

$Precision = \frac{TP}{TP + FP}.$

(23)
Chinese Precision: Precision calculated only over Chinese characters, focusing on the core OCR challenge, as follows:

$Chinese\ Precision = \frac{TP_{Chinese}}{TP_{Chinese} + FP_{Chinese}}.$

(24)
Recall: The ratio of correctly recognized characters to all ground truth characters, expressed as follows:

$Recall = \frac{TP}{TP + FN}.$

(25)
F1 Score: The harmonic mean of precision and recall, providing a balanced evaluation metric, expressed as follows:

$F1\ Score = \frac{2 \times TP}{2 \times TP + FP + FN},$

(26)

where $TP$ represents true positives (correctly recognized characters), $FP$ represents false positives (incorrectly recognized characters), and $FN$ represents false negatives (missed characters in ground truth).
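As a quick arithmetic check of Equations (23), (25), and (26), the sketch below computes the metrics from character-level counts. The counts are hypothetical and are not results from this study. Chinese Precision in Equation (24) uses the same `precision` function applied to counts restricted to Chinese characters. Returning 0 for an empty denominator is a convention of this sketch.

```python
def precision(tp, fp):
    """Equation (23); also Equation (24) when given Chinese-only counts."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Equation (25)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    """Equation (26)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for illustration only.
tp, fp, fn = 90, 10, 5
p, r, f1 = precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9 0.9474 0.9231
```

The count-based form of the F1 Score agrees with the harmonic mean 2PR / (P + R) of the two values above.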

4.2. Experimental Setup and Process

Our experimental evaluation follows a systematic multi-stage approach, aiming to assess the effectiveness of each component and the overall performance of the plugin. We also designed experiments to compare with traditional OCR post-processing methods. The structure of the experimental process is as follows:

4.2.1. Baseline Establishment

Initially, we apply different OCR models to recognize our selected public Chinese dataset and compare the results with ground truth labels to establish baseline performance for each model. This baseline evaluation provides the foundational performance metrics against which our plugin improvements are measured.

4.2.2. Output Parsing Stage

The recognition results from each OCR model are input into our output parsing module to generate preprocessed data suitable for the error correction module. This preprocessing transforms the raw OCR output into UTF-8 encoded character representations and identifies potential character pairs requiring correction analysis. We refer to this processed data as the “character pairs to be corrected dataset”.

It is important to note that practical Chinese documents frequently contain mixed character types including Chinese characters, English letters, Arabic numerals, and punctuation marks within the same image. To address this mixed-character environment, FLIP implements a selective processing strategy that focuses optimization efforts specifically on Chinese characters while preserving the accuracy of non-Chinese characters.

The plugin employs Unicode range-based character classification, where characters are classified as Chinese if they fall within the Unicode range from U+4E00 to U+9FFF. Only characters identified as Chinese undergo the complete FLIP processing pipeline, including similarity analysis, error correction, and feedback learning. Characters outside this range, including English letters, Arabic numerals, and punctuation marks, are automatically excluded from the correction process and retain their original OCR recognition results. This selective approach ensures that computational resources are concentrated on the most challenging character category where improvement is most needed, while maintaining the integrity of typically more accurate non-Chinese character recognition.

4.2.3. Initial Error Correction Stage

The character pairs dataset is input into the error correction module, which applies our mathematical similarity analysis algorithms to identify and correct recognition errors. This first optimization stage primarily targets character pairs that are close in UTF-8 code point positions and exhibit structural similarities detectable through our encoding-level analysis. Characters with calculated similarity scores exceeding the initial threshold ($θ = 0.65$) are flagged for correction, resulting in the “first optimization dataset”.

4.2.4. Feedback Learning Stage

We apply the feedback learning module to the first optimization results through both simulated and real-user evaluation approaches to comprehensively validate the adaptive learning capabilities.

Simulated Feedback Evaluation: For systematic experimental comparison, we simulate realistic user feedback scenarios by providing ten consecutive positive feedback instances for character pairs requiring correction. This controlled simulation represents typical user correction patterns in practical OCR post-processing applications. Through the feedback algorithm, the similarity scores of character pairs undergo adaptive adjustment based on these user corrections. When the updated similarities reach our second threshold ($θ_{feedback} = 0.75$), which is more stringent than the initial threshold, these character pairs are incorporated into the final optimization dataset.

Real-User Evaluation: To evaluate effectiveness and enterprise deployment feasibility, we conducted a comprehensive user study using a contract comparison software system integrated with the FLIP plugin. The evaluation examined four key performance metrics: recognition accuracy improvements, computational complexity, time efficiency, and noise-robustness under actual usage conditions.

4.2.5. Traditional Method Comparison

To evaluate the effectiveness of our FLIP algorithm against traditional post-processing approaches, we implement a dictionary-based correction method as a comparative baseline. This approach constructs correction mappings by analyzing character replacement pairs from the OCR output, retaining only pairs that occur at least twice to ensure reliability, and creating a static lookup table that directly maps frequently misrecognized characters to their correct counterparts. During correction, the system queries this dictionary for each misrecognized character and applies the corresponding correction if a mapping exists; otherwise, it retains the original character unchanged. We evaluate both methods using identical datasets and metrics, including correction accuracy, coverage rate, and character-level improvement counts, enabling the direct comparison of the adaptive learning capabilities of our feedback-enhanced approach against this traditional rule-based method.
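A dictionary baseline of this kind might look like the sketch below. The function names, the sample replacement pairs, and the tie-breaking rule (keep the most frequent correction when a character has several candidates) are assumptions made for illustration.

```python
from collections import Counter


def build_correction_table(replacement_pairs, min_count=2):
    """Build a static lookup table from (wrong, right) character pairs.

    Only pairs seen at least min_count times are kept. If one wrong
    character has several candidates, the most frequent wins (an
    assumption of this sketch).
    """
    best = {}
    for (wrong, right), n in Counter(replacement_pairs).items():
        if n >= min_count and n > best.get(wrong, (None, 0))[1]:
            best[wrong] = (right, n)
    return {wrong: right for wrong, (right, _) in best.items()}


def apply_table(text, table):
    """Replace every character that has a mapping; keep the rest unchanged."""
    return "".join(table.get(c, c) for c in text)


# Hypothetical replacement pairs observed in OCR output.
pairs = [("已", "己"), ("已", "己"), ("末", "未")]
table = build_correction_table(pairs)
print(table)                             # {'已': '己'}
print(apply_table("自已和末来", table))  # 自己和末来
```

The pair seen only once is filtered out, so 末 stays uncorrected. The table is also context-free: it would rewrite every occurrence of 已, including correct ones, which is the kind of rigidity the comparison is meant to expose.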

4.2.6. Final Performance Evaluation

The final performance evaluation systematically validates FLIP’s effectiveness through multiple comparative analyses. We compare the final optimization results with baseline performance across all tested OCR models to quantify overall improvement effects. Ablation studies are conducted by evaluating plugin performance without the feedback learning component to isolate the contribution of adaptive learning mechanisms. Additionally, we perform comparative evaluation against the traditional dictionary-based correction method using identical datasets and metrics. Finally, we analyze the feasibility of the plugin based on the evaluation results of real users.

4.3. Baseline

We evaluate our FLIP plugin on the following four state-of-the-art OCR models representing different technical approaches:

PaddleOCR V3: An open-source OCR toolkit developed by Baidu employing CRNN (Convolutional Recurrent Neural Network) architecture with MobileNetV3 backbone, Bi-directional LSTM for sequence processing, and CTC loss for alignment. We utilize the mobile version, which supports lightweight local client development and provides practical solutions for real-world text recognition applications.
PaddleOCR V4: An enhanced version of Baidu’s OCR toolkit with improved CRNN architecture featuring advanced knowledge distillation strategies and enhanced multilingual recognition capabilities. We utilize the mobile version, which offers better text recognition capabilities for lightweight local client development.
Qwen-VL-OCR: A vision-language model developed by Alibaba utilizing Vision Transformer (ViT) architecture with pre-trained weights from OpenClip’s ViT-bigG as visual encoder and Qwen language model foundation. It supports processing complex documents, including handwritten text, tables, charts, and multilingual content.
Doubao-1.5-v-pro: A multimodal AI model developed by ByteDance employing sparse Mixture of Experts (MoE) architecture that activates only a subset of parameters during inference for enhanced computational efficiency. It excels in OCR, document parsing, visual reasoning, and interactive applications.

The baseline results are presented in Table 1, showing the original performance of each OCR model before applying our optimization plugin.

Table 1. Baseline performance of different OCR models on the test dataset.

4.4. Parameter Settings

The key parameters for FLIP are configured as follows:

Similarity weight coefficients: $w_{1} = 0.6$ , $w_{2} = 0.4$
Entropy scale parameter: $scale = 2.0$
Similarity threshold: $θ = 0.65$
Learning rate: $α = 0.025$
Feedback threshold: $θ_{feedback} = 0.75$

The similarity threshold $θ = 0.65$ is selected based on experimental analysis revealing that most visually similar Chinese character pairs fall within this range during initial optimization dataset creation, ensuring comprehensive capture of potentially confusable characters while maintaining correction precision. The learning rate $α = 0.025$ provides robust convergence behavior that balances adaptation speed with stability: it is sufficiently responsive to user feedback without immediately reaching the secondary threshold $θ_{feedback} = 0.75$, while avoiding overly slow adaptation that would impair practical deployment effectiveness. As demonstrated in Figure 2, this learning rate achieves optimal convergence characteristics compared to both higher rates that may cause instability and lower rates that result in insufficient adaptation speed.

Figure 2. Learning rate comparison analysis showing the convergence behavior of different learning rates ($α$) in the feedback learning mechanism. The baseline similarity value is set to the median similarity of character pairs requiring feedback correction. The analysis demonstrates how similarity scores evolve with increasing feedback iterations.

4.5. Experimental Results and Analysis

This section reports results for the complete FLIP plugin and for an ablation in which the feedback module is removed. Following the experimental steps described above, the complete plugin improves every tested OCR model, with notable gains in Chinese character recognition accuracy. These findings support the broad applicability and effectiveness of our approach.

4.5.1. Overall Performance Improvements

Table 2 presents the performance after applying the complete FLIP plugin.

Table 2. OCR performance results with complete FLIP plugin.

The results demonstrate significant improvements across all tested OCR models. PaddleOCR V4 shows the most substantial enhancement, with precision rising by 10.37 percentage points (from 81.24% to 91.61%), indicating that the plugin effectively addresses character recognition errors in lower-performing models. High-performance models like Qwen-VL-OCR and Doubao-1.5-v-pro achieve Chinese character precision rates of 99.42% and 99.33%, respectively, demonstrating the plugin’s capability to fine-tune already accurate systems. The consistent improvements in F1 Scores across all models confirm the balanced enhancement of both precision and recall metrics.

4.5.2. Ablation Study Results

We evaluate the feedback learning component by testing the plugin with only the core error correction module. The results are shown in Table 3.

Table 3. Performance comparison in the ablation study: OCR models with core error correction module only.

Comparing these core-only results with the complete-plugin results in Table 2 indicates that the adaptive learning mechanism captures domain-specific character similarities through user interactions, with greater impact on systems requiring more substantial correction.

The ablation study results, combined with the visual evidence in Figure 3, reveal the nuanced effectiveness of feedback learning across different OCR system types. The numerical comparison between Table 2 and Table 3 shows that PaddleOCR V3 gains 2.46% additional precision through feedback learning (from 91.75% to 94.21%), while PaddleOCR V4 achieves a substantial 4.66% improvement (from 86.95% to 91.61%). These improvements are clearly visualized in Figure 3, where the progressive enhancement from baseline through core correction to complete plugin demonstrates the additive value of each component.

Figure 3. Performance comparison of OCR models with FLIP plugin optimization. (a) PaddleOCR V3, (b) PaddleOCR V4, (c) Qwen-VL-OCR, and (d) Doubao-1.5-v-pro.

For high-performance models, feedback learning provides more modest but consistent improvements of 0.45% for Qwen-VL-OCR and 0.55% for Doubao-1.5-v-pro. While these gains appear small in absolute terms, they represent significant achievements when optimizing systems already operating near performance limits. The visual comparison effectively illustrates this pattern, showing minimal overall precision changes but remarkable Chinese character precision, reaching 99.42% and 99.33%, respectively, and demonstrating the plugin’s particular effectiveness for logographic character recognition.

Figure 3 reveals distinct improvement trajectories that reflect each model’s architectural characteristics. PaddleOCR models show dramatic visual improvements across all metrics, with clear step-wise enhancements particularly striking for PaddleOCR V4, where feedback learning elevates the system from moderate to high-performance category. In contrast, Qwen-VL-OCR and Doubao-1.5-v-pro demonstrate a different optimization pattern where Chinese character accuracy metrics show the most significant enhancement, confirming that the plugin successfully addresses its primary design objective even for highly optimized systems.

The comprehensive metric display consistently shows that precision improvements are accompanied by balanced recall and F1 Score enhancements, with the ablation study confirming recall improvements ranging from 0.53% to 1.68% across models. This balanced improvement pattern indicates that feedback learning enhances overall recognition quality rather than optimizing individual metrics at the expense of others. The consistency of improvements across architecturally diverse OCR systems validates the universal applicability of the adaptive learning mechanism and confirms FLIP’s effectiveness for practical deployment in Chinese document processing applications.
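The gains quoted in this section can be reproduced directly from Tables 1 to 3. The following Python sketch copies the table values and computes, per model and metric, the total gain of the complete plugin over the baseline and the share contributed by feedback learning (complete plugin minus core correction only). The dictionary and function names are illustrative.

```python
# Metric order: precision, Chinese precision, recall, F1 score (all in %).
METRICS = ("precision", "chinese_precision", "recall", "f1")

BASELINE = {  # Table 1
    "PaddleOCR V3": (88.43, 94.37, 54.03, 67.08),
    "PaddleOCR V4": (81.24, 93.92, 29.34, 43.11),
    "Qwen-VL-OCR": (77.61, 97.57, 91.84, 84.12),
    "Doubao-1.5-v-pro": (76.60, 97.64, 92.73, 83.90),
}
FULL = {  # Table 2: complete FLIP plugin
    "PaddleOCR V3": (94.21, 98.58, 57.56, 71.46),
    "PaddleOCR V4": (91.61, 98.38, 33.08, 48.60),
    "Qwen-VL-OCR": (78.92, 99.42, 93.39, 85.55),
    "Doubao-1.5-v-pro": (77.77, 99.33, 94.15, 85.18),
}
CORE_ONLY = {  # Table 3: core error correction module only
    "PaddleOCR V3": (91.75, 96.79, 56.06, 69.60),
    "PaddleOCR V4": (86.95, 96.37, 31.40, 46.13),
    "Qwen-VL-OCR": (78.47, 98.79, 92.86, 85.06),
    "Doubao-1.5-v-pro": (77.22, 98.54, 93.48, 84.58),
}


def gains(before: dict, after: dict) -> dict:
    """Per-model, per-metric difference (after - before) in percentage points."""
    return {
        model: {
            metric: round(a - b, 2)
            for metric, b, a in zip(METRICS, before[model], after[model])
        }
        for model in before
    }


total_gain = gains(BASELINE, FULL)        # complete plugin vs. baseline
feedback_gain = gains(CORE_ONLY, FULL)    # contribution of feedback learning

for model in FULL:
    print(model,
          "total precision gain:", total_gain[model]["precision"],
          "feedback precision gain:", feedback_gain[model]["precision"],
          "feedback recall gain:", feedback_gain[model]["recall"])
```

The computed feedback contributions to precision range from 0.45 to 4.66 points and those to recall from 0.53 to 1.68 points, matching the figures stated in the text, while the total precision gains range from 1.17 to 10.37 points.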

4.5.3. Traditional Method Comparison Results

The comparative evaluation against traditional dictionary-based correction methods demonstrates the fundamental advantages of FLIP’s adaptive feedback learning approach. Table 4 shows that FLIP consistently outperforms dictionary-based methods across all tested OCR models.

Table 4. Performance comparison between FLIP and traditional dictionary-based correction methods.

The results reveal the critical limitations of static dictionary approaches. Traditional methods can only correct pre-defined error patterns that have been manually observed and catalogued, severely constraining their effectiveness when confronted with novel recognition errors. Dictionary-based systems remain fundamentally static after deployment, unable to adapt to new character confusion patterns or domain-specific terminology without manual intervention and system updates.

In contrast, FLIP’s feedback learning mechanism achieves dynamic adaptation by learning from user interactions rather than relying on fixed mappings. This adaptive capability allows the system to continuously refine its correction behavior based on real-world usage patterns and user feedback, enabling it to handle new error types and domain-specific character relationships that were not present in any pre-defined correction dictionary.

4.5.4. Real-User Evaluation

Figure 4 illustrates the most frequently misrecognized Chinese characters encountered by the OCR system during evaluation. The feedback interaction interface allows users to review the FLIP plugin’s correction results and submit their feedback. We evaluate the FLIP plugin across several key dimensions in actual user scenarios.

Figure 4. Actual user interface demonstration showing seamless integration of FLIP plugin in contract document processing workflow with immediate feedback capabilities and intuitive correction mechanisms.

Recognition accuracy improvements: Under sustained user feedback, FLIP’s accuracy continues to improve. The system learns from user corrections and adapts to domain-specific terminology over time. As the examples in Figure 4 show, the handling of these frequently misrecognized characters shifts toward the corrections that users prefer as the plugin is used.

Computational complexity: FLIP is designed to be lightweight and can run efficiently on standard personal computers without requiring high-end hardware or internet connectivity. The plugin has minimal memory and storage requirements, making it suitable for deployment on individual client devices. This design ensures low operational costs while maintaining data privacy and providing real-time post-processing capabilities.

Noise-robustness: When high-resolution images are input to OCR models, the recognition results are typically more accurate. However, when images are blurry, low-resolution, or contain noise interference, OCR models are prone to misrecognition, particularly for visually similar Chinese characters. These low-quality image conditions are precisely the scenarios where the FLIP plugin can play an important role. Since FLIP is based on character encoding-level similarity analysis, it can effectively identify and correct character confusion errors caused by image quality issues. When OCR systems produce misidentifications of similar characters while processing blurry or low-resolution images, FLIP’s mathematical similarity algorithms can detect these potential errors and provide accurate correction suggestions, thereby significantly improving overall recognition accuracy under various image quality conditions.

Time efficiency: In testing with three users, the algorithm demonstrated high computational efficiency. Error detection and correction for an entire document complete within seconds, and in some cases milliseconds, because the character-level similarity analysis is computationally cheap. Errors are therefore flagged without perceptible delay in the user workflow, and correction suggestions appear as users review documents, whereas traditional manual verification is significantly slower.

5. Limitations

While FLIP demonstrates significant improvements in Chinese OCR accuracy across multiple systems, several inherent limitations must be acknowledged to provide a comprehensive understanding of the plugin’s applicability and potential constraints.

5.1. Homograph Error Challenges

One fundamental limitation of FLIP lies in its handling of homograph errors, where visually identical characters possess different semantic meanings depending on contextual usage. Our current approach relies primarily on character-level similarity analysis through UTF-8 encoding, which cannot distinguish between semantically different uses of the same character. For instance, certain Chinese logographs can represent completely different concepts despite identical visual appearance, such as characters that may function as both verbs and nouns with distinct pronunciations and meanings in different contexts. FLIP’s encoding-based similarity computation treats these as identical characters, potentially missing context-dependent correction opportunities where OCR systems correctly recognize the character form but users require semantic validation.

This limitation becomes particularly pronounced in legal and technical documents where precise semantic interpretation is critical. While our feedback learning mechanism can partially address domain-specific usage patterns through user corrections, it cannot fundamentally resolve the ambiguity inherent in homographic characters without incorporating sophisticated natural language understanding capabilities, which would contradict our design philosophy of maintaining independence from external linguistic resources.

5.2. Scalability to Non-Chinese Languages

Although FLIP’s UTF-8 encoding-based approach theoretically supports universal character analysis, practical scalability to non-Chinese languages presents several challenges that limit its broader applicability. The plugin’s design assumptions and parameter configurations are specifically optimized for Chinese character characteristics, including stroke-based structural similarities and the logographic nature of Chinese writing systems.

Alphabetic writing systems such as Latin, Cyrillic, or Arabic scripts exhibit fundamentally different error patterns and character relationships compared to Chinese logographs. The similarity measures and threshold parameters that prove effective for distinguishing Chinese character pairs may not transfer directly to alphabetic systems where character confusions often involve letter substitutions rather than structural component variations.

Furthermore, our experimental validation exclusively employed Chinese text datasets and OCR systems optimized for Chinese character recognition. The plugin’s performance on non-Chinese languages remains empirically unvalidated, and the feedback learning mechanisms may require substantial recalibration to accommodate different linguistic structures and error patterns characteristic of non-logographic writing systems.

5.3. User-Specific Optimization Variability

A significant practical limitation emerges from the inherent variability in user-specific optimization effects as application contexts and usage patterns evolve over time. While our experimental evaluation demonstrates consistent improvements across different OCR models using standardized datasets, real-world deployment scenarios introduce dynamic factors that can substantially influence optimization effectiveness.

Individual users’ correction patterns, domain-specific terminology preferences, and evolving application requirements create personalized optimization trajectories that may diverge significantly from our standardized experimental results. As users interact with different document types, modify their correction criteria, or adapt their workflow practices, the plugin’s learned parameters may become misaligned with current requirements, potentially reducing optimization effectiveness below experimentally observed levels.

Moreover, the feedback learning mechanism’s dependency on user interaction quality introduces variability that cannot be statistically guaranteed. Inconsistent user corrections, temporary shifts in application focus, or changes in organizational requirements can influence the adaptive optimization process in ways that may not reflect the controlled experimental conditions under which FLIP’s performance was validated.

This limitation suggests that while FLIP provides substantial improvements under controlled conditions, practical deployment may require ongoing monitoring and potential parameter readjustment to maintain optimal performance as user requirements and application contexts evolve. The plugin’s effectiveness should therefore be understood as context-dependent rather than universally guaranteed across all possible deployment scenarios.

6. Conclusions

This study presents FLIP (Feedback Learning-based Intelligent Plugin), contributing significant mathematical innovations to optical character recognition post-processing. The work establishes a rigorous mathematical framework for character similarity quantification. It combines information-theoretic principles with encoding-level analysis.

The primary mathematical advancement applies information entropy theory to character similarity computation. We build upon the established success of ID3 and C4.5 algorithms. Character similarity is formalized as a binary classification problem with entropy-based uncertainty quantification. UTF-8 character encodings are transformed into measurable bit-level differences through XOR operations. This enables the precise quantification of structural relationships between logographic characters.

We introduce a mathematically rigorous weighted similarity measure. It combines bit-level differences with entropy-based structural analysis. The framework employs exponential decay functions and normalized similarity coefficients. This captures both local encoding variations and global structural patterns. The dual-scale approach provides complete mathematical characterization of character relationships.

The feedback learning mechanism implements gradient-based parameter optimization. Mean squared error minimization enables real-time adaptation of similarity functions. The convergence properties demonstrate robust mathematical foundations for continuous learning. We establish a mathematical framework for threshold-based classification. It incorporates both initial similarity thresholds and adaptive feedback thresholds.

Comprehensive experiments across four OCR architectures validate the mathematical framework’s effectiveness. Results demonstrate consistent improvements with precision gains ranging from 1.17% to 10.37%. Ablation studies mathematically isolate the contribution of feedback learning. Additional improvements from 0.45% to 4.66% are shown across different systems. The mathematical relationship between thresholds ensures conservative correction behavior.

The information-theoretic foundation is, in principle, generalizable beyond Chinese characters, since entropy-based similarity computation can be applied to any UTF-8 encoded character system; as noted in Section 5.2, this generality has not yet been validated empirically. It provides a theoretical basis for broader character recognition applications. The algorithms achieve computational efficiency through encoding-level operations and eliminate dependency on external linguistic resources.

Future research includes extending the entropy-based framework to multi-dimensional character spaces, developing more advanced similarity learning for complex character relationships, and establishing convergence guarantees for adaptive learning systems. Together, these contributions establish FLIP as a theoretically grounded solution that advances information-theoretic approaches to character recognition and computational character analysis.

Author Contributions

Conceptualization, X.T.; methodology, X.T.; software, X.T. and Y.W.; validation, X.T., Y.H. and Y.W.; formal analysis, X.T. and Y.J.; investigation, X.T. and Y.J.; data curation, X.T., Y.H. and Y.J.; writing—original draft preparation, X.T.; writing—review and editing, X.T., Y.H. and Y.W.; resources, Y.W.; visualization, Y.W.; supervision, Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2024 Science and Technology Innovation Project of Anhui Province (202423k09020031); the Special Fund for Anhui Characteristic Agriculture Industry Technology System (2021–2025); and the Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information Open Fund Project (BDSYS2021003).

Data Availability Statement

The datasets used and/or analyzed in the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure 1. FLIP (Feedback Learning-based Intelligent Plugin) architecture showing the three core components: Output Parsing for UTF-8 character encoding, Error Correction for similarity-based character analysis, and Feedback Learning for adaptive parameter optimization. The plugin is illustrated with an example of visually similar Chinese character pairs.

Table 1. Baseline performance of different OCR models on the test dataset.

OCR Models	Precision (%)	Chinese Precision (%)	Recall (%)	F1 Score (%)
PaddleOCR V3	88.43	94.37	54.03	67.08
PaddleOCR V4	81.24	93.92	29.34	43.11
Qwen-VL-OCR	77.61	97.57	91.84	84.12
Doubao-1.5-v-pro	76.60	97.64	92.73	83.90

Table 2. OCR performance results with complete FLIP plugin.

OCR Models	Precision (%)	Chinese Precision (%)	Recall (%)	F1 Score (%)
PaddleOCR V3	94.21	98.58	57.56	71.46
PaddleOCR V4	91.61	98.38	33.08	48.60
Qwen-VL-OCR	78.92	99.42	93.39	85.55
Doubao-1.5-v-pro	77.77	99.33	94.15	85.18

Table 3. Performance comparison in the ablation study: OCR models with core error correction module only.

OCR Models	Precision (%)	Chinese Precision (%)	Recall (%)	F1 Score (%)
PaddleOCR V3	91.75	96.79	56.06	69.60
PaddleOCR V4	86.95	96.37	31.40	46.13
Qwen-VL-OCR	78.47	98.79	92.86	85.06
Doubao-1.5-v-pro	77.22	98.54	93.48	84.58

Table 4. Performance comparison between FLIP and traditional dictionary-based correction methods.

OCR Models	FLIP Chinese Accuracy (%)	Dictionary-Based Chinese Accuracy (%)	Improvement (%)
PaddleOCR V3	98.6	55.5	+43.1
PaddleOCR V4	98.4	30.5	+67.9
Qwen-VL-OCR	99.4	92.4	+7.0
Doubao-1.5-v-pro	99.3	93.1	+6.2

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
