1. Introduction
As digital transformation accelerates globally, enterprises and financial institutions are confronted with growing challenges in managing and processing receipt-like financial documents. According to a 2022 industry report by McKinsey, over 70% of global enterprises have either deployed or are planning to deploy intelligent document processing systems to cope with the growing volume of transaction data and compliance audit requirements [
1]. In China, as enterprises expand, financial management has become increasingly complex, involving multi-level approval processes, cross-departmental coordination, tax regulation compliance, and the handling of diverse document formats such as electronic invoices, reimbursement slips, and transportation receipts. Traditional manual document processing methods may no longer fully meet the increasing demands of modern office operations and business expansion. Receipts carry critical information such as transaction amounts, dates, company details, and product specifications. These details are essential for financial audits, tax compliance, and daily operational management [
2]. However, traditional manual data entry and review processes are inefficient and error-prone, with mistakes such as mismatched invoice codes and missed entries during reconciliation significantly impacting operational efficiency.
To address these challenges, the demand for automated receipt processing has become increasingly urgent. With the rapid development of computer vision and deep learning technologies, automated document recognition systems have emerged as a mainstream solution [
3]. By scanning and processing document images, these systems can quickly extract textual information, automatically classify it, and input it into financial management systems, thereby significantly reducing manual operations, improving work efficiency, and minimizing errors. More importantly, automated document recognition technology not only alleviates the workload of financial personnel but also enhances the transparency and compliance of financial reimbursement and auditing processes, enabling enterprises to respond more swiftly and make more informed decisions in a competitive market. However, despite the immense potential of financial document automation technology, its practical application still faces numerous technical challenges, particularly in recognizing complex document layouts, extracting handwritten text, and effectively integrating multimodal information [
4].
Recent studies in financial document processing have highlighted the critical importance of precision. In financial contexts, even small errors in data extraction—incorrect transaction amounts or misinterpreted tax numbers—can result in compliance issues and financial discrepancies. Research by Tomar et al. [
5] emphasizes the need for high-precision models capable of handling complex receipt layouts, noisy images, and variations in handwriting. These studies underscore the importance of developing robust and precise document recognition systems to ensure both accuracy and efficiency in financial workflows.
Traditional document recognition methods, such as CTPN and CRNN, have partially addressed text detection and recognition issues but are limited in handling multimodal information [
6]. The CTPN model is primarily designed for long-text detection, generating text proposals to locate text regions in images. While it works well with simple layouts, its accuracy significantly declines in complex layouts, noisy backgrounds, and handwritten text [
7]. Additionally, CTPN underutilizes non-textual information in images, such as barcodes, QR codes, and icons, which are crucial for enhancing document recognition by providing additional contextual information. Similarly, CRNN (Convolutional Recurrent Neural Network) is used for sequential text recognition by leveraging convolutional neural networks (CNNs) and recurrent neural networks (RNNs). While it is effective for processing continuous character sequences, CRNN struggles with handwritten or complex layout text, especially in cases with dense text or significant character variations, where recognition errors are more likely. Moreover, CRNN also fails to fully leverage non-textual information in document images, limiting its application in more complex document automation [
8,
9].
In recent years, multimodal pre-trained models such as LayoutLM and CLIP have significantly improved structured document understanding by jointly learning text, image, and layout information [
10]. For instance, LayoutLMv3 achieves an F1 score of 92% in key field extraction tasks for invoices. However, these models still face two major challenges in practical applications. On the one hand, they demand substantial computational resources. For example, LayoutLM has over 150 million parameters and requires more than 4 GB of GPU memory for single-image inference, making deployment in resource-constrained environments difficult. On the other hand, the financial document domain has unique characteristics, such as industry-specific symbols (e.g., “¥” and “tax number”) [
11], which necessitate domain-specific fine-tuning. Fine-tuning general pre-trained models typically requires large datasets and may disrupt the original cross-modal alignment capabilities, thereby affecting performance. To address these technical challenges and the limitations of existing methods, receipt detection and recognition technology based on multimodal alignment and lightweight sequence modeling has emerged. Research in this area not only advances receipt automation technology but also provides more efficient and secure solutions for enterprise financial management, tax compliance, and auditing, holding significant market demand and societal value.
This paper proposes a novel approach that integrates multimodal alignment, lightweight sequence modeling, and an end-to-end compression framework. Traditional methods primarily focus on text extraction and recognition but overlook other crucial information in images, such as layouts, logos, and barcodes. To address this, we introduce a multimodal alignment method based on CLIP (Contrastive Language-Image Pretraining). CLIP, based on contrastive learning principles, jointly learns semantic representations of images and text, enabling a deeper understanding of textual information within images [
12]. This approach achieves effective alignment and semantic correction, which enhances text detection and recognition accuracy, particularly for documents with complex layouts, such as handwritten content and documents with diverse structures.
Traditional recurrent neural networks (RNNs), such as LSTM (Long Short-Term Memory), excel in sequence modeling but are computationally intensive, making them unsuitable for efficient deployment in resource-constrained environments. To overcome this limitation, this paper proposes a lightweight sequence modeling method that combines bidirectional gated recurrent units (BiGRUs) with position-aware attention mechanisms. Compared to LSTM, BiGRU optimizes computational complexity by reducing the number of parameters and computation steps while maintaining high performance and significantly improving processing speed [
13]. To further enhance the model’s understanding of complex text, this paper introduces a position-aware attention mechanism that focuses on important information at different positions during text sequence modeling, thereby improving the comprehension of financial data, handwritten text, and symbols. By combining BiGRU with position-aware attention, this method achieves a balance between computational efficiency and model performance, enabling precise and fast automated receipt recognition that adapts to documents of varying complexity [
14].
This paper introduces a novel framework for receipt detection and recognition based on multimodal alignment and lightweight sequence modeling. Specifically, the framework integrates Contrastive Language-Image Pretraining (CLIP) with Bidirectional Gated Recurrent Unit (BiGRU) networks to optimize the synergy between image and text data. By leveraging multimodal feature alignment and lightweight sequence modeling, the framework aims to improve the accuracy and efficiency of receipt processing, particularly in resource-constrained environments such as edge devices.
In this paper, we aim to address the following key research questions:
(1) How can multimodal alignment between image and text information enhance the accuracy of receipt detection and recognition, especially in complex layouts and low-quality images?
(2) What is the effectiveness of integrating lightweight sequence modeling, specifically through the use of Bidirectional Gated Recurrent Units (BiGRU), in improving text recognition accuracy while maintaining computational efficiency for edge devices?
(3) How can the proposed framework be optimized to operate efficiently on resource-constrained edge devices while maintaining high performance in real-time receipt processing?
To evaluate the effectiveness of the proposed framework, we conduct experiments on the CORD (Consolidated Receipt Dataset), a publicly available benchmark widely used in financial document understanding tasks. The CORD dataset comprises scanned receipt images annotated with rich semantic and structural information, including key fields such as item names, total amounts, and tax details. It covers a wide variety of real-world receipt layouts, including both printed and handwritten text, and provides a comprehensive testbed for assessing both detection and recognition performance in complex document scenarios.
The remainder of this paper is structured as follows:
Section 2 provides a review of related work in the field of receipt recognition, highlighting the limitations of current methods.
Section 3 presents the proposed methodology, detailing the multimodal alignment and lightweight sequence modeling framework.
Section 4 outlines the experimental setup, including the datasets used, evaluation metrics, and performance benchmarks.
Section 5 presents the experimental results and analysis, comparing the proposed method with existing approaches.
Section 6 discusses the results and suggests potential directions for future work.
Section 7 concludes the paper with a summary of the key findings and practical implications.
2. Related Work
The development of receipt recognition technology has always focused on two core tasks: accurate localization of text regions and reliable recognition of text content. Early research primarily relied on traditional image processing techniques, template-based layout matching, and rule-based reasoning to process structured documents [
15,
16]. These approaches worked well for fixed-format layouts but exhibited limited adaptability to complex, irregular, or multilingual document types, such as receipts and tax forms [
17]. With the advancement of deep learning, neural network-based solutions have become mainstream, enabling more robust handling of diverse layout types and visual noise.
Precision in financial document processing has become an increasingly important research area due to the nature of the data involved. For instance, studies on receipt processing have shown that even small errors in recognizing tax amounts or transaction details can result in serious compliance issues. Recent work by Hesami et al. [
18] focused on improving the precision of document recognition systems, highlighting the need for robust methods capable of dealing with complex layouts and handwritten text while maintaining high accuracy. Transformer-based models, such as Vision Transformers (ViT) and Transformer-based variants of CRNN, have been explored for receipt recognition tasks due to their ability to capture long-range dependencies. However, these models face challenges related to computational efficiency and high memory requirements, particularly in real-time applications where precision is critical [
19].
Moreover, Graph Neural Networks (GNNs) have recently emerged as a promising approach for structured information extraction in financial documents, including receipts. GNNs excel at modeling relationships between entities in documents, such as the connections between fields in a receipt or between text regions and their layout. Recent studies have demonstrated that GNNs can improve the accuracy of document understanding by leveraging the inherent structure of documents, making them particularly suitable for tasks that require an understanding of hierarchical or relational data [
20]. This approach complements traditional methods by enhancing the semantic understanding of document structure, making it a valuable addition to multimodal receipt recognition frameworks.
Additionally, Handwritten Text Recognition (HTR) systems have seen significant improvements with the introduction of deep learning techniques. HTR systems specialized for recognizing handwritten content in financial documents, such as handwritten receipts and checks, often incorporate attention mechanisms and sequence modeling to handle variations in handwriting styles. Recent advancements, such as the use of Transformer-based models in HTR systems, have shown promising results in improving recognition accuracy under challenging conditions like poor image quality or non-standard handwriting styles [
15].
In the field of text detection, CTPN (Connectionist Text Proposal Network) generates text proposal boxes and incorporates Recurrent Neural Networks (RNNs) to model contextual information, providing a solid foundation for horizontal text detection [
21]. However, its ability to detect slanted or densely packed small text remains weak. To address this limitation, EAST (Efficient and Accurate Scene Text Detector) employs a fully convolutional network to directly predict geometric parameters (e.g., rotated rectangular boxes), significantly improving detection efficiency and flexibility [
22].
For text recognition, CRNN (Convolutional Recurrent Neural Network) combines CNNs and RNNs with the Connectionist Temporal Classification (CTC) loss to achieve end-to-end sequence recognition [
17]. While effective on standard datasets, CRNN struggles with domain-specific challenges in receipt documents, such as fine-grained numeric fields (e.g., tax ID numbers), blurred text on low-contrast backgrounds, and multilingual tokens in cross-border receipts. Experiments show that CRNN’s Character Error Rate (CER) increases by approximately 30% in multilingual mixed scenarios compared to single-language ones [
19], indicating its limitations in complex semantic environments.
To address the shortcomings of unimodal models, researchers have developed multimodal pre-trained models that integrate text, image, and layout information. CLIP (Contrastive Language-Image Pretraining) embeds visual and textual data into a shared semantic space using contrastive learning, showing particular advantages in semantic-level error correction (e.g., distinguishing “5” from “S”) [
23]. LayoutLM extends CLIP’s capabilities by incorporating layout coordinates, significantly improving performance in tasks like invoice field extraction [
24]. Nevertheless, such models typically consist of over 150 million parameters and require more than 4 GB of GPU memory for inference, making them unsuitable for real-time or edge device deployment. Furthermore, their understanding of financial-domain-specific symbols (e.g., “¥”, “tax ID”) is limited without extensive fine-tuning, which may compromise their cross-modal alignment.
In sequence modeling optimization, various strategies have emerged to enhance recognition accuracy. To address the path alignment bias in CTC loss, dynamic threshold adjustment methods have been proposed to improve decoding accuracy in long-sequence scenarios [
25]. Additionally, improvements to attention mechanisms—positional encoding and cross-attention—enable better spatial reasoning and semantic consistency in multi-column or complex layouts [
26,
27]. These enhancements are especially important when dealing with visually cluttered financial documents, such as receipts.
Given the rising demand for edge computing in enterprise applications, lightweight modeling has become essential. Techniques like MobileNetV3 leverage Neural Architecture Search (NAS) to create efficient convolutional modules with a strong trade-off between accuracy and speed [
28], while ShuffleNetV2 applies channel splitting and reordering to optimize inference performance on mobile devices [
29]. Compression techniques such as quantization (FP32→INT8) reduce model size by up to 75% without significantly affecting accuracy [
30], and channel pruning reduces computational complexity by up to 30% by removing redundant filters [
31].
Table 1 summarizes representative works in this field, highlighting the methodological evolution from traditional rule-based systems to modern multimodal and lightweight models [
15,
16,
17,
19,
32].
Despite the significant advancements in financial document recognition, several challenges remain unresolved. Current methods, particularly in text detection and recognition, often struggle with complex and irregular layouts commonly found in financial documents like receipts, invoices, and tax forms. While multimodal models like CLIP and LayoutLM have made strides in integrating text, image, and layout information, their application in financial domains is limited due to the models’ high computational demands and lack of specialized understanding of domain-specific symbols (e.g., “¥” and “tax ID”). Additionally, sequence modeling techniques such as CRNN face difficulties in multilingual and mixed-text scenarios, leading to higher error rates in real-world applications. Furthermore, there is a lack of lightweight solutions that can perform efficiently on edge devices while maintaining high accuracy.
This research aims to fill these gaps by proposing a multimodal alignment framework based on CLIP and BiGRU, which is not only computationally efficient but also tailored to handle the unique challenges of financial document recognition, especially in resource-constrained environments.
3. Methodology
This section outlines the research approach used in this study, which focuses on developing a receipt recognition framework through multimodal alignment and lightweight sequence modeling. The primary goal of this approach is to enhance the accuracy and efficiency of receipt recognition, particularly for complex and low-quality images, by integrating both image and textual information. The research approach utilizes CLIP for multimodal alignment and BiGRU for lightweight sequence modeling, offering a balance between computational efficiency and recognition performance.
3.1. Overall Architecture
The receipt-like financial document automation recognition system developed in this study follows a three-stage pipeline with a progressive design: Detection→Alignment→Recognition. This approach is designed to optimize the entire process, from image input to structured text output, by addressing key challenges such as:
Complex layout analysis: Financial documents like receipts often contain complex, variable layouts that may include text, logos, barcodes, and various other non-textual elements. Detecting and understanding the spatial relationship between these components is crucial for accurate text localization and recognition.
Handwriting recognition: Many financial documents, especially receipts, often feature handwritten text that varies significantly in style, size, and quality. Recognizing handwritten text in these documents requires robust models that can handle diverse handwriting variations under different conditions.
Cross-modal semantic consistency: Integrating visual and textual information from financial documents presents a challenge in ensuring semantic consistency between the image content and the recognized text. This is particularly important when distinguishing similar characters (e.g., “5” vs. “S”) or resolving ambiguities arising from noisy or low-quality images.
The architecture integrates lightweight techniques with multimodal alignment mechanisms, allowing for efficient processing and enhanced recognition accuracy. Specifically, the CLIP–BiGRU Multimodal Text Detection and Recognition Framework provides a robust solution by combining CLIP for multimodal alignment and BiGRU for sequence modeling, making it capable of handling the intricacies of receipt document recognition. The framework is illustrated in
Figure 1.
Semantic features in this context refer to both the visual and textual elements of the document that provide critical contextual information. Examples of semantic features include:
Visual Features: The positioning of text relative to other visual elements (such as logos, barcodes, and icons), the layout of text in different regions of the receipt (e.g., header, body, and footer), and the recognition of graphical elements that contribute to the document’s overall meaning.
Textual Features: The recognized characters, words, and phrases in the text, including important data such as transaction amounts, dates, product names, tax identification numbers, and item descriptions. Additionally, the relationships between these textual elements—the proximity of a price to an item name or the order in which transaction details are presented—are also considered semantic features that enhance understanding.
By integrating both visual and textual semantic features, our framework ensures that the system interprets both the content and structure of the receipt effectively, improving accuracy and efficiency in the recognition process.
In the detection stage, the improved MobileNetV3-Small is used as the backbone network, which extracts multi-scale features with minimal computational overhead through depthwise separable convolution modules optimized by Neural Architecture Search (NAS) [
33]. A Feature Pyramid Network (FPN) performs cross-layer fusion of C3–C4 features, significantly enhancing the model’s sensitivity to small-sized text (such as tax IDs and decimal places in amounts). The detection head innovatively introduces a dynamic threshold classification strategy, combined with a geometric regression loss (IoU Loss) and classification loss (Focal Loss), to accurately predict the parameters of rotated rectangular bounding boxes. Experiments show that this design improves detection recall by 12.3% in low-contrast document scenarios, effectively addressing the industry pain point of missed detection of blurred text.
In the multimodal alignment stage, the focus is on the collaborative optimization of visual and textual semantics. The text region images output from the detection stage are processed along two parallel paths: one path uses the lightweight visual Transformer architecture MobileViT-XXS to extract global semantic features; its hybrid CNN-Transformer design, with only 5.6 M parameters, balances the modeling of local details (e.g., stamp edges) and global layout (e.g., table structure). The second path feeds the preliminary recognition results generated by Tesseract OCR into the lightweight CLIP-Text encoder (with parameters compressed to 30 M), which uses contrastive learning to align the text embedding with the semantic space of the image. In the feature fusion stage, element-wise summation maps the 256-dimensional image and text features into a 128-dimensional joint semantic space. This mechanism dynamically resolves semantic ambiguities, for example correcting an OCR output of “S” back to “5”, reducing the Character Error Rate (CER) by 1.8% and significantly improving the reliability of key field recognition.
The recognition stage adopts a lightweight sequence modeling framework, balancing accuracy and efficiency through the synergistic collaboration of ShuffleNetV2, Bidirectional Gated Recurrent Units (BiGRU), and position-aware attention. ShuffleNetV2, through channel splitting and reordering operations, reduces the computational load of local feature extraction at 8 × 32 resolution by 40%. BiGRU compresses the hidden layer dimension from 512 to 256, reducing the parameter count by 60%, while alleviating long-sequence dependency issues through its gating mechanisms. The introduced sinusoidal position encoding enhances the model’s spatial awareness, yielding a 2.3-percentage-point reduction in Word Error Rate (WER) for multi-column text. The position-aware attention mechanism further integrates spatial coordinates with semantic features, demonstrating strong robustness in dense handwriting recognition.
The system achieves global optimization through an end-to-end training framework. The dynamic threshold classification loss and geometric regression loss in the detection stage are weighted 3:1 to ensure the precision of bounding box localization. The recognition stage combines CTC loss and cross-entropy loss with a 4:1 weight to balance sequence alignment stability and character classification discriminability. After INT8 quantization, the model size is compressed to 18 MB, providing a solution for efficient and practical automation of receipt-like financial document processing, with both high precision and utility.
3.2. Text Detection Module
The text detection module is the core component of the receipt-like financial document automation recognition system, designed to accurately locate text regions from complex backgrounds, especially in challenging scenarios such as small-sized text, low contrast, and rotated text. This module utilizes an improved MobileNetV3-Small + Feature Pyramid Network (FPN) architecture, combined with dynamic threshold classification and rotated bounding box geometric regression, to achieve an end-to-end optimized design. The text detection module is shown in
Figure 2.
The backbone network is based on MobileNetV3-Small, which efficiently extracts multi-level features through depthwise separable convolutions and lightweight bottleneck blocks. The network design incorporates the h-swish activation function, maintaining nonlinear expressive capabilities while reducing computational overhead. Key feature layers include: (1) the C3 feature map, with a resolution of 1/8 of the input image (64 × 64) and 160 channels, retains rich detail information, suitable for small text localization; and (2) the C4 feature map, with a resolution of 1/16 (32 × 32) and 512 channels, contains higher-level semantic information, reducing background noise interference.
To enhance multi-scale feature representation, the FPN performs cross-layer fusion of the C3 and C4 features: channel alignment is achieved by applying a 1 × 1 convolution to unify the channel count of C3 and C4 to 256, ensuring compatibility of the feature maps. The C4 feature map is then bilinearly upsampled to the resolution of C3 (64 × 64) and summed element-wise with the C3 features:

$$P_3 = \mathrm{Conv}_{1\times1}(C_3) + \mathrm{Up}_{\times 2}\big(\mathrm{Conv}_{1\times1}(C_4)\big)$$

This fusion integrates high-level semantics with fine-grained details, significantly improving the detection recall rate for small text regions.
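To make the fusion step concrete, the following PyTorch sketch mirrors the operation above under the stated channel counts (160 for C3, 512 for C4, 256 after alignment); the module and layer names are ours, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Minimal sketch of the C3/C4 cross-layer fusion described above."""
    def __init__(self, c3_channels=160, c4_channels=512, out_channels=256):
        super().__init__()
        # 1x1 convolutions align both feature maps to a common channel count.
        self.lateral_c3 = nn.Conv2d(c3_channels, out_channels, kernel_size=1)
        self.lateral_c4 = nn.Conv2d(c4_channels, out_channels, kernel_size=1)

    def forward(self, c3, c4):
        # c3: (B, 160, 64, 64), c4: (B, 512, 32, 32) for a 512x512 input.
        p4 = self.lateral_c4(c4)
        # Bilinearly upsample C4 to the C3 resolution and sum element-wise.
        p4_up = F.interpolate(p4, size=c3.shape[-2:], mode="bilinear", align_corners=False)
        return self.lateral_c3(c3) + p4_up

# Usage with dummy feature maps:
fusion = FPNFusion()
p3 = fusion(torch.randn(1, 160, 64, 64), torch.randn(1, 512, 32, 32))
print(p3.shape)  # torch.Size([1, 256, 64, 64])
```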
Traditional fixed-threshold classification tends to be error-prone in low-contrast areas. This model proposes a dynamic threshold mechanism that generates two supporting maps from the fused features: a Score Map $S(x, y)$, which describes the probability distribution of text regions, and a Threshold Map $T(x, y)$, which provides an adaptive threshold for each location. The final classification probability is obtained by applying the Sigmoid activation function:

$$P(x, y) = \sigma\big(S(x, y) - T(x, y) + b\big)$$

where $\sigma(\cdot)$ is the Sigmoid function and $b$ is the bias term. The classification loss is based on an improved Focal Loss, which addresses class imbalance by dynamically up-weighting harder-to-classify examples. In this experiment, the Focal Loss weighting factor $\alpha$ and focusing parameter $\gamma$ are tuned on the training set, which improves the recall rate of small, low-contrast text regions by 12.3% compared to fixed-threshold classification.
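A minimal sketch of the dynamic-threshold head and Focal Loss is given below, assuming the head predicts the Score Map and Threshold Map with 3 × 3 convolutions; the α = 0.25 and γ = 2.0 defaults are common Focal Loss settings, not the values used in the paper.

```python
import torch
import torch.nn as nn

class DynamicThresholdHead(nn.Module):
    """Predicts a score map S and a threshold map T, then classifies via sigmoid(S - T + b)."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.score_map = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.thresh_map = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, feats):
        s = self.score_map(feats)
        t = self.thresh_map(feats)
        return torch.sigmoid(s - t + self.bias)  # per-pixel text probability

def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Focal loss on per-pixel probabilities; alpha/gamma here are common defaults."""
    p_t = target * prob + (1 - target) * (1 - prob)
    alpha_t = target * alpha + (1 - target) * (1 - alpha)
    return (-alpha_t * (1 - p_t).pow(gamma) * p_t.clamp(min=1e-6).log()).mean()

head = DynamicThresholdHead()
prob = head(torch.randn(2, 256, 64, 64))
loss = focal_loss(prob, torch.randint(0, 2, (2, 1, 64, 64)).float())
```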
To meet the needs of text detection in inclined scenes, the detection head regresses five parameters of a rotated bounding box: the center coordinates $(x_c, y_c)$, width $w$, height $h$, and rotation angle $\theta$. The regression loss is optimized based on the IoU (Intersection over Union) metric to directly maximize the overlap between the predicted and ground-truth bounding boxes. To improve regression stability, the width $w$ and height $h$ are regressed on a logarithmic scale.
The detection module achieves global optimization through an end-to-end training framework, where the total loss is a weighted combination of the classification loss and the regression losses:

$$L_{det} = \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg}$$

where $\lambda_{cls}$ and $\lambda_{reg}$ are weighted at the 3:1 ratio noted in Section 3.1, balancing detection precision with localization accuracy. In the refinement stage, non-maximum suppression (NMS, threshold 0.4) is applied to filter out the final bounding boxes for text regions.
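The sketch below illustrates the 3:1 loss weighting and the log-scale encoding of width and height; a smooth-L1 term over $(x_c, y_c, \log w, \log h, \theta)$ stands in for the rotated-IoU loss, which requires a dedicated rotated-overlap implementation.

```python
import torch
import torch.nn.functional as F

def encode_rotated_box(box):
    """box: (..., 5) = (x_c, y_c, w, h, theta); width/height regressed on a log scale."""
    x, y, w, h, theta = box.unbind(-1)
    return torch.stack([x, y, w.log(), h.log(), theta], dim=-1)

def detection_loss(cls_prob, cls_target, pred_box, gt_box,
                   lambda_cls=3.0, lambda_reg=1.0):
    """Weighted sum of classification and box-regression losses (3:1 as in the text)."""
    cls_loss = F.binary_cross_entropy(cls_prob, cls_target)  # focal loss in the full model
    reg_loss = F.smooth_l1_loss(encode_rotated_box(pred_box), encode_rotated_box(gt_box))
    return lambda_cls * cls_loss + lambda_reg * reg_loss

pred = torch.tensor([[100.0, 40.0, 80.0, 20.0, 0.10]])
gt = torch.tensor([[102.0, 42.0, 78.0, 22.0, 0.05]])
loss = detection_loss(torch.rand(4, 1), torch.randint(0, 2, (4, 1)).float(), pred, gt)
```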
In experiments on the CORD dataset, the detection module achieves an F1 score of 93.1%, a 6.2% improvement over the line-based EAST model, with a parameter size of about 2.1 million. The value of this design lies in automating the processing of financial forms such as invoices with high accuracy, while generalizing to extended document types and other categories of data.
3.3. Multimodal Feature Alignment Module
The term ‘multimodal alignment’ refers to the process of aligning and integrating information from different modalities, such as text and images, into a shared semantic space. In our system, this involves using contrastive learning to align the textual and visual features of receipt images, ensuring that both modalities are correctly interpreted together. By performing this alignment, the system can effectively resolve ambiguities in character recognition (e.g., distinguishing similar characters like ‘5’ and ‘S’) and improve overall text detection and recognition accuracy, particularly in complex or noisy environments. It aims to resolve the semantic ambiguity problem of traditional OCR in complex scenarios through semantic collaboration between vision and text. This module consists of three parts: an image encoder, a text encoder, and feature fusion. Through cross-modal contrastive learning and lightweight design, it achieves efficient and precise semantic alignment. The multimodal feature alignment module is shown in
Figure 3.
The image encoder adopts the lightweight MobileViT-XXS architecture, combining the local perception ability of CNN with the global modeling advantage of Transformer, to extract multi-scale semantic features from cropped text region images.
Hybrid convolution-Transformer block. The input image (224 × 224) first passes through a lightweight CNN to extract local features and then through the Transformer encoder to model global dependencies. The Transformer layer uses a window attention mechanism, dividing the feature map into non-overlapping windows to reduce computational complexity:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V$$

where $B$ is the learnable position bias and $d_k$ is the attention head dimension.
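As an illustration of windowed attention with a learnable position bias, the single-head sketch below partitions the feature map into non-overlapping windows; the real MobileViT-XXS block is multi-head and differs in its unfolding details, so this is a conceptual sketch only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    """Single-head window attention with a learnable position bias (illustrative)."""
    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.window = window
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per (query, key) pair inside a window.
        self.pos_bias = nn.Parameter(torch.zeros(window * window, window * window))

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by the window size
        B, H, W, C = x.shape
        w = self.window
        # Partition into non-overlapping w x w windows -> (num_windows*B, w*w, C).
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale + self.pos_bias
        out = F.softmax(scores, dim=-1) @ v
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

attn = WindowAttention(dim=128)
out = attn(torch.randn(1, 16, 16, 128))  # (1, 16, 16, 128)
```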
Starting from a standard 12-layer ViT, this study reduces the number of encoder layers to 6 and the hidden layer dimension from 384 to 128, bringing the parameter count to 5.6 M. The output feature vector represents the global semantic representation of the cropped text region (such as an ‘amount’ text segment).
Lightweight CLIP-Text. The text encoder is based on the CLIP-Text model, which is compressed to retain its cross-modal alignment capabilities while controlling the parameter size: the model is reduced from 12 Transformer layers to 6, and the hidden dimension is decreased from 512 to 256. Knowledge distillation is used with the original CLIP as the teacher model to minimize the difference between the output distributions of the teacher and the lightweight student. The distillation loss is:

$$L_{distill} = \mathrm{KL}\!\left(\mathrm{softmax}\!\left(\frac{z_{teacher}}{\tau}\right) \Big\|\, \mathrm{softmax}\!\left(\frac{z_{student}}{\tau}\right)\right)$$

where $\tau$ is the temperature parameter, kept fixed during training.
The input is the initial recognition result from Tesseract OCR, which is tokenized with the WordPiece tokenizer. The encoder outputs text features that lie in the semantic space shared with the image features.
Feature fusion. The multimodal feature fusion is realized in two steps: semantic enhancement and redundancy removal. The image features $f_{img}$ and text features $f_{text}$ are first fused by element-wise summation along a shared pathway, preserving complementary information:

$$f_{fuse} = f_{img} + f_{text}$$

This operation can correct OCR misrecognitions. For example, when the image clearly shows ‘5’ but the OCR output reads ‘S’, the visual features dominate the fusion result.
The fused features are then mapped to a lower-dimensional space through a fully connected layer, removing redundant information and enhancing discriminability:

$$f_{out} = W_p f_{fuse} + b_p$$

where $W_p$ projects the fused features to an output dimension of 64. This reduces computational cost while preserving cross-modal relationships.
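A compact sketch of the two-step fusion is given below, assuming both modalities have already been projected into the 128-dimensional joint space before summation; the layer names and dimensions follow the description above.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Element-wise fusion of image/text features followed by a compressing projection."""
    def __init__(self, joint_dim=128, out_dim=64):
        super().__init__()
        self.proj = nn.Linear(joint_dim, out_dim)  # removes redundancy, keeps discriminability

    def forward(self, f_img, f_text):
        # Both inputs: (B, joint_dim), already aligned in the shared semantic space.
        f_fuse = f_img + f_text        # semantic enhancement by summation
        return self.proj(f_fuse)       # (B, out_dim)

fusion = MultimodalFusion()
out = fusion(torch.randn(8, 128), torch.randn(8, 128))
print(out.shape)  # torch.Size([8, 64])
```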
Cross-modal alignment optimization. To improve the alignment accuracy between image and text, a contrastive loss is introduced:

$$L_{con} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(f_{img}^{(i)}, f_{text}^{(i)})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(f_{img}^{(i)}, f_{text}^{(j)})/\tau\big)}$$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, $\tau$ is the temperature parameter, and $N$ is the batch size. This loss pulls matched image-text pairs together while pushing mismatched pairs apart, minimizing cross-modal misalignment.
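The following sketch implements a symmetric, CLIP-style version of this contrastive objective; the temperature value of 0.07 is a common default rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(f_img, f_text, tau=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    f_img, f_text: (N, D) tensors; matched pairs share the same row index.
    """
    f_img = F.normalize(f_img, dim=-1)
    f_text = F.normalize(f_text, dim=-1)
    logits = f_img @ f_text.t() / tau          # (N, N) cosine similarities / tau
    targets = torch.arange(f_img.size(0), device=f_img.device)
    # Image-to-text and text-to-image directions are averaged, as in CLIP.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(16, 64), torch.randn(16, 64))
```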
This module enables end-to-end cross-modal alignment, providing robust semantic understanding for receipt-like financial document recognition. It can be extended to various financial document scenarios, such as checks and invoices.
3.4. Text Recognition Module
The text recognition module, as illustrated in
Figure 4, is designed to address the challenges of recognizing characters with diverse morphological variations and complex spatial layouts in receipt-like financial documents, while also being constrained by the limited resources of edge devices. This module employs a lightweight sequence processing framework that balances efficiency and accuracy, comprising three core components: local feature extraction, sequence modeling, and joint loss optimization.
For local feature extraction, the ShuffleNetV2 network is used in place of the traditional MobileNetV3, achieving efficient computation through its channel splitting and rearrangement mechanisms. The input is the 64-dimensional fused feature output from the multimodal alignment module. After the initial convolution layer of ShuffleNetV2, the feature extraction phase stacks basic units. Each basic unit splits the input channels into two branches: the main branch uses a 1 × 1 convolution to increase the channel dimension, followed by a 3 × 3 depthwise separable convolution to extract spatial features, while the shortcut branch retains the original channel dimension. The outputs of both branches are concatenated and then passed through a channel rearrangement (shuffle) operation:

$$y = \mathrm{Shuffle}\big(\mathrm{Concat}(\mathcal{F}(x_1),\, x_2)\big), \qquad (x_1, x_2) = \mathrm{Split}(x)$$
This design requires only 0.8 M parameters at an input resolution of 8 × 32, achieving a 40% reduction in computational complexity compared to MobileNetV3, while maintaining sensitivity to fine-grained character details. This lightweight architecture ensures efficient performance on resource-constrained edge devices, making it particularly suitable for real-time receipt-like financial document processing tasks where both accuracy and computational efficiency are critical.
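A minimal sketch of the stride-1 basic unit (channel split, depthwise main branch, concatenation, channel shuffle) is shown below; the channel width of 64 matches the fused-feature dimension, but the exact block configuration is illustrative rather than the paper's.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Rearranges channels so information flows between the two branches."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleUnit(nn.Module):
    """Stride-1 ShuffleNetV2 basic unit: channel split -> main branch -> concat -> shuffle."""
    def __init__(self, channels=64):
        super().__init__()
        half = channels // 2
        self.main = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                  # channel split into two branches
        out = torch.cat([self.main(x1), x2], dim=1)
        return channel_shuffle(out)                 # y = Shuffle(Concat(F(x1), x2))

unit = ShuffleUnit(64)
y = unit(torch.randn(1, 64, 8, 32))                 # 8 x 32 feature resolution as in the text
print(y.shape)  # torch.Size([1, 64, 8, 32])
```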
In the sequence modeling stage, a collaborative architecture combining Bidirectional Gated Recurrent Units (BiGRU) and position-aware attention is introduced. The hidden layer dimension of the BiGRU is compressed to 256, and its gating mechanism dynamically regulates the information flow through the following formulas:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $z_t$ and $r_t$ represent the update gate and reset gate, respectively, and $\odot$ is the element-wise product. The bidirectional structure runs forward and backward GRU layers over the sequence, and their hidden states are concatenated as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, enhancing the ability to capture semantic information across time steps.
To enhance the model’s spatial awareness for tasks such as table alignment and multi-column text processing, a position-aware attention mechanism is introduced, which incorporates sinusoidal encoding to improve the sensitivity of features to positional information. Given the sequence position $pos$ and the dimension index $i$, the positional encoding matrix $PE$ is generated as:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

The output hidden states from the BiGRU are concatenated with this positional encoding, and attention weights are then computed over the sequence to focus on key positions.
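One plausible instantiation of the BiGRU plus position-aware attention is sketched below: sinusoidal encodings are added to the concatenated hidden states and a learned scoring layer re-weights key positions. The class count and layer names are hypothetical placeholders.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_encoding(seq_len, dim):
    """Standard sinusoidal positional encoding PE(pos, 2i) / PE(pos, 2i+1)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class BiGRUWithPositionAttention(nn.Module):
    """BiGRU (hidden 256 per direction) followed by position-aware attention."""
    def __init__(self, in_dim=64, hidden=256, num_classes=100):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # scores each position-augmented state
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                           # x: (B, T, in_dim)
        h, _ = self.gru(x)                          # (B, T, 2*hidden)
        h = h + sinusoidal_encoding(x.size(1), h.size(-1)).to(x.device)
        weights = F.softmax(self.attn(h), dim=1)    # (B, T, 1) position-aware attention
        h = h * weights                             # re-weight key positions
        return self.classifier(h)                   # per-timestep logits for CTC decoding

model = BiGRUWithPositionAttention()
logits = model(torch.randn(2, 32, 64))              # (2, 32, 100)
```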
The combined loss function uses Connectionist Temporal Classification (CTC) with an additional cross-entropy term, stabilizing both sequence alignment and character classification. Given an input feature sequence $X$ and the target label sequence $Y$, the total loss is defined as:

$$L_{rec} = \lambda_{ctc} L_{CTC} + \lambda_{ce} L_{CE}$$

The CTC loss marginalizes over all valid alignment paths $\pi$ that collapse to $Y$, solved efficiently through dynamic programming:

$$L_{CTC} = -\log \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi \mid X)$$

The cross-entropy loss addresses the character classification error at each time step:

$$L_{CE} = -\sum_{t}\sum_{c} y_{t,c} \log p_{t,c}$$
The experiment sets the CTC and cross-entropy weights at the 4:1 ratio described in Section 3.1. On the CORD dataset, this design achieves a 1.2% reduction in Character Error Rate (CER) compared to using a single loss function. This improvement highlights the effectiveness of the proposed approach in enhancing the accuracy of text recognition, particularly for receipt processing where precision is critical. The joint loss optimization better handles the complexities of receipt-like financial documents, such as varied character morphologies and intricate spatial layouts, while maintaining computational efficiency suitable for edge device deployment.
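A hedged sketch of the joint objective is given below; the per-timestep cross-entropy targets are faked by padding the label sequence, since true per-timestep supervision requires character-to-frame alignments not shown here.

```python
import torch
import torch.nn.functional as F

def joint_recognition_loss(logits, targets, target_lengths,
                           lambda_ctc=4.0, lambda_ce=1.0, blank=0):
    """CTC + cross-entropy joint loss with the 4:1 weighting described in the text.

    logits: (B, T, C) per-timestep class scores; targets: (B, L) label indices.
    """
    b, t, c = logits.shape
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (T, B, C) for CTC
    input_lengths = torch.full((b,), t, dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=blank, zero_infinity=True)
    # Toy per-timestep CE: pad labels to the time axis for illustration only.
    ce_targets = F.pad(targets, (0, t - targets.size(1)), value=blank)
    ce = F.cross_entropy(logits.reshape(-1, c), ce_targets.reshape(-1))
    return lambda_ctc * ctc + lambda_ce * ce

logits = torch.randn(2, 32, 100)
targets = torch.randint(1, 100, (2, 10))
loss = joint_recognition_loss(logits, targets, torch.tensor([10, 10]))
```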
3.5. Global Optimization Strategy
In the global optimization process of the receipt-like financial document recognition model, the research team developed a multi-dimensional collaborative technical framework. This framework integrates quantization compression, knowledge distillation, and structural simplification to adapt complex multi-modal models to resource-constrained edge devices. The optimization strategy not only focuses on improving computational efficiency but also achieves a dynamic balance between accuracy and speed through sophisticated mathematical design. In the following, we elaborate on the technical principles and implementation details, with key formulas presented independently to clearly illustrate their mathematical essence.
Model quantization. The core of quantization technology lies in the efficient encoding of information. While traditional FP32 models offer exceptional precision, their large size and computational overhead make them unsuitable for real-time requirements. The research team employed TensorRT’s dynamic range quantization strategy to discretize weights and activation values from continuous floating-point space to an 8-bit integer domain. This process is not a simple truncation; the scaling factor for each layer is calculated dynamically by statistically calibrating the activation distribution on a calibration dataset. For the weight matrix $W$ of a convolutional layer, the quantization formula is:

$$W_{int8} = \mathrm{round}\!\left(\frac{W_{fp32}}{s}\right)$$

where the scaling factor $s$ functions like an adaptive “optical zoom lens”, preserving essential features while compressing redundant information. When the quantized model runs on the Jetson AGX Orin device, the dequantization formula is applied:

$$\hat{W}_{fp32} = s \cdot W_{int8}$$

so that the INT8 computation results are restored to floating-point space, ensuring numerical stability. This process reduces the model size from 72 MB to 18 MB, with inference speed increased to 25 frames per second, analogous to compressing a large-scale reference resource into a compact format while preserving its essential semantic information.
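The arithmetic behind symmetric per-tensor INT8 quantization can be sketched as follows; the deployed pipeline relies on TensorRT's calibration rather than this simplified max-magnitude scale.

```python
import torch

def quantize_int8(w_fp32):
    """Symmetric per-tensor quantization: scale chosen from the maximum magnitude."""
    scale = w_fp32.abs().max() / 127.0
    w_int8 = torch.clamp(torch.round(w_fp32 / scale), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8, scale):
    """Restore INT8 values to floating point for numerically stable downstream use."""
    return w_int8.float() * scale

w = torch.randn(256, 256)
w_q, s = quantize_int8(w)
err = (w - dequantize(w_q, s)).abs().mean()
print(f"scale={s:.4f}, mean abs quantization error={err:.5f}")
```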
Knowledge distillation. The essence of knowledge distillation lies in the transfer of knowledge. The teacher model (ResNet-50 + CRNN) acts like an experienced accountant, leveraging deep convolutional networks to extract detailed features from receipts and employing bidirectional LSTMs to capture the contextual logic of fields such as amounts and dates; it achieves a CER of 3.8% on the CORD dataset. The student model (CLIP–BiGRU), in turn, resembles a bright apprentice, absorbing the “dark knowledge” in the teacher model’s outputs through temperature scaling. The loss function is designed as:

$$L_{KD} = \mathrm{KL}\!\left(\mathrm{softmax}\!\left(\frac{z_{teacher}}{T}\right) \Big\|\, \mathrm{softmax}\!\left(\frac{z_{student}}{T}\right)\right)$$

where the temperature coefficient $T$ acts like a knob that adjusts the concentration of knowledge: higher temperatures soften the probability distribution, enabling the student model not only to learn the distinction between characters like “5” and “S”, but also to understand the semantic relationships between fields such as “amount” and “tax ID”. Through distillation, the student model’s CER decreases from 6.6% to 5.1%, akin to an apprentice rapidly mastering core skills under the guidance of a seasoned mentor. This highlights the effectiveness of knowledge distillation in transferring nuanced insights from a complex teacher model to a more efficient student model, making it well suited to real-world financial applications where both accuracy and computational efficiency are critical.
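For illustration, the sketch below follows the standard temperature-scaled distillation recipe (a soft KL term plus a hard-label cross-entropy term); the temperature T = 4 and mixing weight α = 0.7 are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Temperature-scaled KL distillation plus a hard-label CE term (Hinton-style)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
loss = distillation_loss(student, teacher, torch.randint(0, 100, (8,)))
loss.backward()
```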
Channel pruning. The goal of channel pruning is to eliminate redundancy while retaining essential information. In this study, a sparse training strategy is employed, applying an L1 regularization constraint to the convolutional layers of MobileNetV3:

$$L_{total} = L_{task} + \lambda \sum_{l} \|W_l\|_1$$

where $L_{task}$ represents the task-specific loss (e.g., classification or recognition loss), $\lambda$ is the regularization coefficient that controls the strength of sparsity, $W_l$ denotes the weight matrix of the $l$-th convolutional layer, and $\|\cdot\|_1$ is the L1 norm, which encourages sparsity in the weights.
This process is akin to clearing paths through a dense forest of model parameters: the “axe” of L1 regularization forces some channel weights to approach zero, marking them as redundant paths. Subsequently, channels are ranked by an importance score based on the magnitude (L1 norm) of each channel’s weights, and channels with scores below a threshold are removed, much like pruning dead branches from a tree. After pruning, the model’s parameter count is reduced by 32% and FLOPs are decreased by 28%. Through fine-tuning with a learning rate of $1 \times 10^{-5}$, the Character Error Rate (CER) quickly recovers to pre-optimization levels, demonstrating that the lightweight network can still accurately capture structural features such as small tax numbers and complex table layouts. This approach ensures that the model remains efficient and precise, making it highly suitable for deployment in resource-constrained financial applications.
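A simplified sketch of L1-based sparsity and channel scoring is shown below; actually removing the low-scoring channels and rewiring the following layers is omitted, as it depends on the surrounding network structure.

```python
import torch
import torch.nn as nn

def l1_sparsity_penalty(model, lam=1e-4):
    """L1 penalty on convolutional weights, added to the task loss during sparse training."""
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.Conv2d))

def channel_importance(conv):
    """Importance score per output channel: L1 norm of that channel's filter weights."""
    return conv.weight.abs().sum(dim=(1, 2, 3))

def prune_mask(conv, keep_ratio=0.7):
    """Boolean mask keeping the most important channels (here the top 70%)."""
    scores = channel_importance(conv)
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.topk(k).values.min()
    return scores >= threshold

conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)
mask = prune_mask(conv)
print(f"channels kept: {int(mask.sum())}/64")
```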
5. Experimental Results and Analysis
5.1. Baseline Models and Comparative Design
To validate the effectiveness of the proposed CLIP–BiGRU model, three baseline models were selected for comparison:
CTPN + CRNN: A traditional two-stage method where CTPN generates horizontal text boxes and CRNN performs sequence recognition. It achieves a detection F1 score of 84.3% and a Character Error Rate (CER) of 8.9%.
EAST + CRNN: EAST supports rotated text detection, paired with CRNN for recognition. It improves the F1 score to 89.0%, but the model size is 95 MB (FP32), with an inference speed of 12 FPS.
Original CLIP + BiGRU: An uncompressed multi-modal model achieving a CER of 5.8%. However, it has a large parameter size of 210 MB and a slow inference speed of 9 FPS, making it unsuitable for edge deployment.
The proposed CLIP–BiGRU model, through lightweight design and compression strategies, achieves breakthroughs in accuracy, speed, and model size. This makes it highly suitable for edge deployment while maintaining competitive performance in receipt image recognition tasks.
5.2. Text Detection Performance
As shown in
Figure 6 and
Table 4, the CLIP–BiGRU model achieves an outstanding F1 score of 93.14% on the CORD dataset for text detection, surpassing the baseline EAST + CRNN model by 4.14 percentage points. This performance is driven by two key factors: first, a dynamic threshold classification head, which improves recall for low-contrast text (e.g., faint annotations on invoice edges) by 12.3%; second, a lightweight FPN design that enhances the mean IoU for small text (e.g., tax numbers) from 0.78 to 0.82. Additionally, the IoU-Smooth L1 joint loss reduces localization errors for rotated text (e.g., multi-column table headers) by 18%, ensuring precise detection of complex text layouts. These advancements collectively make the model highly effective for real-world receipt image processing.
5.3. Text Recognition Performance
As illustrated in
Figure 7 and
Table 5, the proposed model achieves a Character Error Rate (CER) of 5.1%, marking a significant 3.4 percentage point improvement over the baseline CRNN model (CER 8.5%). This enhancement is driven by several key innovations: CLIP-driven image-text feature fusion reduces the misrecognition of easily confused characters (e.g., “5” vs. “S”) by 52%, while sinusoidal positional encoding improves the handling of multi-column text, lowering the Word Error Rate (WER) from 7.8% to 5.5%. Additionally, the joint CTC and Cross-Entropy Loss approach, with a 4:1 weight ratio, addresses sequence alignment biases, boosting matching accuracy by 21%. These advancements collectively ensure robust and precise text recognition, making the model highly effective for real-world receipt image processing.
5.4. Lightweight Performance
As shown in
Table 6, through a three-stage compression strategy of quantization, pruning, and distillation, the model achieves remarkable lightweight performance on edge devices. The INT8 quantization reduces the model size to just 18 MB, a 75% compression from the original 72 MB FP32 model. With an end-to-end throughput of 25 FPS, it delivers a 108% speed boost over the baseline EAST + CRNN, processing each receipt in a swift 40 ms. Additionally, the model’s power consumption stabilizes at 12 W on the Jetson AGX Orin, marking a 33% reduction in energy usage. These optimizations ensure a perfect balance of efficiency, speed, and energy savings, making it ideal for real-time receipt image processing on edge devices.
To evaluate the framework’s adaptability across heterogeneous edge devices, we extended benchmarking to include Jetson Nano (GPU-enabled but resource-constrained) and Raspberry Pi 4B (CPU-only). Key results are summarized in
Table 7.
The performance of the CLIP–BiGRU framework varies across edge devices due to differences in computational architectures. On the Jetson AGX Orin, the model achieves 25 FPS by leveraging TensorRT’s INT8 acceleration and GPU parallelism, with no significant bottlenecks. In contrast, the Jetson Nano exhibits reduced throughput (12 FPS) due to fewer CUDA cores and lower memory bandwidth, underscoring the impact of GPU scale on inference efficiency. For CPU-only devices like the Raspberry Pi 4B, the absence of GPU/TPU support forces reliance on serial CPU execution for computationally intensive operations (e.g., convolutional layers and attention mechanisms), resulting in a limited throughput of 5 FPS. Additionally, the 4 GB RAM constrains batch processing and increases latency when handling large feature maps, further highlighting the challenges of memory bandwidth in resource-constrained environments.
These results emphasize the framework’s adaptability to heterogeneous hardware while illustrating the trade-offs between hardware capabilities and performance. On high-performance GPUs, such as the Jetson AGX Orin, the model operates efficiently with minimal bottlenecks. However, as the device’s computational capabilities decline, the inference speed and accuracy degrade, particularly on CPU-based devices like the Raspberry Pi 4B. This analysis underscores the importance of tailoring optimization strategies to specific hardware configurations to achieve the best balance of speed, accuracy, and resource utilization.
5.5. Comprehensive Comparison with SOTA Models
Table 8 provides a comprehensive performance comparison between CLIP–BiGRU and existing SOTA models on the CORD dataset. Below is a detailed analysis from multiple perspectives.
LayoutLMv3, a Transformer-based multimodal model, excels in semantic understanding but is impractical for edge deployment due to its large size (410 MB) and low speed (6 FPS). In contrast, CLIP–BiGRU combines lightweight design (MobileNetV3 + BiGRU) with a three-stage compression strategy, achieving near-LayoutLMv3 accuracy (5.1% vs. 4.8% CER) while improving speed by 4.2× (25 FPS) and reducing size by 95.6% (18 MB), making it ideal for edge devices.
Table 9 shows CLIP–BiGRU’s edge efficiency (FPS/Watt) is 2.08 (25 FPS/12 W), significantly higher than LayoutLMv3’s 0.13 (6 FPS/45 W), highlighting its suitability for mobile and embedded scenarios.
As shown in
Table 10, CLIP–BiGRU introduces multimodal alignment, leveraging CLIP to map image and text features into a unified space, reducing CER by 2.1 points. It also employs lightweight sequence modeling with BiGRU, cutting parameters by 60% and boosting speed by 108% with only a 0.3% CER increase, enabling efficient edge deployment.
On 1000 real-world receipts (including blurred, folded, and low-light samples), CLIP–BiGRU performs as shown in
Table 11. The robustness analysis shows that performance degradation under extreme conditions is controlled, with a Character Error Rate (CER) fluctuation of less than 2%, demonstrating that the lightweight design does not significantly sacrifice generalization.
The CLIP–BiGRU framework demonstrates exceptional efficiency and accuracy, making it highly suitable for edge deployment. With a compact model size of just 18 MB and a real-time processing speed of 25 FPS, the system enables seamless operation on low-power devices, ensuring practicality for resource-constrained environments. In terms of performance, the model achieves a Character Error Rate (CER) of 5.1%, closely matching state-of-the-art cloud-based models, while its key field F1 score reaches an impressive 93.14%, highlighting its robust multimodal accuracy. Furthermore, the framework’s cost efficiency surpasses existing solutions by one to two orders of magnitude, offering a highly economical and scalable solution for receipt image automation. This combination of efficiency, accuracy, and cost-effectiveness positions CLIP–BiGRU as a transformative tool for real-world applications in financial automation and beyond.
5.6. Ablation Study Analysis
To validate the effectiveness of core components in the CLIP–BiGRU framework, this section conducts module-wise ablation experiments, quantifying the contribution of each technical innovation to the final performance. Experiments are performed on the CORD test set with consistent parameters, and the results are as follows.
As shown in
Figure 8 and
Table 12, the CLIP-driven multimodal alignment significantly reduces the Character Error Rate (CER) by 1.8 percentage points, representing a relative reduction of 26.1%. This improvement is primarily attributed to two key mechanisms: cross-modal semantic correction and contextual awareness enhancement.
The cross-modal semantic correction dynamically refines OCR results, particularly for ambiguous characters. For instance, the misclassification of “5” as “S” is reduced by 45%, ensuring higher accuracy in character recognition. The contextual awareness enhancement strengthens the semantic consistency between related fields, such as “Amount” and “Total Price”. This improvement not only reduces errors but also boosts the key field F1 score by 5.5 percentage points, demonstrating the framework’s ability to better understand and process complex document structures. Together, these advancements underscore the effectiveness of CLIP–BiGRU in achieving robust and accurate multimodal alignment for receipt image understanding.
As shown in
Figure 9 and
Table 13, the position-aware attention mechanism, enhanced by sinusoidal positional encoding, reduces WER by 2.3 percentage points (29.5% relative reduction). This improvement stems from Table Alignment Optimization, which cuts word-level errors in multi-column text by 37%, and Long Sequence Modeling, which boosts alignment accuracy for long text lines (>20 characters) by 13.8 percentage points. These advancements underscore the mechanism’s effectiveness in handling complex document structures and sequences.
As shown in
Figure 10 and
Table 14, the model is optimized through quantization, pruning, and distillation. INT8 quantization reduces size by 75% (72 MB→18 MB) with a 0.2% CER increase and 2.3× speedup. Pruning improves speed by 19% (21 FPS→25 FPS), reduces parameters by 32%, and increases CER by 0.5%. Distillation recovers 0.5% CER, balancing accuracy and efficiency for edge deployment.
As shown in
Table 15, multimodal alignment and position-aware attention are the primary drivers of accuracy improvement, contributing 2.5 percentage points to CER reduction. Quantization and pruning achieve 75% model compression and 178% speed improvement with minimal accuracy loss (CER fluctuation < 1%).
6. Discussion
The CLIP–BiGRU framework demonstrates significant improvements in the robustness and accuracy of receipt recognition, particularly through the integration of multimodal alignment mechanisms. However, there are still several limitations that need to be addressed in future work to enhance the system’s practical applicability across diverse real-world scenarios.
6.1. Limitations of Multimodal Alignment Mechanism
The multimodal alignment mechanism substantially enhances the robustness of the model by integrating text and visual information. However, the performance of the framework is still heavily reliant on the accuracy of preliminary OCR results. Under challenging conditions—documents with severe blurring, uneven lighting, or physical damage—the OCR accuracy decreases significantly, which results in error accumulation during the cross-modal semantic correction process.
Experimental data indicate that when the Peak Signal-to-Noise Ratio (PSNR) of the input image falls below 15 dB, the Character Error Rate (CER) increases by approximately 40% compared to baseline scenarios. Additionally, the model struggles with recognizing unstructured text elements, such as handwritten annotations or text obscured by stamps. These difficulties arise from the limitations of local attention mechanisms, which are not fully adaptable to the irregular and highly variable layouts typically found in receipt documents.
Further testing reveals additional failure cases that highlight the model’s limitations:
Poor illumination and low contrast. The model struggles with low-contrast or poorly illuminated text, which leads to an increased Character Error Rate (CER) due to the difficulty in distinguishing text from the background.
Distorted documents. Skewed or rotated text, such as that caused by document folding or crumpling, leads to misalignment between the text and image features, causing errors in character recognition.
Physically damaged documents. Missing or faded text, especially from torn or damaged receipts, causes failures in text detection, particularly for critical fields like total amounts or merchant names.
Handwritten text. Recognition of handwritten annotations remains particularly challenging, especially with non-standard or highly decorative handwriting.
6.2. Challenges in Edge Deployment
Deploying the model on resource-constrained edge devices presents two main challenges:
Model compression and inference speed. While INT8 quantization allows for a 75% compression in model size, the inference speed improvement is less than expected, achieving only 60% of the theoretical value on low-power devices like those based on ARM Cortex-M7, which lack dedicated integer matrix acceleration units.
Sparse computation support. The support for sparse computations varies significantly across different edge devices, such as the Jetson AGX Orin and Raspberry Pi 4B. As a result, the channel pruning rate must be dynamically adjusted (ranging from 20% to 50%) to optimize performance according to the hardware limitations of each device. Additionally, the non-uniform support for mixed-precision computation on certain embedded processors leads to cascading quantization noise, which particularly affects long-text sequence recognition tasks.
These challenges highlight the need for further optimization to ensure efficient deployment across a variety of edge devices with different computational architectures.
6.3. Comparison with State-of-the-Art Models
A critical question in the field of document recognition is how the proposed CLIP–BiGRU framework compares to state-of-the-art (SOTA) models, such as ChatGPT and DeepSeek, particularly for tasks like receipt recognition. Although these models excel in general-purpose natural language processing and multimodal tasks, their application to receipt recognition is still limited, as they are primarily designed for natural language understanding and generation rather than structured document recognition.
The CLIP–BiGRU framework outperforms traditional methods like EAST + CRNN and CTPN + CRNN in both accuracy and model size. However, a direct comparison with ChatGPT and DeepSeek was not conducted, as these models are not optimized for receipt image recognition. Future work could explore integrating large multimodal models for receipt processing as these models evolve to handle more diverse types of visual and textual inputs.
6.4. Limitations of the CORD Dataset
One limitation of the current study is the exclusive use of the CORD dataset, which mainly contains retail receipts. While the dataset offers substantial diversity in receipt layouts and content, it does not encompass the full spectrum of financial documents, such as tax forms, account statements, and insurance claims. Therefore, the generalizability of the proposed framework to other financial document types is limited.
Future work should aim to validate the framework on additional datasets that cover a broader range of financial document types to assess its performance and adaptability in more varied real-world scenarios.
6.5. Future Work Directions
Looking ahead, several technological advancements can significantly improve the model’s performance and its deployment in real-world financial environments. These advancements are proposed in three key dimensions:
Financial domain-adaptive pre-training. Future research could focus on developing a financial domain-adaptive pre-training paradigm. This would incorporate domain-specific knowledge such as fixed field layouts in value-added tax invoices and multilingual encoding rules in cross-border financial documents. Pre-training tasks such as occluded field reconstruction and cross-page semantic continuity prediction could enhance the model’s ability to handle complex and diverse document layouts more accurately.
Incremental lightweight framework. Another promising direction is to develop an incremental lightweight framework that combines online knowledge distillation with dynamic sparse training. This would allow the model to optimize its parameters based on the computational capabilities of the device and the layout variations in financial documents. Preliminary tests have demonstrated a 0.8% CER elastic recovery capability during deployment in laboratory settings, showing great potential for further enhancing real-time performance on edge devices.
Graph neural network-based error correction engine. Introducing a graph neural network-based multimodal error correction engine could address semantic-level errors such as missing tax number digits or mismatched currency units. By constructing a triple knowledge graph of text–layout–tax code, this engine could systematically correct errors and improve the F1 score of key field recognition by up to 3.2 percentage points. Preliminary validation shows significant accuracy improvements in complex receipt processing tasks.
These proposed advancements offer both theoretical depth and engineering feasibility, providing a clear roadmap for transitioning the CLIP–BiGRU framework from laboratory validation to large-scale industrial deployment in financial automation systems.
7. Conclusions
In this study, we propose a novel approach for receipt recognition based on multimodal alignment and lightweight sequence modeling, achieving high performance in both accuracy and efficiency. Our framework significantly improves the recognition of receipt data by integrating textual and visual information in a unified manner. Despite the strong results demonstrated in real-world applications, several limitations remain. First, the model’s performance may degrade under extreme conditions, such as when receipts are severely blurred or have low contrast. Although the robustness analysis shows controlled degradation, performance could still be compromised in these challenging scenarios. Second, while the CORD dataset offers a comprehensive representation of retail receipts, its scope is somewhat limited, and the generalizability of our model to other types of receipts or financial documents, such as invoices or tax forms, requires further validation on a broader range of datasets.
From a theoretical perspective, our work contributes to the field of receipt recognition by advancing multimodal alignment techniques—specifically through contrastive learning that integrates image and text information. This approach not only enhances the accuracy of recognition in complex receipt layouts but also offers a solution for efficient real-time processing on resource-constrained edge devices. The incorporation of lightweight sequence modeling provides an effective method for handling large-scale receipt data while maintaining computational efficiency, which has significant implications for real-time applications.
In future work, we plan to focus on further improving the model’s robustness in recognizing handwritten text and enhancing its performance in extreme conditions. Additionally, we aim to explore the scalability of our approach for a wider variety of receipt types and document layouts, which will help extend its practical applications across different industries and environments.