1. Introduction
Although Electronic Data Interchange (EDI) has been widely adopted for many years, paper documents remain the primary vouchers in multi-tier distribution and cross-border trade. For standardized documents with fixed formats (such as VAT invoices and bank bills), processing technologies based on OCR and template matching are well developed. However, these technologies struggle to achieve effective structured extraction from non-standard documents with diverse structures and complex layouts. Meanwhile, because enterprises lack unified standards for describing physical items, traditional character-matching methods map handed-over goods with low accuracy, which impairs the efficiency of product handover between enterprises in the supply chain. With their logical reasoning and semantic generalization capabilities, large language models offer a new approach to overcoming document structure differences and realizing complex semantic mapping. Nevertheless, when deployed at the logistics edge, where GPU memory is limited, lightweight models still suffer from inherent drawbacks such as knowledge bottlenecks, logic drift, and unstable output formats.
The analysis of inbound documents and the matching with goods constitute the core link of physical handover in the supply chain, encompassing two subtasks: firstly, converting bills of lading, packing lists, etc. into structured data through intelligent document processing technology; secondly, utilizing retrieval technology to establish a mapping association between recognition results and the WMS system, ensuring the consistency between data and physical goods, and completing the inbound process. Currently, there is limited academic research on the analysis of inbound document information and the matching with purchased goods data. Existing literature primarily focuses on the evolution of single technologies such as intelligent document processing or text retrieval, lacking systematic exploration of the deep integration of large model capabilities with complex supply chain inbound scenarios.
Intelligent document processing (IDP) technology has evolved from the initial stages of character recognition and template matching to a semantic understanding and end-to-end generation paradigm centered around large language models [1]. Early deep learning models, such as the LayoutLM series, achieved the fusion of text geometric coordinates and semantic features by introducing 2D positional embeddings, laying the foundation for understanding visually rich documents [2]. Subsequently, the emergence of OCR-free architectures such as Donut simplified the recognition process and reduced the problem of error propagation in OCR [3]. Currently, LLM-driven IDP systems, leveraging their zero-shot generalization capabilities and instruction fine-tuning strategies, have significantly enhanced the stability in handling “long-tail” documents in cross-border logistics [2,4]. In visually rich document understanding (VRDU) architectures, research progress has focused on enhancing layout awareness, such as the layout chain-of-thought in LayoutLLM, supporting high-resolution processing [5], and implementing lightweight multimodal extensions (such as the error correction capability in DocLayLLM) [6]. Addressing specific challenges in the logistics domain, such as complex table extraction, the TSR-OCR-UQ framework introduces conformal prediction (CP) to quantify the uncertainty of extraction results, automatically triggering manual review for high-risk areas and reducing the verification workload by approximately 53% [7]. For handling personalized handwriting, MetaWriter provides a solution based on meta-learned prompt fine-tuning [8]. Future development trends point towards agentic workflows, such as the DocAgent framework, which addresses the contextual understanding limitations of extremely long documents by simulating multi-agent collaboration inspired by human browsing habits [8].
Information retrieval technology has evolved from literal matching to deep semantic understanding [9]. The initial lexical retrieval systems, with the inverted index as their core architecture, laid the foundation through the Vector Space Model (VSM) and TF-IDF [10] and reached their peak with the Okapi BM25 probabilistic model [11], which addressed the keyword-based ranking problem. However, the “vocabulary gap” of lexical models gave rise to semantic retrieval. This leap began with static word embeddings such as Word2Vec and GloVe, based on the distributional hypothesis [12,13], which map words to dense vector spaces. The real revolution occurred with the introduction of the Transformer architecture and the BERT model [14], enabling context-aware dynamic semantic representation and improving the accuracy of query intent understanding. To balance accuracy against efficiency in large-scale retrieval, the industry has shifted to dual-encoder architectures such as Dense Passage Retrieval (DPR) [15], combined with innovative models such as ColBERT (late interaction) [16] and SPLADE (sparse neural expansion) [17]. Current cutting-edge technologies focus on hybrid retrieval frameworks (such as RRF fusion) [18], which balance the advantages of lexical and semantic retrieval, and on close integration with Retrieval-Augmented Generation (RAG) models [19] to jointly address the knowledge timeliness and factual accuracy issues of large language models, indicating the evolution of retrieval systems towards a “context engineering” perception layer [20,21,22].
To adapt to 24 GB VRAM hardware (e.g., RTX 4090), this paper selects models with fewer than 10 billion parameters. A 10B-parameter model occupies approximately 20 GB of VRAM when loaded in FP16 precision, leaving only marginal memory for KV Cache and long-document inference, which represents the limit for single-card deployment [23]. However, such models suffer from inherent defects: supply chain knowledge bottlenecks caused by limited parameter capacity, logic drift during long-chain reasoning, and unstable output formats when generating complex nested JSON.
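The 20 GB figure follows from simple arithmetic: 10 billion parameters at 2 bytes each. The short sketch below reproduces it. The function name `weight_memory_gb` is illustrative, and the calculation covers model weights only, not the KV cache or activations.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(10e9, 2)    # FP16 stores 2 bytes per parameter
int4_gb = weight_memory_gb(10e9, 0.5)  # 4-bit quantization stores 0.5 bytes per parameter
headroom_gb = 24 - fp16_gb             # what is left on a 24 GB card for everything else
print(fp16_gb, int4_gb, headroom_gb)
```

Under these assumptions an FP16 10B model leaves roughly 4 GB for everything else, which is why 10B is treated as the single-card ceiling.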
This study addresses logistics document processing as a document-to-entity alignment problem under edge-computing constraints. The task is not limited to recognizing text from document images; it requires converting heterogeneous document images into schema-compliant JSON records and then aligning noisy, abbreviated, or semantically inconsistent material descriptions with standardized enterprise SKU records. Therefore, the central challenge lies in jointly managing visual extraction errors, semantic ambiguity, database-scale candidate search, and local inference latency.
Research Gap and Positioning
To clarify the methodological position of this study, Table 1 compares RRA-Logis with closely related research directions in multimodal document understanding, OCR-free extraction, multimodal large models, and semantic entity alignment. Existing methods usually focus on either layout-aware field extraction, image-to-text generation, general visual-language reasoning, or text-based entity retrieval. In contrast, the target problem in this paper requires joint handling of visual document noise, non-standard material descriptions, enterprise SKU master-data alignment, and edge-computing latency constraints.
Therefore, the contribution of RRA-Logis lies not in the isolated use of LoRA fine-tuning, dense retrieval, or LLM-based reranking, but in organizing these components into a confidence-gated decision mechanism for logistics document-to-entity alignment under resource-constrained deployment conditions.
The main contributions of this study are as follows.
We formulate logistics document processing as a document-to-entity alignment problem under edge-computing constraints. This formulation goes beyond conventional document extraction by coupling schema-constrained visual information extraction with downstream SKU-level semantic alignment.
We propose a confidence-gated, resource-constrained alignment mechanism for logistics documents. The mechanism combines vector candidate recall, LLM-based semantic verification, fused confidence scoring, and margin-based uncertainty control to determine whether a prediction should be automatically accepted or routed to human verification.
We construct the L-Doc-2K benchmark for logistics document extraction and material matching and describe its generation, labeling, splitting, and feedback process to support reproducibility.
We evaluate the framework under edge-computing constraints using extraction accuracy, JSON compliance, retrieval accuracy, latency, robustness, public-dataset transfer, and sensitivity analysis. The evaluation is intended to assess both accuracy and deployment feasibility rather than only reporting a single recognition score.
2. Research Methodology
This study follows a design-science research methodology because the objective is to develop and evaluate an artifact for a practical logistics document-processing problem. The research procedure consists of five steps.
First, the practical problem was defined through the inbound logistics workflow. The target task was decomposed into two coupled subtasks: structured extraction from document images and semantic alignment between extracted material descriptions and standardized enterprise SKU records.
Second, a benchmark dataset was constructed. The L-Doc-2K dataset contains logistics document images generated from enterprise-style templates and desensitized business fields. To reduce the gap between synthetic and operational data, the generation process introduced layout variation, font variation, printing degradation, geometric distortion, low-light conditions, blur, stains, wrinkles, and partial occlusion. The dataset was divided into training, validation, and test subsets at the template level to reduce leakage across visually similar layouts.
Third, the RRA-Logis framework was developed. The extraction module uses a lightweight multimodal LLM with LoRA/QLoRA instruction fine-tuning to map document images to a predefined JSON schema. The alignment module first retrieves candidate SKU records through dense vector similarity and then performs LLM-based re-ranking within the top-N candidate set. A confidence-gated verification module routes uncertain fields to human review.
Fourth, comparative experiments were designed. The extraction task was evaluated against OCR + rule-based extraction, OCR + sequence labeling, native multimodal LLMs, and fine-tuned multimodal LLMs. The retrieval task was evaluated against edit-distance matching, vector retrieval, direct LLM prompting, and the proposed two-stage retrieval method.
Fifth, robustness and external validity were analyzed. Controlled perturbations were applied to the test set to evaluate sensitivity to image quality degradation. Cross-template and external real-document validation are discussed as necessary steps for assessing generalization beyond the Sim-to-Real distribution.
3. RRA-Logis Framework for Logistics Document-to-Entity Alignment
3.1. Design Rationale and Overall Architecture
RRA-Logis is designed for a constrained but common industrial scenario: logistics documents must be processed locally or near the edge, while the extracted material descriptions must be mapped to enterprise master data despite inconsistent naming, abbreviations, visual recognition noise, and template variation. The methodological contribution of RRA-Logis is not the isolated use of LoRA fine-tuning, dense retrieval, or LLM-based re-ranking, since these techniques have been studied in existing work. Rather, RRA-Logis organizes these components into a confidence-gated decision mechanism that allocates computational and human resources according to semantic ambiguity and prediction confidence. Clear cases are processed automatically through low-cost retrieval, ambiguous cases are verified by LLM reasoning over a reduced candidate set, and low-confidence cases are routed to human verification before entering downstream ERP or WMS workflows.
The core design goal of RRA-Logis is to achieve accurate document parsing and semantic structuring under limited edge-side computing resources. As shown in Figure 1, the framework follows the principles of data-driven adaptation, end-to-end multimodal reasoning, and closed-loop data evolution. Its workflow contains three core modules: data construction and feedback, schema-constrained multimodal extraction, and confidence-gated semantic entity alignment.
3.2. Core Modules and Workflow
Following the roadmap in Figure 1, RRA-Logis is organized around three methodological modules. The data construction and feedback module builds the L-Doc-2K benchmark and supports incremental improvement through human-verified samples. To address the scarcity of accessible enterprise logistics documents, the initial data are generated from desensitized business fields and enterprise-style templates, while operational corrections are stored as high-confidence samples for periodic model updating. This design allows the dataset to evolve with the business environment rather than remain a fixed synthetic resource.
The schema-constrained multimodal extraction module is responsible for converting document images into standardized JSON records. Under the 24 GB VRAM deployment constraint, the module replaces the traditional OCR detection-recognition-extraction cascade with a lightweight multimodal LLM. Qwen2.5-VL is adapted with LoRA instruction tuning through LLaMA-Factory, and structured prompts constrain the output space to logistics-specific fields. This module therefore maps document pixels directly to structured semantic fields while reducing error accumulation across separate OCR and extraction stages.
The confidence-gated semantic alignment module maps extracted material descriptions to enterprise SKU master data. To process large SKU databases with limited computation, the module first performs vector-based candidate recall to obtain a compact Top-N candidate set. It then applies LLM-based semantic verification within this reduced set and uses confidence and score-margin criteria to decide whether the result can be accepted automatically or should be routed to human verification. This design explicitly balances matching accuracy, inference latency, and manual-review cost.
4. Schema-Constrained Multimodal Extraction
This module adopts Qwen2.5-VL-7B as the multimodal backbone and aims to reduce error accumulation caused by traditional cascaded processing, where text detection, recognition, and named entity extraction are separated. The dynamic-resolution mechanism helps preserve small characters and fine-grained visual cues in logistics documents. After LoRA instruction tuning, the model becomes more sensitive to logistics terminology and document-layout patterns. Structured prompts further constrain the output to a predefined JSON schema, making the extracted results easier to parse by downstream WMS or ERP workflows.
The architecture is centered around a lightweight large language model, which achieves direct mapping from unstructured images to standardized data through efficient parameter fine-tuning and structured prompt engineering.
4.1. Backbone Model and Deployment Constraint
This article adopts Qwen2.5-VL as the foundational architecture. This model features a dynamic resolution mechanism, enabling it to process logistics document images with varying aspect ratios while retaining fine-grained visual features. Compared to traditional cascaded models, this architecture integrates visual perception and semantic understanding within a unified parameter space. It can infer ambiguous or incomplete characters through visual context, thereby enhancing the robustness of recognition.
4.2. Parameter-Efficient Instruction Tuning with LoRA
To adapt the general multimodal model to logistics terminology and document-layout patterns, this study uses the LLaMA-Factory framework to implement LoRA/QLoRA instruction fine-tuning.
Let $W_0 \in \mathbb{R}^{d \times k}$ denote a frozen weight matrix in the original multimodal language model. LoRA injects a trainable low-rank update into selected linear projection layers:

$$W = W_0 + \Delta W = W_0 + \frac{\alpha_l}{r} B A$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices, $r \ll \min(d, k)$ is the LoRA rank, and $\alpha_l$ is the LoRA scaling factor. During fine-tuning, $W_0$ remains frozen and only $A$ and $B$ are updated. This parameter-efficient strategy reduces memory consumption and enables logistics-domain adaptation under the 24 GB VRAM constraint.
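To make the memory saving concrete, the sketch below counts trainable parameters for a single adapted layer. It assumes the standard LoRA shapes, with the two low-rank matrices having r columns and r rows respectively. The 4096 x 4096 layer size is a hypothetical example, not a value taken from the backbone model.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k) for one adapted d x k layer."""
    return r * (d + k)


d, k, r = 4096, 4096, 16   # hypothetical layer shape; rank 16 as in the experiments
full = d * k               # frozen parameters in W0
lora = lora_trainable_params(d, k, r)
ratio = lora / full
print(f"{lora} trainable vs {full} frozen ({ratio:.2%})")
```

For this illustrative layer the adapter adds under 1% of the frozen parameter count, which is consistent in order of magnitude with the small trainable fraction that LoRA is meant to achieve.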
The instruction dataset is organized as image-query-target JSON triplets. By exposing the model to diverse logistics document layouts and field constraints, the fine-tuning process teaches the model to generate schema-compliant outputs rather than unconstrained free text.
4.3. Structured Prompting and JSON Schema Constraint
Building on the LLaMA-Factory instruction fine-tuning described above, this paper designs structured prompts to constrain the output space of the model.
The system prompt defines a strict JSON schema and requires the model to map recognition results to preset fields.
The system prompt uses a unified JSON schema throughout the experiments: “Please extract the logistics document into the following JSON schema: {‘document_id’:‘…’,‘supplier’:‘…’,‘shipping_company’:‘…’,‘items’:[{‘name’:‘…’,‘model’:‘…’,‘specification’:‘…’,‘quantity’:‘…’,‘unit’:‘…’}]}. If a field is not visible, return an empty string rather than inventing a value.” This unified schema improves comparability across extraction experiments and downstream database parsing.
LoRA fine-tuning improves the model's sensitivity to logistics terminology while maintaining its generalization ability.
By incorporating typical logistics document processing examples into the prompts, the model is guided to understand the boundaries of professional terminology, which reduces formatting errors caused by LLM hallucination, as shown in Figure 2. The prompts also require the model to give a brief visual positioning description before extraction, using its reasoning ability to assist in the precise alignment of complex tabular data.
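A downstream parser can check schema compliance before a record reaches the WMS or ERP. The following is a minimal sketch using only the Python standard library. It assumes every field in the prompt schema is a string, as the placeholder values in the prompt suggest. The function name `check_extraction` and the error messages are illustrative.

```python
import json

TOP_FIELDS = ("document_id", "supplier", "shipping_company")
ITEM_FIELDS = ("name", "model", "specification", "quantity", "unit")


def check_extraction(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means the output is compliant."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top level is not an object"]
    problems = [f"missing or non-string field: {f}"
                for f in TOP_FIELDS if not isinstance(record.get(f), str)]
    items = record.get("items")
    if not isinstance(items, list):
        return problems + ["items is not a list"]
    for idx, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"items[{idx}] is not an object")
            continue
        problems += [f"items[{idx}] missing or non-string field: {f}"
                     for f in ITEM_FIELDS if not isinstance(item.get(f), str)]
    return problems


good = ('{"document_id": "D-001", "supplier": "", "shipping_company": "X", '
        '"items": [{"name": "bolt", "model": "M8", "specification": "", '
        '"quantity": "10", "unit": "pcs"}]}')
print(check_extraction(good))           # compliant record, empty strings allowed
print(check_extraction('{"items": 3'))  # truncated model output
```

Empty strings pass the check because the prompt asks the model to return an empty string for fields that are not visible. Whether an empty required field should still trigger review is a separate business rule.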
4.4. Training Objective
Given a document image $I$, a structured instruction prompt $P_s$, and a target JSON token sequence $Y = (y_1, \ldots, y_T)$, the extraction model is trained by minimizing the conditional negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta\left(y_t \mid I, P_s, y_{<t}\right)$$

Here, $P_\theta$ denotes the multimodal language model after LoRA adaptation, $T$ is the length of the target sequence, $y_{<t}$ denotes previously generated tokens, and $P_s$ specifies the required schema and field constraints. Invalid JSON outputs are penalized during evaluation rather than directly during token-level training. This objective enables the model to learn the mapping from visual document content to structured semantic fields.
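As a numerical illustration of this objective, the sketch below computes the negative log-likelihood of one target sequence from per-token probabilities. The probabilities are made-up values standing in for what the model would assign to each correct token. In practice a training framework computes this loss from logits.

```python
import math


def sequence_nll(token_probs: list[float]) -> float:
    """Negative log-likelihood of a target sequence.

    token_probs[t] stands for the probability the model assigns to the correct
    token at step t, conditioned on the image, the prompt, and earlier tokens.
    """
    return -sum(math.log(p) for p in token_probs)


confident = sequence_nll([0.9, 0.8, 0.95])  # model mostly agrees with the target JSON
uncertain = sequence_nll([0.5, 0.4, 0.3])   # model often prefers other tokens
print(confident, uncertain)
```

The loss is lower when the model places more probability on the target tokens, which is the quantity that fine-tuning drives down.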
5. Confidence-Gated Semantic Entity Alignment
The alignment method integrates vector-space retrieval [15,16,17,18] with lightweight LLM inference. After structured JSON extraction, material descriptions are searched against the inventory database. Because extracted descriptions may differ from standard SKU names through abbreviations, aliases, synonyms, or recognition noise, vector embeddings are used to compute semantic similarity between the query and inventory entries before LLM-based verification is applied.
To bridge the semantic gap between unstructured incoming documents and standardized enterprise master data while satisfying real-time deployment constraints, this paper proposes a confidence-gated resource-constrained alignment framework. The framework differs from a standard coarse-to-fine retrieval pipeline because it is coupled with schema-constrained multimodal extraction and explicitly models uncertainty in the final decision. It consists of three stages: low-cost vector candidate recall, LLM-based semantic verification within a small candidate set, and confidence-gated automatic acceptance or human review.
5.1. Vector-Based Candidate Recall
The goal of the first stage is to quickly recall the Top-N potential matches from a large inventory database, thereby narrowing the search space to a candidate window that can be processed efficiently by the LLM reranker.
To capture the multidimensional characteristics of each SKU, this paper defines a text serialization function T(·). For any SKU record e_j in the database, the standard name, model, unit, and specification parameters are serialized into a unified textual representation T(e_j).
All generated inventory vectors V = {v1, v2, …, v|S|} are indexed in Faiss, and an approximate nearest-neighbor index based on HNSW (Hierarchical Navigable Small World) is constructed for efficient vector retrieval.
After the fine-tuned extraction model obtains a non-standard material description q, the system encodes q into a query vector v_q and retrieves the nearest inventory vectors according to cosine similarity. The top-N records with the highest similarity scores form the candidate set for the second-stage reranking model.
Based on the similarity score, the system extracts the original SKU information corresponding to the top-N candidate vectors, where N = 10 in this experiment, forming a candidate set Cq = {c1, c2, …, cN}.
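The recall stage can be sketched in a few lines. For clarity, this version swaps the Faiss HNSW index for exact brute-force cosine search and uses toy 3-dimensional vectors in place of real embeddings. The separator used in `serialize` and all function names are assumptions made for illustration.

```python
import math


def serialize(sku: dict) -> str:
    """Serialization T(e_j): join name, model, unit, and specification into one string."""
    return " | ".join(sku[k] for k in ("name", "model", "unit", "specification"))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_top_n(query_vec, inventory_vecs, n=10):
    """Exact top-N by cosine similarity; returns (index, score) pairs, best first."""
    scored = [(j, cosine(query_vec, v)) for j, v in enumerate(inventory_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy vectors standing in for embeddings of serialized SKU records.
inventory = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
candidates = recall_top_n([1.0, 0.02, 0.0], inventory, n=2)
print(serialize({"name": "bolt", "model": "M8", "unit": "pcs", "specification": "8x40"}))
print(candidates)
```

Exact search scans every inventory vector, so it only suits small catalogs. An approximate index such as HNSW trades a small loss in recall for much faster lookup on large SKU databases.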
5.2. LLM-Based Semantic Verification and Re-Ranking
Because noisy text can interfere with lightweight models when they map descriptions directly to SKU records, this paper uses LLM inference to perform deep semantic comparison within a reduced candidate space.
The Top-N candidate set output from Stage 1 and the original recognized text are encapsulated in a dedicated inference template. Through prompt engineering, the model is guided to perform multi-step inference analysis:
Step 1: Attribute disassembly. Compare the candidate product with the target text item by item, focusing on specifications, units, and models.
Step 2: Difference analysis. Analyze whether discrepancies in the recognition results stem from synonym substitution or from missing or misrecognized characters (for example, a specification such as “500 mL” recognized with one visually similar but incorrect character).
Step 3: Final determination. Based on the degree of attribute matching, output the final SKU entry number or a rejection signal.
By limiting the search space, the lightweight model avoids long sequence contexts, which helps reduce token generation time for localized deployment while maintaining inference accuracy.
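One way to assemble such an inference template is sketched below. The prompt wording, the `REJECT` keyword, and the `id` and `text` keys of each candidate are hypothetical choices for illustration, not the template used in the experiments.

```python
def build_rerank_prompt(query: str, candidates: list[dict]) -> str:
    """Assemble a three-step verification prompt over a small candidate set."""
    lines = [
        "You are verifying a material description against candidate SKU records.",
        f"Recognized text: {query}",
        "Candidates:",
    ]
    for i, c in enumerate(candidates, start=1):
        lines.append(f"{i}. id={c['id']} | {c['text']}")
    lines += [
        "Step 1: compare specification, unit, and model for each candidate.",
        "Step 2: decide whether differences are synonyms or missing characters.",
        "Step 3: answer with the matching id, or REJECT if none matches.",
    ]
    return "\n".join(lines)


prompt = build_rerank_prompt(
    "hex bolt M8 8x40 pcs",
    [{"id": "SKU-1", "text": "Hexagon bolt | M8 | pcs | 8x40"},
     {"id": "SKU-2", "text": "Hexagon nut | M8 | pcs | 8"}],
)
print(prompt)
```

Because the prompt grows by one line per candidate, keeping the candidate set small bounds the context length that the lightweight model must process.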
For each query q extracted from the document, the vector retrieval stage returns a candidate set Cq = {c1, c2, …, cN}. For each candidate ci, the normalized vector similarity is denoted as si, and the LLM semantic verification confidence is denoted as gi. The fused alignment score is calculated as

$$F_i = \lambda s_i + (1 - \lambda) g_i$$

where $\lambda \in [0, 1]$ controls the trade-off between retrieval similarity and LLM reasoning confidence.
Let F(1) and F(2) denote the highest and second-highest fused scores, respectively. The confidence margin is defined as m = F(1) − F(2). The final result is automatically accepted only when F(1) ≥ tau and m ≥ delta; otherwise, the sample is routed to manual verification. Here, tau is the minimum confidence threshold and delta is the ambiguity-margin threshold.
5.3. Confidence-Gated Decision and Human Verification
To reduce the risk of silent errors in logistics operations, RRA-Logis introduces a confidence-gated decision rule. A field or material-matching result is routed to manual review if the generated output violates the predefined JSON schema, a required field is empty or inconsistent with business rules, the maximum fused score is below tau, or the score margin between the top two candidates is smaller than delta. This mechanism is important for logistics SKU matching because visually similar model numbers, aliases, and abbreviated descriptions may produce close candidate scores. Human corrections are stored as high-confidence samples after validation and are used for periodic incremental fine-tuning. In this way, the system does not merely combine retrieval and re-ranking; it makes an explicit deployment-oriented decision that jointly considers accuracy, latency, and manual review cost.
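The acceptance rule can be expressed compactly. The sketch below assumes the fused score is a convex combination of retrieval similarity and LLM confidence with a weight named `lam`. The default values for `lam`, `tau`, and `delta` are placeholders, not the tuned values from the validation subset. Schema and business-rule checks are left out.

```python
def gate(similarities, llm_confidences, lam=0.5, tau=0.8, delta=0.1):
    """Return ("accept", index) or ("review", index) for one query's candidates."""
    fused = [lam * s + (1 - lam) * g for s, g in zip(similarities, llm_confidences)]
    if not fused:
        return ("review", None)  # nothing was recalled, so a human must decide
    order = sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
    best = order[0]
    top = fused[best]
    runner_up = fused[order[1]] if len(order) > 1 else 0.0
    margin = top - runner_up
    decision = "accept" if top >= tau and margin >= delta else "review"
    return (decision, best)


print(gate([0.95, 0.60], [0.90, 0.20]))  # clear winner
print(gate([0.90, 0.88], [0.90, 0.85]))  # two near-identical candidates
print(gate([0.50], [0.50]))              # single weak candidate
```

The margin test is what catches visually similar model numbers: both candidates may score above the threshold, yet the small gap between them signals that automatic acceptance is unsafe.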
6. Data Construction and Human-in-the-Loop Evolution
To address the limited availability of real enterprise documents and the presence of physical noise in operational scenarios, this paper introduces a human-in-the-loop data evolution strategy. This section describes the construction of the L-Doc-2K benchmark based on Sim-to-Real generation and the feedback mechanism used for incremental learning.
6.1. Sim-to-Real Construction of L-Doc-2K
The L-Doc-2K dataset was constructed to simulate the document diversity observed in logistics handover scenarios. It contains 2000 base document images generated from 20 enterprise-style templates, including delivery orders, packing lists, inbound receipts, and shipping notices. Each sample contains desensitized business fields such as document number, supplier, shipping company, material name, model, specification, quantity, and unit.
To improve physical realism, the image generation process includes font variation, layout perturbation, contrast changes, printing degradation, geometric distortion, low illumination, motion blur, reflection, wrinkles, stains, and partial occlusion. These perturbations are applied after semantic labels are generated, so the structured labels remain consistent with the document content. The dataset is divided at the template level into training, validation, and test subsets to reduce leakage across visually similar layouts.
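A template-level split can be sketched as follows. The 70/15/15 ratio, the `template_id` key, and the assumption of 100 images per template are illustrative choices, since the text specifies only that the split is made at the template level.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole templates to train/val/test so no template spans two subsets."""
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)
    n_train = round(len(template_ids) * ratios[0])
    n_val = round(len(template_ids) * ratios[1])
    groups = {
        "train": template_ids[:n_train],
        "val": template_ids[n_train:n_train + n_val],
        "test": template_ids[n_train + n_val:],
    }
    return {name: [s for t in ids for s in by_template[t]]
            for name, ids in groups.items()}


# 20 templates with 100 images each (assumed uniform), 2000 images in total.
samples = [{"template_id": t, "image": f"t{t:02d}_{i:03d}.png"}
           for t in range(20) for i in range(100)]
splits = split_by_template(samples)
print({name: len(items) for name, items in splits.items()})
```

Shuffling template identifiers rather than individual images is what prevents leakage: every image of a given layout lands in exactly one subset.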
Because Sim-to-Real generation cannot fully represent every operational environment, the dataset construction is complemented by human-confirmed real samples collected during system use. These samples are desensitized, checked for label consistency, and used for incremental validation and periodic model updating rather than being mixed into the held-out test subset.
6.2. Label Alignment and Consistency Verification
Because physical transformation introduces severe geometric distortion into the image, the original digital coordinates cannot be used directly, as shown in Figure 3. This paper adopts a mapping algorithm based on feature point matching to project the coordinate system of the electronic master into the space of the captured real image.
Specifically, using the positioning markers at the four vertices of the document and combining a homography transformation with local thin-plate spline (TPS) interpolation, the initial semantic labels are automatically mapped onto the deformed images. Finally, manual sampling verification confirms that the annotation accuracy across the 2000 images exceeds 99%.
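The homography step of this projection can be illustrated as below. The sketch covers only the application of a known 3 x 3 homography to a label box. Estimating the matrix from the four corner markers and the local TPS refinement are both omitted, and the example matrix is hypothetical.

```python
def apply_homography(H, point):
    """Map (x, y) through a 3x3 homography H given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to return to image coordinates


def project_box(H, box):
    """Project the four corners of an axis-aligned label box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    corners = ((x0, y0), (x1, y0), (x1, y1), (x0, y1))
    return [apply_homography(H, c) for c in corners]


# Hypothetical homography: scale by 2 and shift by (10, 5). A real matrix would
# be estimated from the positioning markers and would include perspective terms.
H = [[2.0, 0.0, 10.0], [0.0, 2.0, 5.0], [0.0, 0.0, 1.0]]
print(project_box(H, (0, 0, 100, 20)))
```

Projecting all four corners rather than two is deliberate: under a general homography a rectangle maps to an arbitrary quadrilateral, so the opposite corners alone do not describe the projected label region.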
The final L-Doc-2K dataset lays a solid data foundation for enhancing the performance of subsequent models in complex environments.
6.3. Human-in-the-Loop Feedback for Incremental Updating
Although Sim-to-Real generation provides an initial training resource, the data distribution in operational logistics scenarios changes over time. To address this issue, the framework introduces a human-in-the-loop feedback mechanism in which corrected operational samples become validated data for subsequent incremental updates, as shown in Figure 4.
During use, operators can correct low-confidence or incorrect fields through lightweight selection or text editing. These manually confirmed values are stored after desensitization and consistency checking, and they are accumulated as an incremental training set. Periodic fine-tuning on these real-distribution samples allows the model to adapt to new templates, environmental noise, and enterprise-specific material descriptions.
7. Comparative Experimental Results and Discussion
7.1. Experimental Protocol and Reproducibility
To improve reproducibility and to clarify the evidential scope of the experiments, all evaluations follow a unified protocol. The L-Doc-2K samples are separated into non-overlapping training, validation, and test subsets at the template level to reduce layout leakage. The validation subset is used for prompt checking, candidate-size selection, fusion-weight selection, and confidence-threshold selection, while the test subset is used only for final reporting. For the externally collected real-document subset, all sensitive business information is desensitized and the samples are used only as a preliminary external validation set, not for model training.
Annotation consistency is checked through automatic label generation followed by manual sampling verification. For real or manually corrected samples, field values are checked against the original document image and inconsistent labels are resolved before evaluation. The evaluation reports document-level extraction accuracy, field-level accuracy, ANLS, JSON compliance, Precision@1, Recall@N, MRR, and latency where applicable. Key experiments are repeated with multiple random seeds, and the results are reported as mean and standard deviation. Latency is measured under the same 24 GB VRAM hardware constraint and includes the corresponding extraction, retrieval, and re-ranking stages unless otherwise specified.
7.2. QLoRA Fine-Tuning Experiment
This experiment was conducted on two servers equipped with NVIDIA RTX 4090 (24 GB VRAM) graphics cards. To achieve efficient fine-tuning with limited video memory, this paper adopted QLoRA together with the DeepSpeed ZeRO-2 optimization strategy. By quantizing the main model to 4-bit and injecting trainable low-rank matrices (LoRA Rank = 16, Alpha = 32) into all linear projection layers, the number of trainable parameters was compressed to less than 1.5% of the total parameters, enabling full-process gradient updates under the constraint of only 24 GB of video memory per card, as shown in Table 2. ZeRO-2 partitions gradients and optimizer states across the two GPUs, which improved throughput and reduced the training time for 6000 high-resolution documents to 4.5 h.
7.3. Structured Extraction Evaluation
To comprehensively evaluate the effectiveness of the method proposed in this paper in the task of logistics document information extraction, multiple sets of comparative experiments were designed. The experiments primarily revolved around two dimensions: firstly, verifying the performance gains of supervised fine-tuning compared to traditional rule-based methods and native large models; secondly, exploring the performance of models with different parameter scales in specific business scenarios.
7.3.1. Evaluation Parameters of the Experiment
JSON format compliance rate: whether the output is valid, parseable JSON.
Key information extraction accuracy: the proportion of samples in which core fields such as “model”, “supplier”, “product name”, and “quantity” are extracted correctly:

$$\mathrm{Acc} = \frac{n_{\mathrm{correct}}}{n}$$

where $n_{\mathrm{correct}}$ is the number of samples in which the extracted field content is consistent with the labeled answer, and $n$ is the total number of evaluated samples.
Average Normalized Levenshtein Similarity (ANLS): measures the character-level similarity between the predicted text and the ground-truth text; a higher value indicates higher similarity. For the $i$th of $n$ samples, the normalized similarity is calculated from the edit distance:

$$\mathrm{NLS}_i = 1 - \frac{\mathrm{Lev}(p_i, g_i)}{L_i}$$

where $p_i$ is the text string predicted by the model, $g_i$ is the ground-truth label text string, $\mathrm{Lev}(p_i, g_i)$ is the Levenshtein edit distance between them, i.e., the minimum number of single-character insertions, deletions, or substitutions required to transform $p_i$ into $g_i$, and $L_i = \max(|p_i|, |g_i|)$ is the maximum length of $p_i$ and $g_i$. ANLS is the average similarity over all $n$ samples:

$$\mathrm{ANLS} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{NLS}_i$$
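The metric can be sketched in pure Python as follows. This version follows the definitions in the text, one minus the edit distance divided by the longer string length, averaged over samples. Some published ANLS variants additionally zero out low scores below a threshold; that step is not mentioned here and is therefore omitted. Treating two empty strings as a perfect match is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def anls(predictions, references):
    """Mean normalized Levenshtein similarity over paired strings."""
    total = 0.0
    for p, g in zip(predictions, references):
        longest = max(len(p), len(g))
        total += 1.0 if longest == 0 else 1.0 - levenshtein(p, g) / longest
    return total / len(predictions)


print(anls(["M8x40", "500 mL"], ["M8x40", "500 ml"]))
```

Unlike exact-match accuracy, this score gives partial credit when a prediction differs from the label by a single character, which is useful for diagnosing near-miss recognition errors.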
7.3.2. Experimental Setup
The task is end-to-end extraction from unstructured logistics documents to structured JSON data: given an image I of a logistics inventory document, the objective is to generate a JSON object that conforms to a predefined schema.
The following baselines are compared to cover rule-based extraction, OCR-based sequence labeling, native multimodal LLM inference, and task-adapted multimodal LLM inference. The OCR + rule baseline uses PP-OCRv4 for text recognition and heuristic rules for field assembly. The OCR + sequence-labeling baseline uses PP-OCRv4 followed by a BERT-based entity tagger. The native multimodal LLM baselines use the original Qwen-series models without task-specific fine-tuning. The LoRA-tuned multimodal LLM baselines use the same prompt and schema after instruction fine-tuning on the training set. Specialized systems such as LayoutLM-style layout-aware models, Donut-style OCR-free document models, and lightweight multimodal models such as LLaVA or InternVL are discussed as closely related competitive approaches; however, their direct reproduction under identical logistics SKU alignment conditions requires additional OCR/layout annotations or task-specific adaptation. This limitation is explicitly acknowledged in the discussion.
To further clarify the relationship between the implemented baselines and contemporary specialized document-intelligence methods, Table 3 compares representative methods in terms of OCR-free processing, structured extraction, entity alignment, edge deployment, and experimental reproduction in this study. LayoutLMv3, Donut, and InternVL are included as specialized reference systems because direct reproduction under the same logistics SKU-alignment setting would require additional layout annotations, OCR-level supervision, or task-specific SKU labels. Therefore, they are discussed for methodological positioning, whereas Qwen2.5-VL and RRA-Logis are evaluated under the same schema and 24 GB VRAM hardware constraints.
The predefined schema contains order number, supplier, shipping company, material name, model, specification, quantity, and unit. All compared extraction models use the same schema and field-matching rules to avoid evaluation bias caused by different output formats. The test data for the L-Doc-2K extraction comparison are kept separate from the training and validation subsets.
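A shared schema check of this kind could be sketched as follows. The English key names and the flat single-object layout are assumptions made for illustration; the paper lists the fields but not the exact JSON keys or nesting. The sketch separates plain JSON parseability, which is how the compliance rate is defined in Section 7.3.1, from a stricter schema match.

```python
import json

# Hypothetical key names for the eight schema fields.
SCHEMA_FIELDS = (
    "order_number", "supplier", "shipping_company", "material_name",
    "model", "specification", "quantity", "unit",
)


def try_parse(raw_output):
    """Return (True, object) if the model output is parseable JSON, else (False, None)."""
    try:
        return True, json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False, None


def matches_schema(obj):
    """True if the parsed output is an object with exactly the schema keys."""
    return isinstance(obj, dict) and set(obj) == set(SCHEMA_FIELDS)


def compliance_rates(raw_outputs):
    """Return (JSON compliance rate, schema match rate) over a list of raw outputs."""
    parsed = [try_parse(raw) for raw in raw_outputs]
    n = len(raw_outputs)
    json_rate = sum(ok for ok, _ in parsed) / n
    schema_rate = sum(ok and matches_schema(obj) for ok, obj in parsed) / n
    return json_rate, schema_rate


good = json.dumps({field: "x" for field in SCHEMA_FIELDS})
truncated = good[:-1]                        # missing closing brace
missing_keys = json.dumps({"supplier": "x"})
print(compliance_rates([good, truncated, missing_keys]))
```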
The OCR-based baseline uses PaddleOCR (PP-OCRv4) for text detection and recognition. The OCR output is sorted and concatenated according to coordinates, and a pre-trained BERT-base-Chinese tagger is used for sequence labeling of entities such as date and material name. The recognized entities are then assembled into JSON format through regular expressions and heuristic rules.
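The coordinate-based sorting step of this baseline can be illustrated with a small sketch. The `OcrBox` structure is a hypothetical simplification (real OCR engines typically return quadrilateral boxes), and the line-grouping tolerance is an assumed heuristic rather than a value from the paper.

```python
from typing import NamedTuple


class OcrBox(NamedTuple):
    text: str
    x: float       # left edge
    y: float       # top edge
    height: float  # box height, used as the line-grouping scale


def to_reading_order(boxes, line_tolerance=0.5):
    """Group boxes into text lines by vertical position, then sort each line left to right."""
    lines = []
    for box in sorted(boxes, key=lambda b: b.y):
        anchor = lines[-1][0] if lines else None
        if anchor is not None and abs(box.y - anchor.y) <= line_tolerance * anchor.height:
            lines[-1].append(box)   # same text line as the previous box
        else:
            lines.append([box])     # start a new text line
    return [sorted(line, key=lambda b: b.x) for line in lines]


def concatenate(boxes):
    """Join OCR fragments into text that a sequence tagger can consume."""
    return "\n".join(" ".join(b.text for b in line) for line in to_reading_order(boxes))


boxes = [
    OcrBox("Qty: 10", 200, 52, 20),
    OcrBox("Supplier: ACME", 10, 10, 20),
    OcrBox("Model: M8", 10, 50, 20),
]
print(concatenate(boxes))
```

Grouping by line before sorting by the horizontal coordinate keeps fields that sit side by side on the page adjacent in the text, which matters for documents with multi-column layouts.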
The fine-tuned Qwen2.5-VL-7B model and the native Qwen2.5-VL-7B model use the same unified prompt schema. This avoids a schema mismatch across experiments and ensures that accuracy differences reflect model capability and fine-tuning rather than output-format differences.
Because the actual deployment environment (such as edge devices or local servers) limits the available computing power, this paper further examines the impact of model parameter count on performance. Experiments were conducted with the Qwen2.5-VL series (7B, 3B) and the Qwen3-VL series (2B, 4B) for side-by-side comparison.
7.3.3. Results and Analysis
The results show that instruction fine-tuning substantially improves structured extraction. The Qwen2.5-VL-7B-LoRA model achieves the highest document-level extraction accuracy of 85.4% and a JSON compliance rate of 100%. Field-level metrics show that supplier and quantity fields are easier to extract than fine-grained model specifications, because model strings often contain visually similar characters, numbers, and units. The Qwen3-VL-4B-LoRA model obtains lower extraction accuracy (75.4% versus 85.4%) but also lower inference latency, which makes it the more practical option for edge deployment. Therefore, the 7B model is selected when accuracy is the primary requirement, whereas the 4B model is recommended for resource-constrained local deployment.
Figure 5 presents a comparison of large model capabilities.
The gap between the base and fine-tuned models arises primarily because the base model struggles to capture complex business logic and field constraints. For instance, when processing logistics shipping labels, the base model may fail to distinguish “origin” from “destination” based on spatial location. By contrast, when extracting fields such as “number of goods”, a fine-tuned model is more sensitive to the distinctive document layouts of different suppliers and prioritizes key fields such as “quantity” and “model”. It can also draw on prior knowledge learned from the training data.
Figure 6 compares the full-order accuracy of models with different parameter sizes.
Errors in the model field are the dominant error type. Model strings mix letters, digits, and Chinese characters, which makes them harder to recognize. For example, “M8 × 20 carbon steel bolt” may be recognized as “M8 × 25 carbon steel bolt”. The model is not sensitive to subtle length ratios or thread density in images, so accuracy on the model field is usually lower than on the supplier field. Because supplier fields consist entirely of Chinese characters, their recognition accuracy is generally higher.
Table 4 presents the information extraction accuracy of different methods on the test set.
Furthermore, fine-tuning substantially reduces output-format instability. In the reported test set, all fine-tuned models achieve 100% JSON compliance, while the base models still show approximately 1% to 2% format errors. This improves the reliability of programmatic parsing of model outputs.
Table 5 presents the field-level accuracy comparison of different methods on the test set.
Table 6 presents the performance comparison of different retrieval methods on the retrieval test dataset.
7.4. Material Alignment Evaluation
To verify the effectiveness of the proposed “two-stage semantic alignment strategy” in handling heterogeneous data mapping, a specialized key-value alignment benchmark dataset was constructed, and the method was compared with traditional string matching algorithms and vector retrieval alone, as shown in Figure 7.
In addition to top-1 accuracy, the retrieval component should be evaluated using Precision@1, Precision@5, Recall@10, MRR, and average latency. Precision@1 reflects whether the first returned SKU is correct, Recall@10 measures whether the correct SKU is included in the candidate set, and MRR evaluates the rank position of the correct SKU. These metrics provide a more complete evaluation of the coarse-to-fine retrieval process than a single accuracy value.
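These ranking metrics can be computed as in the following sketch, which assumes exactly one correct SKU per query. The function names and the toy rankings are illustrative and do not come from the study.

```python
def precision_at_k(rankings, gold, k):
    """Mean fraction of the top-k results that are the correct SKU."""
    return sum(ranked[:k].count(g) / k for ranked, g in zip(rankings, gold)) / len(gold)


def recall_at_k(rankings, gold, k):
    """Fraction of queries whose correct SKU appears in the top-k candidates."""
    return sum(g in ranked[:k] for ranked, g in zip(rankings, gold)) / len(gold)


def mean_reciprocal_rank(rankings, gold):
    """Mean of 1 / rank of the correct SKU, counting 0 when it is not returned."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Toy example: three queries whose correct SKU is "A" in every case.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "D"]]
gold = ["A", "A", "A"]
print(precision_at_k(rankings, gold, 1))     # only the first query has "A" on top
print(recall_at_k(rankings, gold, 3))        # "A" is recalled for two of three queries
print(mean_reciprocal_rank(rankings, gold))  # (1 + 1/2 + 0) / 3
```

With a single correct SKU per query, Precision@5 is bounded by 0.2, so Recall@10 and MRR are the more informative measures of candidate-set quality in that setting.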
This paper constructs a key-value alignment benchmark dataset based on text-text modality. Unlike traditional visual datasets, this dataset focuses on entity alignment tasks at the semantic level. The keys are composed of non-standard, unstructured inbound descriptions collected from real business scenarios (including abbreviations, aliases, and noise), while the values correspond to standardized unique product records in the inventory database. This dataset aims to quantitatively evaluate the accuracy and robustness of systems in mapping heterogeneous inbound queries to standard SKU data.
The two-stage retrieval framework is compared with the following three representative methods:
(1) Edit Distance Retrieval Method: Calculates the minimum number of edit operations between the query term and the standard name. It represents traditional fuzzy string matching.
(2) Word Vector Retrieval Method: Extracts sentence vectors using the pre-trained BERT-base model and ranks candidates by cosine similarity.
(3) Direct Large Model Prompt Method: Inputs the entire inventory list into the prompt and lets the LLM find matches from memory or within the context.
Experimental Analysis
The experimental results show that the traditional edit-distance method achieves an accuracy of 34.5%, indicating that character-level matching is insufficient for non-literal cases such as aliases and abbreviations. The word-vector retrieval method improves the accuracy to 68.3%, but it still has limitations in modeling domain-specific rules such as material specifications and units. In comparison, the proposed two-stage retrieval method with the 7B reranker reaches 92.8% accuracy. This suggests that vector-based candidate recall followed by LLM-based verification can better handle complex semantic alignment cases.
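The coarse-to-fine idea can be illustrated with a toy sketch: cosine similarity recalls a short candidate list, and a second-stage verifier picks the final SKU. The three-dimensional embeddings, the SKU identifiers, and `toy_reranker` are all hypothetical. In the proposed system the second stage is an LLM, which is replaced here by a simple string rule so that the example runs without a model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_candidates(query_vec, sku_vecs, top_n=10):
    """Stage 1: keep the top-N SKUs by cosine similarity to the query embedding."""
    ranked = sorted(sku_vecs, key=lambda sku: cosine(query_vec, sku_vecs[sku]), reverse=True)
    return ranked[:top_n]


def two_stage_match(query_text, query_vec, sku_vecs, rerank, top_n=10):
    """Stage 2: let a verifier choose among the recalled candidates."""
    return rerank(query_text, recall_candidates(query_vec, sku_vecs, top_n))


def toy_reranker(query_text, candidates):
    """Stand-in for the LLM verifier: prefer the SKU whose size token appears in the query."""
    compact = query_text.lower().replace(" ", "")
    for sku in candidates:
        if sku.split("-")[-1].lower() in compact:
            return sku
    return candidates[0]


sku_vecs = {
    "BOLT-M8x20": [0.90, 0.10, 0.0],
    "BOLT-M8x25": [0.95, 0.05, 0.0],
    "PLATE-Q235": [0.00, 0.10, 0.9],
}
query_text = "carbon steel bolt M8 x 20"
query_vec = [0.94, 0.06, 0.0]  # toy embedding that lands closer to the wrong bolt

print(recall_candidates(query_vec, sku_vecs, top_n=2))
print(two_stage_match(query_text, query_vec, sku_vecs, toy_reranker, top_n=2))
```

In this toy case vector similarity alone ranks the wrong bolt first, while the verification stage recovers the correct one because it is still in the candidate set.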
After introducing large model inference, the retrieval time increased from millisecond level (10–25 ms) to hundred-millisecond level (400–500 ms). To explore the feasibility of lightweight deployment, this paper tested models with different parameter counts.
The 7B model provides the highest accuracy (92.8%) but also the longest latency, at about 500 ms per query. It is therefore suited to server-side deployment in scenarios where high accuracy is required.
As the number of parameters decreases, the accuracy rate exhibits a step-like decline (78–85%), but the inference speed increases. Specifically, the 4B model maintains a high accuracy rate of 85% while reducing some computational overhead compared to the 7B model. In a constrained computing environment, using the 3B or 4B model in conjunction with vector retrieval can achieve superior response speed while ensuring basic business availability (accuracy rate > 80%), validating the potential of the “coarse-to-fine” strategy for deployment on the edge.
7.5. Evaluation on Public Dataset
To avoid evaluating the proposed framework only on the self-constructed L-Doc-2K dataset, an additional experiment was conducted on the public DocILE benchmark. DocILE is a large-scale business document understanding dataset designed for Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR). It contains real business documents such as invoices and orders, and therefore provides a suitable public benchmark for evaluating the generalization ability of RRA-Logis in document information extraction and line-item-level material understanding tasks.
In this experiment, the official DocILE validation split was used as the external public test set. Since DocILE does not provide an enterprise SKU master database, the public dataset experiment evaluates the structured information extraction and line-item recognition components rather than the logistics-specific SKU alignment module. Predictions are converted into DocILE-compatible field and line-item outputs, and the reported KILE-F1 and LIR-F1 follow the DocILE task definitions. This setting makes the public-dataset evaluation complementary to, rather than a replacement for, the L-Doc-2K material-matching evaluation.
The evaluation was conducted under two settings. In the first setting, the model trained on L-Doc-2K was directly tested on DocILE without using any DocILE training samples. This setting was used to evaluate zero-shot cross-domain generalization. In the second setting, a small number of DocILE training samples were converted into image-instruction-JSON triples and used for lightweight instruction adaptation with LoRA. This setting was used to evaluate whether the proposed framework can be efficiently transferred to a public business document benchmark.
The following methods were compared: OCR + rule-based extraction, native Qwen2.5-VL-7B, RRA-Logis trained only on L-Doc-2K, and RRA-Logis with additional DocILE instruction adaptation. The evaluation metrics include KILE-F1, LIR-F1, ANLS, JSON compliance rate, and average inference latency.
Table 7 presents the public dataset evaluation results on DocILE.
The results show that the proposed method generalizes better than OCR-based extraction and the native multimodal large model on the public DocILE benchmark. When directly transferred from L-Doc-2K to DocILE, RRA-Logis achieves a KILE-F1 of 68.7% and an LIR-F1 of 60.5%, indicating that the structured prompting and logistics-domain fine-tuning strategy still provides useful extraction capability under a different business document distribution. However, the performance is lower than that on L-Doc-2K, which reflects the domain gap between logistics documents and general business documents.
After lightweight instruction adaptation using a small number of DocILE samples, the performance improves to 78.9% KILE-F1 and 72.4% LIR-F1. This result suggests that RRA-Logis can be transferred to public business document scenarios with limited additional fine-tuning. At the same time, because DocILE differs from logistics handover documents and does not include SKU master-data matching, this experiment should be interpreted as evidence for cross-domain extraction transfer rather than full validation of the complete logistics matching workflow.
Taken together, the two evaluations are complementary: L-Doc-2K evaluates logistics-specific material matching, while DocILE serves as an external public benchmark for the structured extraction and line-item recognition components of RRA-Logis and for cross-domain generalization on a public business document dataset.
7.6. Robustness and External Validity Analysis
To further evaluate the robustness and external validity of RRA-Logis, additional experiments were conducted under controlled perturbation and cross-distribution settings. The purpose of these experiments was to examine whether the proposed framework can maintain stable performance when document quality, layout distribution, and data source deviate from the standard L-Doc-2K test set.
First, controlled perturbations were applied to the held-out test set. Seven common degradation conditions in logistics scenarios were considered: Gaussian noise, motion blur, low illumination, reduced resolution, partial occlusion, reflection, and wrinkles or stains. Gaussian noise was used to simulate camera sensor noise and low-quality scanning. Motion blur was introduced to represent handheld image acquisition. Low illumination and reflection were used to simulate warehouse environments with unstable lighting. Reduced resolution was used to represent low-end cameras or compressed images. Partial occlusion, wrinkles, and stains were used to simulate physical damage or contamination of paper documents during circulation.
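A few of these perturbations can be sketched in pure Python on a grayscale image stored as a list of pixel rows. The parameter values (noise level, darkening factor, occlusion size, downsampling step) are illustrative assumptions; the paper does not report the settings it used, and a practical pipeline would more likely rely on an image library.

```python
import random


def _clip(value):
    """Round and clamp a pixel value to the 0-255 range."""
    return max(0, min(255, int(round(value))))


def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Simulate sensor noise by adding zero-mean Gaussian noise to every pixel."""
    rng = random.Random(seed)
    return [[_clip(px + rng.gauss(0.0, sigma)) for px in row] for row in img]


def lower_illumination(img, factor=0.4):
    """Simulate poor lighting by scaling all intensities down."""
    return [[_clip(px * factor) for px in row] for row in img]


def occlude(img, top, left, height, width, fill=0):
    """Simulate partial occlusion by painting a filled rectangle."""
    out = [row[:] for row in img]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


def reduce_resolution(img, step=2):
    """Simulate a low-end camera by keeping every step-th pixel in each direction."""
    return [row[::step] for row in img[::step]]


image = [[200] * 4 for _ in range(4)]  # tiny uniform test image
print(lower_illumination(image)[0])
print(occlude(image, 1, 1, 2, 2)[1])
print(len(reduce_resolution(image)), len(reduce_resolution(image)[0]))
```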
The robustness evaluation used the same metrics as the main experiment, including document-level extraction accuracy, JSON compliance rate, ANLS, entity alignment Precision@1, and MRR. The clean test set was used as the reference condition.
Table 8 presents the robustness results.
The results show that RRA-Logis maintains relatively stable structured output under moderate image degradation. The JSON compliance rate remains above 98.7% under all perturbation conditions, indicating that structured prompting and instruction fine-tuning effectively constrain the output format. However, the extraction accuracy and alignment accuracy decrease under severe visual degradation. The largest performance drops occur under partial occlusion and reflection, because these perturbations directly damage key visual fields or introduce local high-brightness regions that interfere with character recognition. Motion blur and low illumination also reduce performance, especially for fine-grained model specifications containing visually similar digits, letters, and units.
Second, cross-template evaluation was conducted to test whether the model generalizes to layouts not observed during training. In this setting, several document templates were excluded from the training set and used only for testing. In addition, the real-document subset contains 200 desensitized logistics documents collected from operational inbound scenarios, covering five document templates and 1120 line items. Enterprise names, customer information, amounts, addresses, and other sensitive fields were removed or replaced, while task-relevant fields such as material name, model, specification, quantity, and unit were retained. All labels in this subset were manually checked against the original document images, and the subset was used only for external validation, not for training or parameter tuning.
Table 9 presents the external validity evaluation on cross-template and real-document subsets.
The cross-template and real-document results are lower than those of the standard synthetic test set, confirming that the Sim-to-Real generation process cannot fully cover the complexity of real logistics documents. Nevertheless, the model remains usable under these more difficult conditions, which provides preliminary evidence of external validity. Because the real-document subset is still limited in scale, this result should not be interpreted as full production-scale validation. Larger independent real-world datasets are required to further assess generalization across companies, regions, document standards, and long-term data drift.
Potential bias may arise from the selected document templates, perturbation types, material-description patterns, and human feedback samples. To mitigate these risks, the system records low-confidence cases and manually corrected samples during operation. These samples are then desensitized, verified, and incorporated into the incremental training set. In this way, the framework can gradually reduce the distribution gap between synthetic benchmark data and real operational data.
7.7. Statistical Analysis and Sensitivity Study
To improve the reliability of the empirical conclusions, the main experiments were repeated with three random seeds. The results are reported as mean ± standard deviation. In addition, bootstrap resampling was used to estimate the 95% confidence interval for key metrics on the test set. This setting reduces the possibility that the reported improvements are caused by random initialization or sample variation.
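Both procedures can be sketched with the standard library. The sketch assumes a percentile bootstrap over per-document correctness indicators, which the paper does not specify, and all input values below are synthetic placeholders rather than results from the study.

```python
import random
import statistics


def mean_std(values):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


def bootstrap_ci(outcomes, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sample outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Synthetic placeholders: accuracy from three seeds, and 1000 per-document outcomes.
seed_scores = [0.80, 0.82, 0.84]
outcomes = [1] * 854 + [0] * 146

mean, std = mean_std(seed_scores)
low, high = bootstrap_ci(outcomes)
print(f"{mean:.3f} ± {std:.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

The seed-level standard deviation captures variation from random initialization, while the bootstrap interval captures variation from the finite test set, so the two numbers answer different questions.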
Table 10 presents the statistical stability of the main methods.
The results show that the proposed method is stable across different random seeds. The standard deviation of Qwen2.5-VL-7B-LoRA remains below 1% for extraction accuracy, indicating that the performance improvement is not caused by a single random run. The two-stage retrieval method also shows stable alignment performance, while its latency is higher than that of vector retrieval alone because the LLM re-ranking stage requires additional semantic reasoning.
A sensitivity study was further conducted to analyze the influence of candidate size, fusion weight, confidence threshold, and LoRA rank.
Table 11 presents the sensitivity analysis of candidate size N.
The candidate size N affects the trade-off between recall and latency. When N increases from 5 to 10, alignment accuracy improves because the correct SKU is more likely to be included in the candidate set. However, when N increases from 10 to 20 or 30, the additional improvement becomes limited, while inference latency increases noticeably. Therefore, N = 10 is selected as the default setting.
The fusion weight controls the balance between vector similarity and LLM reasoning confidence. A small fusion weight gives more weight to the LLM reranker, which may improve semantic reasoning but can be affected by hallucination in ambiguous cases. A large fusion weight gives more weight to vector similarity, which improves retrieval stability but weakens fine-grained semantic discrimination. The best performance is obtained at a fusion weight of 0.5, indicating that both vector similarity and LLM reasoning contribute to the final alignment decision, as shown in Table 12.
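The role of the fusion weight can be illustrated under the assumption that the two signals are combined as a convex combination. This form is consistent with the behavior described above but is not restated in this section, so it should be read as an assumption. The candidate scores are made-up values.

```python
def fused_score(vector_similarity, llm_confidence, fusion_weight=0.5):
    """Assumed form: weight * vector similarity + (1 - weight) * LLM confidence."""
    if not 0.0 <= fusion_weight <= 1.0:
        raise ValueError("fusion_weight must lie in [0, 1]")
    return fusion_weight * vector_similarity + (1.0 - fusion_weight) * llm_confidence


def rank_candidates(candidates, fusion_weight=0.5):
    """Sort (sku, vector_similarity, llm_confidence) tuples by fused score, best first."""
    return sorted(candidates,
                  key=lambda c: fused_score(c[1], c[2], fusion_weight),
                  reverse=True)


# Made-up scores: SKU-A looks closer in embedding space, SKU-B is preferred by the verifier.
candidates = [("SKU-A", 0.90, 0.30), ("SKU-B", 0.80, 0.90)]
for weight in (0.0, 0.5, 1.0):
    print(weight, rank_candidates(candidates, weight)[0][0])
```

At a weight of 1.0 the ranking reduces to pure vector retrieval and at 0.0 to pure LLM judgment, which matches the two failure modes described in the text.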
The confidence threshold τ determines how many low-confidence samples are routed to human verification. A lower threshold reduces manual workload but allows more uncertain predictions to enter downstream systems. A higher threshold improves post-verification accuracy but increases human review cost. In this study, τ = 0.70 provides a reasonable balance between automation and reliability, as shown in Table 13.
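The gating rule can be sketched as a simple threshold test. Treating a confidence equal to the threshold as accepted is an assumption, and the confidence values are illustrative only.

```python
def route(confidence, tau=0.70):
    """Send a prediction downstream automatically or to a human reviewer."""
    return "auto_accept" if confidence >= tau else "human_review"


def review_rate(confidences, tau=0.70):
    """Fraction of predictions that would be routed to human verification."""
    return sum(c < tau for c in confidences) / len(confidences)


confidences = [0.95, 0.72, 0.69, 0.40]  # illustrative values
for tau in (0.50, 0.70, 0.90):
    print(tau, review_rate(confidences, tau))
```

Raising the threshold from 0.50 to 0.90 on this toy list triples the share of items sent to review, which is the workload side of the trade-off described above.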
The LoRA rank determines the adaptation capacity of the fine-tuning module, as shown in Table 14. Rank 8 is more efficient but underfits logistics-specific layouts and terminology. Rank 32 slightly improves extraction accuracy but increases memory cost and training time. Rank 16 achieves a better trade-off between accuracy and efficiency and is therefore used as the default setting.
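A rough parameter count shows why the rank affects memory and training cost. For one adapted weight matrix with input size d_in and output size d_out, LoRA adds two low-rank factors with r × (d_in + d_out) trainable parameters in total, so the cost grows linearly with the rank. The hidden size and the number of adapted matrices below are hypothetical placeholders, not the actual configuration of the Qwen models.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 4096         # hypothetical
ADAPTED_MATRICES = 4 * 32  # hypothetical: four projections in each of 32 layers
LORA_ALPHA = 32            # Table 2

for rank in (8, 16, 32):
    total = ADAPTED_MATRICES * lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    scaling = LORA_ALPHA / rank  # standard LoRA scaling factor alpha / r
    print(f"rank={rank:2d}  trainable params={total:,}  scaling={scaling}")
```

Doubling the rank from 16 to 32 doubles the adapter parameters and the associated optimizer state, which is consistent with the higher memory cost and training time reported for rank 32.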
Overall, the robustness, external validity, statistical, and sensitivity analyses indicate that RRA-Logis performs consistently under the standard test setting and remains relatively stable under several document degradation and hyperparameter settings. At the same time, the decline in cross-template and real-document subsets indicates that continuous data collection and real-world validation remain necessary for industrial deployment.
8. Conclusions, Limitations, and Future Work
This study proposed RRA-Logis, a confidence-gated and resource-constrained framework for logistics document-to-entity alignment. The core methodological contribution is not the isolated use of LoRA/QLoRA instruction fine-tuning, vector retrieval, or LLM re-ranking, but the organization of these components into a decision mechanism that couples schema-constrained visual extraction, vector candidate recall, LLM-based semantic verification, margin-based uncertainty control, and human verification. This mechanism enables the system to allocate computation and manual review according to prediction confidence and semantic ambiguity under edge-computing constraints.
The 7B model obtains the best extraction and alignment accuracy in the reported experiments, making it suitable for scenarios where accuracy is prioritized and sufficient GPU resources are available. The 4B model provides a more favorable balance between accuracy and inference latency, making it more appropriate for edge-side deployment. This distinction clarifies the difference between the best-performing model and the recommended deployment model.
The study still has several limitations. First, L-Doc-2K contains a substantial proportion of Sim-to-Real samples, and the gap between synthetic and real operational data cannot be fully eliminated. Second, although DocILE is used as a public benchmark for structured extraction and line-item recognition, it does not evaluate logistics-specific SKU master-data alignment. Third, the current comparison does not fully reproduce all contemporary specialized document intelligence systems under identical resource-constrained settings, which remains a direction for future benchmarking. Fourth, large-scale production deployment, long-term data drift, cumulative batch-processing errors, and operational cost analysis require further investigation. Finally, the LLM re-ranking stage improves semantic matching but still requires stronger interpretability and reliability analysis for high-risk industrial decisions.
Future research will focus on larger real-world datasets, transfer to other document domains such as finance, healthcare, legal technology, public administration, education, and manufacturing, adaptive model cascades and distillation, uncertainty quantification, explainability, multilingual document standards, and integration with ERP, WMS, and traceability platforms.
Author Contributions
Conceptualization, W.W. and M.L.; methodology, M.L. and W.W.; software, L.Y., D.L., S.Z. and L.K.; validation, F.K. and W.W.; formal analysis, M.L. and F.K.; investigation, L.Y., D.L., S.Z. and L.K.; resources, L.Y., D.L., S.Z. and L.K.; data curation, L.Y., D.L., S.Z. and L.K.; writing (original draft preparation), F.K. and W.W.; writing (review and editing), W.W., M.L. and F.K.; visualization, F.K.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (grant number 62273204) and China Railway Corporation (grant number 913700001630559891202326).
Data Availability Statement
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to commercial confidentiality and privacy restrictions.
Conflicts of Interest
Authors Lunlei Yang, Dongsheng Li, Shuaichao Zheng, and Lingzheng Kong were employed by China Railway 14th Bureau Group Equipment Corporation Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Figure 1. Technology Roadmap.
Figure 2. Recognition of fine-tuned models.
Figure 3. Photos of the generated dataset.
Figure 4. Human–machine collaborative data feedback.
Figure 5. Comparison of Large Model Capabilities.
Figure 6. Comparison of Model Full-Order Accuracy under Different Parameter Sizes.
Figure 7. Two-Stage Retrieval.
Table 1. Research gap and positioning against related work.
| Research Direction | Representative Methods | Main Focus | Difference from This Work |
|---|---|---|---|
| Multimodal document understanding | LayoutLM, LayoutLMv3 | Layout-aware document field extraction | Does not address enterprise SKU master-data matching after extraction. |
| OCR-free document extraction | Donut, Pix2Struct | Image-to-structured-text generation | Usually focuses on extraction and does not model material entity alignment. |
| Multimodal large language models | Qwen-VL, InternVL, LLaVA | General visual-language understanding | Lacks logistics-specific constraints and confidence-gated verification. |
| Semantic entity alignment | DPR, ColBERT, SPLADE | Text-based retrieval and entity matching | Typically assumes clean text input rather than noisy document-image extraction. |
| RRA-Logis (this study) | Schema-constrained extraction and confidence-gated semantic alignment | Logistics document-to-entity alignment | Jointly considers visual noise, SKU alignment, uncertainty control, and edge deployment. |
Table 2. Training parameter settings.
| Training Parameter | Value |
|---|---|
| Learning rate | 1.0 × 10⁻⁴ |
| Batch size | 32 |
| Epochs | 5 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
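For orientation, the rank and alpha values above can be related through the standard LoRA update rule. This is a sketch under the usual formulation, in which the low-rank update is scaled by α/r; the table does not say which scaling convention the training framework applied, so the resulting factor is an assumption. Here W₀ is a frozen pretrained weight matrix of shape d × k, and B and A are the trainable low-rank factors.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2
```

Under this convention, each adapted matrix adds r(d + k) = 16(d + k) trainable parameters.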
Table 3. Capability comparison with specialized document intelligence and multimodal information extraction methods.
| Method | OCR-Free | Structured Extraction | Entity Alignment | Edge Deployment Considered | Reproduced in This Study |
|---|---|---|---|---|---|
| LayoutLMv3 | No | Yes | No | No | Discussed |
| Donut | Yes | Yes | No | Partially | Discussed |
| InternVL | Yes | Partially | No | Depends on model size | Discussed |
| Qwen2.5-VL | Yes | Yes | No | Partially | Experimental baseline |
| RRA-Logis | Yes | Yes | Yes | Yes | Proposed and evaluated |
Table 4. Information extraction accuracy of different methods on the test set.
| Method | Information Extraction Accuracy | ANLS |
|---|---|---|
| OCR + regular expression | 35.90% | 0.75 |
| Qwen2.5-VL-7B | 50.65% | 0.93 |
| Qwen2.5-VL-3B | 41.70% | 0.91 |
| Qwen3-VL-2B | 40.70% | 0.91 |
| Qwen3-VL-4B | 44.70% | 0.91 |
| Qwen2.5-VL-7B-LoRA | 85.40% | 0.98 |
| Qwen2.5-VL-3B-LoRA | 73.60% | 0.97 |
| Qwen3-VL-2B-LoRA | 73.90% | 0.97 |
| Qwen3-VL-4B-LoRA | 75.40% | 0.97 |
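The ANLS column can be read alongside the following minimal sketch of Average Normalized Levenshtein Similarity. It assumes the commonly used definition with a 0.5 cutoff and case-insensitive comparison; the function names `levenshtein` and `anls` are illustrative, and the paper's exact normalization and threshold are not stated in this section.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over samples of the best similarity against any reference.

    predictions: list of predicted strings.
    references: list of lists of acceptable ground-truth strings.
    threshold: assumed cutoff; a normalized distance at or above it scores 0.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            distance = levenshtein(p, r) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Because ANLS gives partial credit for near-miss strings, a model can score a high ANLS while its exact-match extraction accuracy stays much lower, which is the pattern the non-fine-tuned rows of Table 4 show.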
Table 5. Accuracy comparison of different methods on the test set.
| Method | Model Accuracy | Supplier Accuracy | Quantity Accuracy | JSON Compliance | Name Accuracy |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 65.00% | 99.65% | 74.00% | 99.10% | 74.00% |
| Qwen2.5-VL-3B | 50.70% | 99.40% | 62.00% | 98.80% | 62.00% |
| Qwen3-VL-2B | 51.15% | 99.40% | 62.50% | 98.20% | 62.50% |
| Qwen3-VL-4B | 52.20% | 99.45% | 63.10% | 97.70% | 63.10% |
| Qwen2.5-VL-7B-LoRA | 85.65% | 99.85% | 86.60% | 100.00% | 86.60% |
| Qwen2.5-VL-3B-LoRA | 78.90% | 99.65% | 82.10% | 100.00% | 82.10% |
| Qwen3-VL-2B-LoRA | 78.60% | 99.65% | 80.10% | 100.00% | 80.10% |
| Qwen3-VL-4B-LoRA | 80.15% | 99.70% | 81.20% | 100.00% | 81.20% |
Table 6. Performance comparison of different retrieval methods on the retrieval test dataset.
| Method | Core Mechanism | Accuracy |
|---|---|---|
| Edit distance search method | Character-level edit distance | 34.5% |
| Word vector retrieval method | Semantic vector matching | 68.3% |
| Two-stage retrieval method (7B) | Semantic vector and prompt word | 92.8% |
| Two-stage retrieval method (4B) | Semantic vector and prompt word | 85% |
| Two-stage retrieval method (3B) | Semantic vector and prompt word | 82% |
| Two-stage retrieval method (2B) | Semantic vector and prompt word | 78% |
Table 7. Public dataset evaluation on DocILE.
| Method | KILE-F1 (%) | LIR-F1 (%) | ANLS | JSON Compliance (%) | Latency (ms) |
|---|---|---|---|---|---|
| OCR + rule-based extraction | 38.6 | 31.4 | 0.721 | 87.5 | 64 |
| Native Qwen2.5-VL-7B | 61.8 | 54.2 | 0.856 | 97.9 | 432 |
| RRA-Logis trained on L-Doc-2K only | 68.7 | 60.5 | 0.884 | 98.6 | 448 |
| RRA-Logis + DocILE adaptation | 78.9 | 72.4 | 0.923 | 99.4 | 461 |
Table 8. Robustness performance under different document degradation conditions.
| Test Condition | Extraction Accuracy (%) | JSON Compliance (%) | ANLS | Alignment P@1 (%) | MRR |
|---|---|---|---|---|---|
| Clean test set | 85.4 | 100.0 | 0.981 | 92.8 | 0.956 |
| Gaussian noise | 81.7 | 99.6 | 0.967 | 89.4 | 0.931 |
| Motion blur | 78.9 | 99.2 | 0.954 | 86.8 | 0.907 |
| Low illumination | 80.1 | 99.4 | 0.958 | 87.6 | 0.915 |
| Reduced resolution | 80.8 | 99.5 | 0.962 | 88.3 | 0.922 |
| Partial occlusion | 75.6 | 98.7 | 0.936 | 83.1 | 0.884 |
| Reflection | 76.8 | 98.9 | 0.942 | 84.0 | 0.891 |
| Wrinkles/stains | 79.5 | 99.1 | 0.951 | 86.2 | 0.902 |
Table 9. External validity evaluation on cross-template and real-document subsets.
| Test Setting | Extraction Accuracy (%) | JSON Compliance (%) | ANLS | Alignment P@1 (%) | MRR |
|---|---|---|---|---|---|
| In-template synthetic test set | 85.4 | 100.0 | 0.981 | 92.8 | 0.956 |
| Cross-template test set | 77.9 | 99.1 | 0.948 | 85.7 | 0.901 |
| Real-document subset | 74.6 | 98.5 | 0.931 | 82.4 | 0.873 |
Table 10. Statistical stability of the main methods.
| Method | Extraction Accuracy (%) | ANLS | JSON Compliance (%) | Alignment P@1 (%) | Latency (ms) |
|---|---|---|---|---|---|
| OCR + rule-based method | 35.9 ± 0.4 | 0.750 ± 0.006 | 91.8 ± 0.7 | 34.5 ± 0.5 | 38 ± 4 |
| Native Qwen2.5-VL-7B | 50.7 ± 0.6 | 0.930 ± 0.004 | 99.1 ± 0.3 | - | 421 ± 16 |
| Qwen2.5-VL-7B-LoRA | 85.4 ± 0.6 | 0.981 ± 0.003 | 100.0 ± 0.0 | - | 438 ± 18 |
| Vector retrieval only | - | - | - | 68.3 ± 0.7 | 21 ± 3 |
| Two-stage retrieval, 7B | - | - | - | 92.8 ± 0.8 | 486 ± 21 |
| Two-stage retrieval, 4B | - | - | - | 85.0 ± 0.9 | 312 ± 17 |
Table 11. Sensitivity analysis of candidate size N.
| Candidate Size N | Alignment P@1 (%) | Recall@N (%) | MRR | Latency (ms) |
|---|---|---|---|---|
| 5 | 89.6 | 94.1 | 0.931 | 392 |
| 10 | 92.8 | 97.3 | 0.956 | 486 |
| 20 | 93.1 | 98.0 | 0.958 | 642 |
| 30 | 93.2 | 98.2 | 0.959 | 801 |
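The three ranking metrics reported above (P@1, Recall@N, MRR) can be computed as in the following minimal sketch. It assumes one gold SKU per query and a single ranked candidate list per query; in a two-stage pipeline, Recall@N would typically be measured on the first-stage candidates and P@1 and MRR on the reranked list. The names `rank_of_gold` and `retrieval_metrics` are illustrative, not taken from the paper.

```python
def rank_of_gold(ranked_ids, gold_id):
    """Return the 1-based rank of the gold item, or None if it is absent."""
    try:
        return ranked_ids.index(gold_id) + 1
    except ValueError:
        return None


def retrieval_metrics(rankings, gold_ids, n):
    """Compute P@1, Recall@N, and MRR over a set of queries.

    rankings: one ranked list of candidate ids per query (best first).
    gold_ids: the single correct id for each query.
    n: cutoff used for Recall@N.
    """
    total = len(gold_ids)
    if total == 0:
        return {"p_at_1": 0.0, "recall_at_n": 0.0, "mrr": 0.0}
    hits_at_1 = hits_at_n = 0
    reciprocal_sum = 0.0
    for ranked, gold in zip(rankings, gold_ids):
        rank = rank_of_gold(ranked, gold)
        if rank is None:
            continue  # a missing gold item contributes 0 to every metric
        if rank == 1:
            hits_at_1 += 1
        if rank <= n:
            hits_at_n += 1
        reciprocal_sum += 1.0 / rank
    return {
        "p_at_1": hits_at_1 / total,
        "recall_at_n": hits_at_n / total,
        "mrr": reciprocal_sum / total,
    }
```

Read this way, Table 11 shows Recall@N rising only slightly beyond N = 10 while latency keeps growing, so a larger candidate pool buys little additional P@1.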
Table 12. Sensitivity analysis of fusion weight.
| Fusion Weight | Alignment P@1 (%) | MRR |
|---|---|---|
| 0.3 | 91.4 | 0.944 |
| 0.5 | 92.8 | 0.956 |
| 0.7 | 91.9 | 0.948 |
Table 13. Sensitivity analysis of confidence threshold τ.
| Threshold τ | Manual Review Rate (%) | Alignment P@1 Before Review (%) | Alignment P@1 After Review (%) |
|---|---|---|---|
| 0.60 | 4.8 | 91.7 | 93.0 |
| 0.70 | 8.6 | 92.8 | 95.1 |
| 0.80 | 15.3 | 93.5 | 96.4 |
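The trade-off in the table above, where a higher threshold sends more alignments to manual review, can be illustrated with a minimal gating sketch. It assumes each alignment carries a scalar confidence in [0, 1] and that items scoring below the threshold are routed to a reviewer; the `Alignment` class, the `gate` and `review_rate` functions, and the sample scores are hypothetical, and the paper's actual confidence score and gating rule are not given in this section.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    document_item: str   # item description extracted from the document
    sku_id: str          # candidate SKU chosen by the retrieval stage
    confidence: float    # assumed scalar confidence in [0, 1]


def gate(alignments, tau):
    """Split alignments into auto-accepted and manual-review groups."""
    accepted = [a for a in alignments if a.confidence >= tau]
    review = [a for a in alignments if a.confidence < tau]
    return accepted, review


def review_rate(alignments, tau):
    """Fraction of alignments routed to manual review at threshold tau."""
    if not alignments:
        return 0.0
    _, review = gate(alignments, tau)
    return len(review) / len(alignments)


# Hypothetical sample data, for illustration only.
sample = [
    Alignment("M8 hex bolt", "SKU-001", 0.95),
    Alignment("hex bolt 8mm", "SKU-001", 0.72),
    Alignment("bolt M8x30", "SKU-002", 0.65),
    Alignment("8mm fastener", "SKU-003", 0.55),
]
rates = {tau: review_rate(sample, tau) for tau in (0.60, 0.70, 0.80)}
```

With such a rule the review rate can only stay level or rise as the threshold increases, matching the monotonic trend of the Manual Review Rate column.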
Table 14. Sensitivity analysis of LoRA rank r.
| LoRA Rank r | Extraction Accuracy (%) | ANLS | Trainable Parameter Cost | Training Time |
|---|---|---|---|---|
| 8 | 82.6 | 0.972 | Low | Short |
| 16 | 85.4 | 0.981 | Medium | Moderate |
| 32 | 85.9 | 0.982 | High | Long |