Unstructured Document Information Extraction Method with Multi-Faceted Domain Knowledge Graph Assistance for M2M Customs Risk Prevention and Screening Application

Tian, Fengchun; Wang, Haochen; Wan, Zhenlong; Liu, Ran; Liu, Ruilong; Lv, Di; Lin, Yingcheng

doi:10.3390/electronics13101941


by Fengchun Tian ¹, Haochen Wang ¹, Zhenlong Wan ², Ran Liu ¹, Ruilong Liu ¹, Di Lv ¹ and Yingcheng Lin ¹,*

¹ The School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China

² National Information Center of GACC, Beijing 100010, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(10), 1941; https://doi.org/10.3390/electronics13101941

Submission received: 14 April 2024 / Revised: 10 May 2024 / Accepted: 14 May 2024 / Published: 15 May 2024


Abstract

As a crucial national security defense line, the existing risk prevention and screening system of customs falls short in terms of intelligence and diversity for risk identification factors. Hence, the urgent issues to be addressed in the risk identification system include intelligent extraction technology for key information from Customs Unstructured Accompanying Documents (CUADs) and the reliability of the extraction results. In the customs scenario, OCR is employed for M2M interactions, but current models have difficulty adapting to diverse image qualities and complex customs document content. We propose a hybrid mutual learning knowledge distillation (HMLKD) method for optimizing a pre-trained OCR model’s performance against such challenges. Additionally, current models lack effective incorporation of domain-specific knowledge, resulting in insufficient text recognition accuracy for practical customs risk identification. We propose a customs domain knowledge graph (CDKG) developed using CUAD knowledge and propose an integrated CDKG post-OCR correction method (iCDKG-PostOCR) based on CDKG. The results on real data demonstrate that the accuracies improve for code text fields to 97.70%, for character type fields to 96.55%, and for numerical type fields to 96.00%, with a confidence rate exceeding 99% for each. Furthermore, the Customs Health Certificate Extraction System (CHCES) developed using the proposed method has been implemented and verified at Tianjin Customs in China, where it has showcased outstanding operational performance.

Keywords:

OCR; post-OCR; knowledge distillation; knowledge graph; document information extraction

1. Introduction

With the rapid development of computer technology and artificial intelligence, significant progress has been achieved globally for the intelligent management of societies and the establishment of digital governments. Documents play a critical role in government operations and require urgent digitization and data-driven management. Institutions such as banks and courts have widely adopted intelligent automation processes for document handling [1,2]. However, in the context of customs, which is a crucial aspect of national border security, research and implementation of intelligent automation for document processing are still in the early stages. This deficiency not only affects the efficiency and quality of customs operations but also adds complexity to information security and risk management [3]. Customs document verification is a critical step in ensuring the compliance of imports and exports. By auditing these documents, customs can prevent illegal entry and exit as well as tax evasion activities. Customs can use data analysis to monitor potential threats to national security and help maintain national stability [4].

Customs clearance operations involve various types of documents: specifically, Customs Unstructured Accompanying Documents (CUADs). Despite having templates and format specifications, these documents show variations in image quality and diverse content complexities. Currently, customs document risk prevention and screening heavily relies on the expertise of professionals. As transaction volumes increase, manual inspections become time-consuming and inefficient, which can lead to information security risks. Therefore, it has become imperative for Chinese customs to develop intelligent methods for extracting key information from unstructured customs documents.

The digital transformation of customs requires the implementation of unmanned and unsupervised machine-to-machine (M2M) applications. In the M2M context of customs, high demands are placed on the reliability and accuracy of image text extraction. Meeting these demands ensures consistency and reliability when handling various data sources and, in parallel, maintains a balance in communications between machines that is critical for customs. In recent years, the advancement of optical character recognition (OCR) technology has facilitated the extraction of key information from customs documents. However, applying existing OCR algorithms directly to customs documents results in low accuracy and fails to meet the requirements for information extraction.

In this study, we propose a hybrid mutual learning knowledge distillation method to address the challenges of low accuracy in text character recognition in customs scenarios. Taking into consideration the diverse text characteristics in various fields in CUADs, we utilize different character dictionaries to fine-tune and optimize a generic OCR model for specific scenarios. Additionally, we introduce the iCDKG-PostOCR method, which integrates corrections based on a customs domain knowledge graph to enhance the accuracy of field results in OCR within the customs context.

Our proposed method represents a crucial advancement in intelligent information processing for identifying customs risks. Building on the methods proposed in this paper, we have developed an intelligent and procedural system for extracting information from unstructured customs accompanying documents (IES-CUAD). This system has been deployed and applied in Tianjin Customs in China to serve as an application demonstration for customs risk identification technology research. In summary, the primary contributions of this paper are as follows:

We constructed a dataset of unstructured customs documents based on real customs business scenarios and proposed an optimization method for OCR models (HMLKD) and a post-OCR text recognition correction method (iCDKG-PostOCR).
We evaluated the recognition accuracy of public OCR models on the CUSD-RBS dataset and selected a benchmark model that balances performance and inference speed. The effectiveness of the HMLKD and iCDKG-PostOCR methods was validated on this model, and the advantages and disadvantages of the methods were analyzed. The study achieved an accuracy of over 94% for three types of fields.
We proposed a new perspective on the intelligent processing of documents and offered ideas and a framework for handling customs unstructured accompanying documents, which is a topic that has received less research attention. This can serve as a reference for other researchers.

The remaining structure of this paper is as follows: a review of relevant work is in Section 2, the methodology is in Section 3, evaluation experiments are in Section 4, and the conclusions are in Section 5.

2. Related Works

2.1. OCR (Optical Character Recognition)

In recent years, driven by the rapid advancement of deep learning technologies, optical character recognition (OCR) methodologies have shifted from template-matching approaches [5] to deep neural network (DNN)-based methods [6]. Presently, mainstream OCR systems are categorized into end-to-end models and two-stage frameworks. The end-to-end model integrates the processes of text detection and recognition, as initially proposed in [7], by combining an attention mechanism [8] with a CRNN. Tang et al. implemented the integrated combination of YOLOv5 and OCR [9]. On the other hand, the two-stage framework separates the tasks of detection and recognition. In recent years, DBNet [10] has achieved efficient and accurate text detection. For text recognition, popular methods include CRNN [11] as well as approaches that utilize CNN [12] or ViT [13] as feature extractors. A study by Semkovych and Shymanskyi demonstrates the effectiveness of a method that combines CRNN, LSTM, and CTC. This method, though simple, outperforms standalone CNN/RNN approaches by approximately 15% and can decode handwritten text using low-power devices [14]. SVTR [15] proposes a unified visual model for recognizing scene text that balances inference speed and accuracy.

With the acceleration of the digitization process, the extraction and digitized management of unstructured document information has become increasingly important. A study by Santamaria et al. presents a novel approach that combines image processing techniques, OCR, and OMR for digitizing historical music books [16]. The work of Srinidhi et al. [17] on information extraction and correction of medical documents, Shriansh [2] on bank check recognition, Tam-An Vo-Nguyen [18] on bank statement table detection and extraction, Prateek [19] on bank check character recognition and signature verification, the intelligent processing framework for legal documents established by Sagar Chakraborty [1], and the intelligent system for cross-border customs clearance developed by Han [20] have all contributed to the advancement of digital transformation. However, the current state of information extraction for CUADs is still unsatisfactory. Moreover, the previously mentioned systems primarily focus on machine-to-human (M2H) interactions and pay little attention to machine-to-machine (M2M) application scenarios [21]. In environments that require unmanned monitoring and stable operation, M2M devices are widely used in fields such as electricity and transportation to perform tasks such as monitoring and communication [22]. Typical M2M systems can be classified as dynamic M2M systems or static M2M systems [23]. The latter require high security, robustness, and availability in device-to-device communication [24], making them essential for energy, industrial, and customs scenarios.

2.2. Text Correction

While OCR allows for the recognition of optical characters, its accuracy often needs improvement [25], necessitating OCR result correction to enhance the output quality. Lexical approaches and character-level metrics such as the Damerau–Levenshtein edit distance based on corpora [25,26] are widely used for corrections. Rijhwani et al. [27] proposed an encoder–decoder-based approach to reduce language recognition errors in endangered languages through OCR. Karthikeyan et al. [17] employed the RoBERTa model, which was pre-trained using self-supervision, to reduce word error rates in medical documents. Francois [28] achieved an accuracy of 84% for correcting engineering documents using a clustering-based post-OCR method.

2.3. Knowledge Distillation

In the era of big data, adjusting network model parameters to improve performance also increases model complexity, resulting in parameter redundancy in most deep neural networks. Knowledge distillation, proposed by Hinton et al. [29], serves as a method for compressing models. It involves guiding a “teacher model” to instruct a “student model” in learning specific tasks. This approach aims to maintain a lightweight model while improving performance and enhancing the model’s ability to generalize and learn efficiently. Subsequently, Zhang et al. [30] introduced a mutual learning approach between models, wherein teacher and student models engage in reciprocal teaching. Experimental results validate that this method achieves superior performance compared to training individual models.

3. Methodology

3.1. Overall Process of CUAD Information Extraction

The process of extracting unstructured information from Customs Unstructured Accompanying Documents (CUADs) involves the following steps: template-based region segmentation, text detection, text recognition, text correction, and results output. The system uses the XML information embedded in the PDF files to determine the document’s template type and to segment the areas where the fields are located. Text detection is carried out using the open-source DBNet. Subsequently, text recognition, correction, and verification are carried out, followed by the output of high-confidence results. The pipeline is depicted in Figure 1. The current system is capable of processing various types of CUADs, such as health certificates from Brazil and Ecuador. This study uses a health certificate from Brazil as an example for illustrative purposes.

3.2. Customs Domain Knowledge Graph

As an advanced form of knowledge representation, a knowledge graph (KG) enables the establishment of efficient data management systems [31,32] and has been widely applied in various industrial use cases [33]. A domain knowledge graph is a data system constructed based on industry-specific or vertical domain knowledge. It is tailored to a specific industry to facilitate information integration and the construction of knowledge systems [34]. We developed a knowledge graph for the customs domain that integrates structured relationships between entities and a set of calibration rules tailored for each field, drawing from customs business information.

In the Customs Domain Knowledge Graph (CDKG), each field is considered an entity within the knowledge graph. Each entity corresponds to six types of information, including attribute and relationship information. Attribute information includes the name, country, document type, field type, and rules. Relationship information includes hierarchical relationships and related relationships. Before being output, the system validates the corrected final results based on the rules in the CDKG and returns a status code indicating the confidence level of the final results. Only results labeled as “Valid” will be utilized in the subsequent stages of the customs risk identification system, while results with other status codes will be ignored.

The CDKG can be represented as a directed graph $G = (V, E)$, where V represents the set of nodes, with $v_{1}, v_{2}, \dots, v_{n} \in V$ representing the entity labels of the content to be extracted. E represents the set of edges, with $e_{1}, e_{2}, \dots, e_{n} \in E$ representing the relationships between entities, as shown in Figure 2.

When fields need to be corrected, based on the attribute labels corresponding to the fields, the system traverses or conditionally judges the rule nodes in the CDKG from the query node q to find the rule with the highest match for execution. Each path $p_{i}$ can be represented as a node sequence $p_{i} = (v_{1}, v_{2}, \dots, v_{n})$ satisfying $v_{1} = q, v_{m} \in V$. Each path $p_{i}$ has an equal weight $w_{i} = 1$, meaning that each rule is equal for each field.
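The path search described above can be sketched in a few lines. This is only an illustration under stated assumptions: the node names, the relation names, and the `rule:` prefix used to mark rule nodes are hypothetical, since the paper does not publish the CDKG schema.

```python
from collections import deque

# Hypothetical miniature CDKG: node -> list of (relation, neighbour) edges.
# Node and relation names are invented for illustration only.
CDKG = {
    "certificate_number": [
        ("has_rule", "rule:code-A"),
        ("has_rule", "rule:code-B"),
        ("belongs_to", "health_certificate_BR"),
    ],
    "health_certificate_BR": [("issued_by", "Brazil")],
    "rule:code-A": [],
    "rule:code-B": [],
    "Brazil": [],
}


def rule_paths(graph, query):
    """Return every path (v_1 = query, ..., v_n) that ends at a rule node."""
    paths, queue = [], deque([[query]])
    while queue:
        path = queue.popleft()
        for _relation, neighbour in graph.get(path[-1], []):
            if neighbour in path:  # skip cycles
                continue
            new_path = path + [neighbour]
            if neighbour.startswith("rule:"):
                paths.append(new_path)
            else:
                queue.append(new_path)
    return paths


paths = rule_paths(CDKG, "certificate_number")
weights = [1] * len(paths)  # w_i = 1: every rule path carries the same weight
```

Because all weights are equal, choosing the rule to execute depends only on how well each candidate rule matches the field content, not on the path that led to it.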

3.3. Hybrid Mutual Learning Knowledge Distillation Method

3.3.1. Baseline

In customs risk control operations, it is essential for the model’s inference speed to reach a certain level in order to process a large number of concurrent documents and meet the timeliness requirements. After conducting a comprehensive comparative analysis in our evaluation experiments, we selected SVTR [15] as the text recognition network. MobileNetV1-Enhanced is utilized for feature extraction during practical training and inference. It is a multi-layer lightweight deep convolutional network designed to operate on devices with lower computational capabilities while ensuring that the network’s performance does not significantly decline. The architecture of the text recognition network is illustrated in Figure 3.

3.3.2. GTC Strategy

This approach combines two mainstream methods in text recognition: namely, attention and CTC (connectionist temporal classification) [35,36]. Combining the two methods to integrate multiple features during training leads to more accurate and efficient models. The attention mechanism is only utilized during training and is not used during inference, thereby avoiding additional time costs. The training loss function for this approach is illustrated in Figure 4.

3.3.3. Minimal Dictionary

The number of characters in the dictionary determines the number of output classes of the fully connected (FC) layer during training. Filtering the characters in the dictionary can therefore effectively reduce the dimensionality of the FC layer and the scale of the network parameters. We collect the characters that occur in CUADs and perform a character cleaning operation before training the model for distillation. Suppose there are N CUADs, CUAD₁, CUAD₂, …, CUAD_N, each represented by its set of K characters, such as CUAD₁ = {C₁, C₂, …, C_K}. Taking the union of the character sets of the N documents gives the minimal dictionary containing every character that occurs:

D_{min} = \bigcup_{i = 1}^{N} CUAD_{i} = CUAD_{1} \cup \dots \cup CUAD_{N}

(1)

If a field has a dictionary associated with it in the CDKG and the dictionary contains N words W₁, W₂, …, W_N, with each word containing K characters, such as W₁ = {C₁, C₂, …, C_K}, then the minimal dictionary for this field is

D_{min} = \bigcup_{i = 1}^{N} W_{i} = W_{1} \cup \dots \cup W_{N}

(2)
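Equations (1) and (2) are plain set unions, which a short sketch can make concrete. The sample strings below are invented placeholders, not real CUAD content, and the function name is our own.

```python
def minimal_dictionary(texts):
    """Union of the character sets of all texts, sorted for a stable class order."""
    charset = set()
    for text in texts:
        charset |= set(text)
    return sorted(charset)


# Hypothetical field contents standing in for CUAD_1 ... CUAD_N (or W_1 ... W_N).
cuads = ["BR-0012", "NET 20,7 KG"]
d_min = minimal_dictionary(cuads)
num_classes = len(d_min)  # size of the FC output layer implied by D_min
```

Sorting only fixes the index assigned to each character. A CTC-based recognizer would typically add one blank class on top of `num_classes`.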

3.3.4. Mutual Learning Distillation

When an SVTR model that has been pre-trained on large-scale public datasets is used directly for text recognition in CUADs, the accuracy is low. Therefore, we use a knowledge distillation optimization method based on the mutual learning idea [30] to compress the redundant parameters of the model and reduce its complexity, as shown in Figure 5, where $\Theta_{1}$ and $\Theta_{2}$ denote the two models. This mutual learning method creates a symmetric state of learning among the models: information flows in both directions rather than in only one.

The text recognition problem can be viewed as a multi-class classification problem, where the number of characters in the dictionary is the number of classes M. The input image for each character is denoted as $x_{i} \in R^{H \times W \times 3}$, $i = 1, 2, \dots, N$, where H denotes the height of the input image and W denotes its width. For each character, the corresponding label is denoted as $y_{i} \in R^{M \times 1}$, $i = 1, 2, \dots, N$. For a sample $x_{i}$, the probability of each category m is

p^{m} (x_{i}) = \frac{\exp (z_{m})}{\sum_{j = 1}^{M} \exp (z_{j})}

(3)

where $z_{m}$ denotes the logit for class m, that is, the network output fed to the softmax function, and the class with the highest predicted probability is the output character. The loss function for a general multi-class classification problem is defined as the cross-entropy loss L between the predicted values and the labels:

L = - \sum_{i = 1}^{N} \sum_{m = 1}^{M} I (y_{i}, m) \log (p^{m} (x_{i}))

(4)

where I is defined as

I (y_{i}, m) = \begin{cases} 1 & \text{if } y_{i} = m \\ 0 & \text{if } y_{i} \neq m \end{cases}
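As a worked illustration of Equations (3) and (4), the sketch below computes the softmax probabilities and the cross-entropy loss in pure Python. It assumes that each label is given as a class index, so the indicator I simply selects the probability of the true class; the logit values are arbitrary.

```python
import math


def softmax(logits):
    """Eq. (3): turn M logits into class probabilities."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Eq. (4): sum over samples of -log p^{y_i}(x_i); labels are class indices."""
    return -sum(math.log(softmax(z)[y]) for z, y in zip(batch_logits, labels))


probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]], [0, 1])
```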

Suppose the two models are $\Theta_{1}$ and $\Theta_{2}$, and their predictions are $p_{1}$ and $p_{2}$, respectively. The difference between the distributions predicted by the two models is measured using the Kullback–Leibler (KL) divergence, which is called the distillation mutual loss $L_{DML}$ and is calculated as

L_{DML} = D_{KL} (p_{2} \| p_{1}) = \sum_{i = 1}^{N} \sum_{m = 1}^{M} p_{2}^{m} (x_{i}) \log \frac{p_{2}^{m} (x_{i})}{p_{1}^{m} (x_{i})}

(5)
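Equation (5) can be checked numerically with a minimal sketch. The two probability vectors are made-up values, and the small `eps` term is our own addition to guard against division by zero.

```python
import math


def kl_divergence(p2, p1, eps=1e-12):
    """D_KL(p2 || p1) for two discrete distributions over the same M classes."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p2, p1) if a > 0)


def dml_loss(batch_p2, batch_p1):
    """Eq. (5): sum the per-sample KL divergences over the batch."""
    return sum(kl_divergence(p2, p1) for p2, p1 in zip(batch_p2, batch_p1))


p1 = [[0.7, 0.2, 0.1]]  # hypothetical predictions of model 1
p2 = [[0.6, 0.3, 0.1]]  # hypothetical predictions of model 2
loss_2_to_1 = dml_loss(p2, p1)
loss_1_to_2 = dml_loss(p1, p2)
```

The divergence is zero when both models agree and is not symmetric in its arguments, which is why mutual learning evaluates it in both directions.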

The loss function during the training process consists of multiple components. After the FC layer in the SVTR network, the CTC module produces the prediction $p_{CTC}$, which is used to calculate the CTC loss $L_{CTC}$.

The 2D attention module [37] guides the model to learn better alignment and feature representation through attention, enabling fast inference speed and robust, accurate predictions. Its output is denoted as $p_{A}$, and the corresponding attention loss is denoted as AL. The output of this module is a vector with the same dimension as the label, and AL is defined as the cross-entropy between the output of the attention module and the label.

In this paper, we introduce a concept called feature distillation loss (FDL) [38] to optimize the parameters of the feature extraction module and enhance its performance. We use the mean squared error (MSE) between the outputs of the feature extraction modules of two models as the evaluation metric. By calculating the MSE, we obtain the result of FDL and use it to guide parameter optimization of the feature extraction module.

Assume the existence of two models $\Theta_{1}$ and $\Theta_{2}$. We can represent their respective loss functions as follows:

The self-loss of $\Theta_{1}$ is denoted as:

L_{1} = FDL + L_{CTC_{1}} + AL_{1}

(6)

The mutual distillation loss of $\Theta_{1}$ is denoted as:

L_{DML_{1}} = w_{1} \times D_{KL} (p_{A_{2}} \| p_{A_{1}}) + w_{2} \times D_{KL} (p_{CTC_{2}} \| p_{CTC_{1}})

(7)

Similarly, the self-loss of $\Theta_{2}$ is:

L_{2} = FDL + L_{CTC_{2}} + AL_{2}

(8)

The mutual distillation loss of $\Theta_{2}$ is denoted as:

L_{DML_{2}} = w_{1} \times D_{KL} (p_{A_{1}} \| p_{A_{2}}) + w_{2} \times D_{KL} (p_{CTC_{1}} \| p_{CTC_{2}})

(9)

where $w_{1}$ and $w_{2}$ denote the weight coefficients. The mutual distillation loss between the two models is:

L_{DML} = \frac{L_{DML_{1}} + L_{DML_{2}}}{2}

(10)

Finally, the overall loss function can be represented as:

L_{Train} = L_{1} + L_{2} + L_{DML}

(11)
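To show how Equations (6) to (11) combine, the sketch below assembles the overall loss from already-computed scalar terms. The function and argument names are our own, and the default weights of 1.0 are an assumption, since the values of w₁ and w₂ are not stated in this section.

```python
def hmlkd_total_loss(fdl, ctc1, al1, ctc2, al2,
                     kl_att_2_1, kl_ctc_2_1, kl_att_1_2, kl_ctc_1_2,
                     w1=1.0, w2=1.0):
    """Combine the per-model and mutual terms into L_Train (Eqs. 6-11)."""
    l1 = fdl + ctc1 + al1                       # Eq. (6): self-loss of model 1
    l2 = fdl + ctc2 + al2                       # Eq. (8): self-loss of model 2
    dml1 = w1 * kl_att_2_1 + w2 * kl_ctc_2_1    # Eq. (7): model 1 learns from model 2
    dml2 = w1 * kl_att_1_2 + w2 * kl_ctc_1_2    # Eq. (9): model 2 learns from model 1
    dml = (dml1 + dml2) / 2                     # Eq. (10): averaged mutual loss
    return l1 + l2 + dml                        # Eq. (11): overall training loss


# Arbitrary example values, only to exercise the arithmetic.
total = hmlkd_total_loss(0.1, 1.0, 0.5, 1.2, 0.6, 0.02, 0.04, 0.03, 0.05)
```

In a real training loop these scalars would be differentiable tensors produced by the two networks; plain floats are used here only to make the arithmetic visible.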

During training, each iteration of each batch will mutually learn and update the model parameters based on the loss function. The optimization details are summarized in Algorithm 1.

3.4. iCDKG-PostOCR Methodology

In this paper, all fields are divided into three categories: code text fields, character text fields, and numerical text fields. According to the label of the fields, the rules are found along the paths in the CDKG and used for rule inference to realize the correction of various types of fields.

In code text fields, OCR is prone to confusing visually similar characters, for example recognizing “I0” as “10”, and it may also miss the “-” symbol. However, with the assistance of the various coding rules documented in the CDKG, the system can identify the most suitable rule and utilize it to direct the correction process, as illustrated in Figure 6.

Algorithm 1: The Training Procedure of HMLKD

In the CDKG, each character text field is mapped to a dictionary that contains all possible information that this field may have. By incorporating the Levenshtein edit distance strategy, OCR recognition results can be corrected for character errors. The correct result is shown in Figure 7.
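The dictionary lookup with edit distance can be sketched as follows. The product names and the `max_distance` threshold are hypothetical, since the paper does not list the field dictionaries or state a cut-off.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_with_dictionary(ocr_text, dictionary, max_distance=2):
    """Return (closest entry, distance), or (None, distance) if nothing is close enough."""
    best = min(dictionary, key=lambda word: levenshtein(ocr_text, word))
    distance = levenshtein(ocr_text, best)
    return (best, distance) if distance <= max_distance else (None, distance)


# Hypothetical dictionary for a "Name of Product" field.
product_names = ["FROZEN CHICKEN WINGS", "FROZEN CHICKEN PAWS", "FROZEN BEEF"]
corrected, dist = correct_with_dictionary("FR0ZEN CHICKEN PAW5", product_names)
```

Rejecting matches beyond a threshold keeps the corrector from mapping an unrelated string onto a dictionary entry, which matters when only “Valid” results are passed on.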

The CDKG records various numerical representations, including Spanish numerical representations. For example, “20.7” is written as “20,7” in Spanish format. In addition to correcting errors, it is necessary to convert numerical representations in other languages to the most widely used Arabic numeral representation. In the example in Figure 8, the misrecognition of “,” as “.” has been corrected and the accurate result is displayed.
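A minimal sketch of the format conversion is given below. It assumes that the source format uses “,” as the decimal separator and “.” only as a thousands separator, and it rejects anything else instead of guessing; the actual numeric rules stored in the CDKG are not published, so the pattern is illustrative.

```python
import re


def normalize_decimal(text):
    """Convert a decimal-comma number such as '20,7' to '20.7'; return None if it does not fit."""
    cleaned = text.strip().replace(" ", "")
    grouped = re.fullmatch(r"\d{1,3}(\.\d{3})*(,\d+)?", cleaned)  # e.g. 1.234,5
    plain = re.fullmatch(r"\d+(,\d+)?", cleaned)                  # e.g. 20,7 or 1234
    if grouped or plain:
        return cleaned.replace(".", "").replace(",", ".")
    return None


value = normalize_decimal("20,7")
```

A string that matches neither pattern, such as “20.7” in a field whose rule expects a decimal comma, is returned as None so that it can be flagged for correction rather than silently accepted.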

By employing the aforementioned approach, we can quickly match multiple fields in the documents with their corresponding rules, enabling efficient field correction and validation.

4. Evaluation Experiments

4.1. Benchmark Dataset

With the assistance of the customs business side, we have developed the Customs Unstructured Document Dataset, which is based on real-world business scenarios (CUSD-RBS Dataset).

In this work, 462 scans of documents were used to train and test the proposed scheme. This batch of documents contains 229 Ecuadorian health certificates and 233 Brazilian health certificates, and we organized the dataset into fields. It was divided into a training set, a validation set, and a test set according to an 8:1:1 ratio.

A sample of the Brazilian documents is shown in Figure 9. It consists of multiple field areas. The specific content of the fields is not shown due to privacy concerns. Each field target area contains multiple types of languages. We have identified some specific fields that require extraction, including Certificate Number (code text), Name of Product (character text), and Net Weight (numeric text).

4.2. Metrics

In this work, drawing on concepts such as the character error rate (CER) proposed by Carrasco et al. [39] and on the actual needs of customs M2M scenarios, we propose the field error rate (FER). For a batch of CUADs, it is the ratio of the number of fields whose OCR results are inconsistent with the label to the total number of fields:

FER = \frac{\text{Misrecognised Fields in Datasets}}{\text{Total Number of Fields}}

(12)

We developed the FER because the CER is poorly suited to evaluating customs scenarios. For coded fields, for example, even a single misrecognized character can cause considerable disturbance and misjudgment in the risk screening process.

The confidence rate (CR) is used to measure the reliability of the output results; as mentioned earlier, only the results given a “Valid” code will be output. The calculation method of this indicator is the same as that of precision, but the meaning is different. The formula is as follows:

CR = \frac{TruePositive_{output}}{TruePositive_{output} + FalsePositive_{output}}

(13)

where a true positive (TP) is a correct result that is output with the “Valid” code, and a false positive (FP) is an incorrect result that is output with the “Valid” code.
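Both metrics can be computed directly from field-level predictions. The sketch below follows Equations (12) and (13); the sample values and the “Invalid” status string are invented for illustration, while “Valid” is the status named in the text.

```python
def field_error_rate(predictions, labels):
    """Eq. (12): share of fields whose recognized text differs from the label."""
    wrong = sum(pred != truth for pred, truth in zip(predictions, labels))
    return wrong / len(labels)


def confidence_rate(predictions, labels, statuses):
    """Eq. (13): among fields output as 'Valid', the share that is actually correct."""
    valid = [(p, t) for p, t, s in zip(predictions, labels, statuses) if s == "Valid"]
    if not valid:
        return None  # undefined when nothing was output
    true_positive = sum(p == t for p, t in valid)
    return true_positive / len(valid)


preds = ["A1", "B2", "C3", "D4"]
truth = ["A1", "B2", "C9", "D4"]
status = ["Valid", "Valid", "Invalid", "Valid"]
fer = field_error_rate(preds, truth)
cr = confidence_rate(preds, truth, status)
```

In this toy batch one field in four is wrong, yet the CR stays at 1.0 because the wrong field was not marked “Valid”. This is the sense in which CR measures the reliability of what is output rather than the overall recognition accuracy.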

4.3. Image Pre-Processing

In order to improve the image quality, images are pre-processed after being input. For skewed images, the program performs a skew detection and calibration process. In order to improve the accuracy of text detection and text recognition, the image is also preprocessed in two steps: (1) the image is stretched to a certain extent to increase the pixel distances between the lines of text, and (2) the input image is padded so that the height of the image is filled to 320, which enables the model to capture the position of the text regions of the image more easily and avoids text being excluded due to being on an edge. The image is then resized to 320 × 48 × 3 and sent into the subsequent process. The pre-processed image is shown in Figure 10.

4.4. Evaluation Experiments and Results

4.4.1. OCR Performance Evaluation

In this part of the experiment, we measure the overall performance of several current OCR models on the CUSD-RBS Dataset. According to the usage requirements, we selected several key indicators: FER, parameter size (M), and inference speed (s/image; given the actual hardware conditions, the inference speed on the CPU is used as a reference). A comparison of the results of the models is shown in Table 1.

According to the experimental results in Table 1, the models perform similarly on the character text type fields. SVTR achieves a better FER on the code text type fields and numeric text type fields, and it also has an advantage in inference speed, so that when multiple documents on the customs file server are waiting to be processed at the same time, congestion and wait timeouts will not occur. Overall, SVTR is the benchmark model that best meets the needs of customs scenarios. We train the SVTR model using the configuration in Table 2, and the parameters are set as shown in Table 3.

The two models in the mutual learning approach are represented by Net-1 and Net-2, and we choose Net-2 as the final distilled model. The performance of the model after distillation is tested, and the results are shown in Table 1. The experimental results show that after tuning the parameters of the model with the HMLKD method, the error rate of the fields in the customs dataset decreases compared with the baseline, which demonstrates the effectiveness of the distillation training. Below are the training curves for the three types of fields.

4.4.2. Code Text Type

From Table 4, it can be observed that for the code text type fields, fine-tuning the pre-trained model with the minimal dictionary leads to a decrease in performance. This is because the dimensions of the FC layer in the pre-trained model are predetermined based on the number of characters in the original character dictionary. When the size of the character dictionary is reduced, the model needs to adapt to the new dictionary, which may result in certain specific characters or character combinations not being adequately learned during the new training process. This can lead to slow parameter updates and hinder model convergence. Furthermore, encoded texts often contain shorter sequences of numeric or character text and lack the rich contextual and semantic information present in normal language texts. Reducing the character dictionary size may limit the contextual information available for the model to comprehend and predict encoded texts accurately. Therefore, for code text fields, fine-tuning the pre-trained model with the original character dictionary ensures good performance in terms of both inference time and accuracy. In the distillation process of the baseline model for the code text type fields, the model converges after around 100 epochs. Figure 11 illustrates the loss and accuracy trends during the training process of the model.

4.4.3. Character Text Type

Table 5 shows that, when using the same dictionary, the accuracy of character texts is lower than that of code texts. This is mainly because code texts are primarily composed of numbers and letters. Compared with complex long texts, the structure of code texts is simpler, and the contextual information is more direct. According to statistics, the average length of character text is typically 1.5–2.5 times longer than that of code text. When dealing with long texts and limited data availability, it becomes challenging to provide the ample contextual information required by the model. For code texts, the model requires a relatively small amount of information and can reach a higher upper limit of accuracy. This allows the model to easily acquire and comprehend information, thereby enhancing the accuracy of predictions. When using the minimal dictionary, the accuracy of character text is higher than that of code text. Our analysis suggests that, with a limited dataset, character text provides more comprehensive contextual semantic information, enabling the model to better adjust its parameters to align with a new dictionary. We conducted knowledge distillation training using the reduced dictionary. The model parameters were reduced from 20.2 M to 16.9 M by decreasing the dimension of the fully connected (FC) layer from 6625 to 76. This approach accelerates inference while maintaining performance. The distillation process reaches convergence in roughly 200 epochs. The loss function and accuracy during training are shown in Figure 12.

4.4.4. Numeric Text Type

Fields of this type have similar characteristics to the code text type. If the minimal dictionary is used in the training process, it is difficult for the model to converge to a high accuracy within a limited number of epochs, although the results in Table 6 show that the inference speed improves slightly. Therefore, under comprehensive consideration, fine-tuning is performed on the original pre-trained SVTR to obtain a model that combines high accuracy and fast inference speed. The distillation process reaches convergence in roughly 50 epochs. Its loss function and accuracy during training are shown in Figure 13.

To verify the effectiveness of the parameter optimization method proposed in this paper, we also evaluated it on two public datasets: IIIT5K [43] and SVT [44]. The results are shown in Table 7.

The results show that the method significantly improves text recognition on both public datasets. The model optimized using knowledge distillation with the minimal dictionary reduces FER while also significantly reducing the number of model parameters. The parameter optimization method can therefore be chosen flexibly according to the usage scenario.
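One way to make this choice explicit is sketched below. The figures are copied from Tables 4–6; the selection rule (lowest FER first, inference time as a tie-breaker) is an assumption introduced here for illustration, not a procedure stated by the authors.

```python
# (FER in %, inference time in s/image) copied from Tables 4-6.
RESULTS = {
    "code":      {"with_min_dict": (41.70, 1.10), "without_min_dict": (13.89, 1.11)},
    "character": {"with_min_dict": (28.34, 1.25), "without_min_dict": (31.70, 1.49)},
    "numeric":   {"with_min_dict": (66.67, 0.82), "without_min_dict": (8.34, 0.83)},
}

def pick_variant(field_type: str) -> str:
    """ASSUMED rule: lowest FER wins; ties are broken by inference time."""
    variants = RESULTS[field_type]
    # Tuples compare lexicographically, so FER is compared before time.
    return min(variants, key=lambda name: variants[name])

for field_type in RESULTS:
    print(field_type, "->", pick_variant(field_type))
```

Under this rule, only the character text model keeps the minimal dictionary, which agrees with the parameter counts listed for the three HMLKD-SVTR models in Table 1.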

4.5. Post-OCR

This part of the experiment demonstrates the correction performance of the system on the three kinds of fields. The results below are statistics collected while the system was deployed in a real customs business environment.

4.5.1. Code Text Type

When the OCR results are input, if they can match a specific coding rule completely, the system will mark the result as high confidence and output it. When there is no exact match, the system will match it with the closest coding rule and correct it, and it will output the result with a lower confidence than that of a complete match. The correction results are shown in Table 8.

Without post-OCR processing, the FER is relatively high (13.89%) because it is difficult to recognize every letter, number, and symbol within a code field correctly. After the rule-based correction, the FER falls to 2.30%. The confidence level is consistently high both before and after correction, indicating that the model ensures that the final output results conform to the rules recorded in the knowledge graph. Incorrect fields that do not adhere to the rules are effectively filtered out.

The few exceptional cases that were not correctly filtered out are due to certain characters being recognized incorrectly but still matching a specific rule, resulting in them being assigned the “Valid” code. This issue will be further addressed and optimized in future research endeavors.
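The matching and correction logic described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two rule templates, the character confusion tables, and the function names are all hypothetical and do not come from the CDKG.

```python
# ASSUMPTION: rules are templates where 'L' = uppercase letter, 'D' = digit,
# and any other character is a literal. Both rules are hypothetical.
RULES = ["LLDDDDDD", "DDDD-LL"]

# ASSUMPTION: illustrative OCR confusion pairs.
TO_DIGIT = {"O": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}

def fits(ch, slot):
    """Check whether one character is allowed in one template slot."""
    if slot == "L":
        return ch.isascii() and ch.isalpha() and ch.isupper()
    if slot == "D":
        return ch.isascii() and ch.isdigit()
    return ch == slot

def mismatches(text, rule):
    """Positions that violate the rule, or None if the lengths differ."""
    if len(text) != len(rule):
        return None
    return [i for i, (ch, slot) in enumerate(zip(text, rule)) if not fits(ch, slot)]

def correct_code(text):
    """Return (value, status) with status 'high', 'low', or 'invalid'."""
    best = None
    for rule in RULES:
        bad = mismatches(text, rule)
        if bad is None:
            continue
        if not bad:
            return text, "high"          # exact match with a rule
        if best is None or len(bad) < len(best[1]):
            best = (rule, bad)           # closest rule so far
    if best is None:
        return text, "invalid"
    rule, bad = best
    chars = list(text)
    for i in bad:
        if rule[i] == "D":
            chars[i] = TO_DIGIT.get(chars[i], chars[i])
        elif rule[i] == "L":
            chars[i] = TO_LETTER.get(chars[i], chars[i])
    fixed = "".join(chars)
    if mismatches(fixed, rule) == []:
        return fixed, "low"              # corrected, lower confidence
    return text, "invalid"

print(correct_code("AB123456"))
print(correct_code("AB1234S6"))
```

The sketch also reproduces the limitation noted above: if a digit is misread as another digit, the string still matches the rule exactly and is returned with high confidence.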

4.5.2. Character Text Type

These fields represent strings with clear semantics. The effect of the calibration is shown in Table 9.

The results show that, without post-OCR processing, OCR has a relatively high FER (28.34%) for character-type fields because the text is long. With the aid of the dictionary, incorrectly recognized words can be corrected, which lowers the FER to 3.45%. Because the dictionary contains all the possible words of the text as a reference, there is almost no misclassification when these fields are validated, and their CR reaches 100%.
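Dictionary-assisted correction of this kind can be sketched with the Python standard library. The word list below is a hypothetical stand-in for the CDKG dictionary, and the `difflib` similarity ratio is used here in place of whatever matching measure the authors apply; the cutoff of 0.6 is likewise an assumption.

```python
import difflib

# ASSUMPTION: a tiny hypothetical word list standing in for the CDKG dictionary.
DICTIONARY = ["HEALTH", "CERTIFICATE", "FROZEN", "BEEF", "PRODUCT"]

def correct_word(word, cutoff=0.6):
    """Return (word, is_valid); unknown words are replaced by the closest entry."""
    if word in DICTIONARY:
        return word, True
    close = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
    if close:
        return close[0], True
    return word, False

def correct_field(text):
    """Correct every word in a field; the field is valid only if all words are."""
    results = [correct_word(w) for w in text.upper().split()]
    corrected = " ".join(w for w, _ in results)
    return corrected, all(ok for _, ok in results)

print(correct_field("HEA1TH CERTIFICATE"))
print(correct_field("XQZW"))
```

Because every accepted output word is drawn from the dictionary, a field is either mapped to known vocabulary or flagged as invalid, which is consistent with the high CR reported for this field type.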

4.5.3. Numeric Text Type

Numeric text type fields represent numeric expressions that are a combination of punctuation marks and numbers. The correction effect is shown in Table 10.

The results show that the FER of numeric fields is already relatively low before post-OCR processing (8.34%) because OCR is comparatively reliable on numeric text. Recognition errors still occur occasionally, such as recognizing “1” as “l” or “0” as “O”. Owing to the limitations of the OCR model itself, some errors in the recognition of numbers cannot be corrected by rules, so the confidence level cannot reach 100%.
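The difference between errors that rules can and cannot catch is illustrated by the sketch below. The numeric pattern and the confusion table are assumptions chosen for illustration; the actual rules stored in the CDKG are not given in this section.

```python
import re

# ASSUMPTION: a hypothetical rule for numeric expressions such as "1,250.50" or "20.5".
NUMERIC_RULE = re.compile(r"\d{1,3}(,\d{3})*(\.\d+)?|\d+(\.\d+)?")

# Letters commonly confused with digits (the text mentions l/1 and O/0).
CONFUSABLE = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})

def correct_numeric(text):
    """Return (value, status) with status 'high', 'low', or 'invalid'."""
    if NUMERIC_RULE.fullmatch(text):
        return text, "high"
    fixed = text.translate(CONFUSABLE)
    if NUMERIC_RULE.fullmatch(fixed):
        return fixed, "low"
    return text, "invalid"

print(correct_numeric("l,25O.50"))   # letter/digit confusion: repairable
print(correct_numeric("1,258.50"))   # digit/digit confusion: passes unnoticed
```

A letter misread in place of a digit violates the rule and can be repaired, whereas one digit misread as another still satisfies the rule, so no rule-based check can detect it.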

Overall, after the OCR recognition and correction process, a high recognition accuracy can be achieved for multiple types of customs fields.
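The accuracies quoted in the abstract follow directly from the FER values in Tables 8–10. The short sketch below recomputes them, taking accuracy as 100% minus FER, which is the relation implied by the reported figures.

```python
# FER (%) before and after iCDKG-PostOCR, copied from Tables 8-10.
FER_BEFORE = {"code": 13.89, "character": 28.34, "numeric": 8.34}
FER_AFTER = {"code": 2.30, "character": 3.45, "numeric": 4.00}

def accuracy(fer_percent):
    """Field accuracy in percent, taken as the complement of FER."""
    return round(100.0 - fer_percent, 2)

for field in FER_AFTER:
    drop = round(FER_BEFORE[field] - FER_AFTER[field], 2)
    print(f"{field}: accuracy {accuracy(FER_AFTER[field]):.2f}%, "
          f"FER reduced by {drop:.2f} percentage points")
```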

5. Conclusions

In this paper, we addressed the problem of extracting unstructured information from CUADs and made use of knowledge distillation to optimize the OCR model. We also implemented a knowledge-graph-based post-OCR processing method. Experimental results demonstrate that we achieved field extraction accuracy of over 90% in real-world business production environments (97.70% for code text, 96.55% for character text, and 96.00% for numeric text after correction), which can serve as a reference for other researchers and engineers encountering similar problems.

However, the current system is not intelligent enough to segment target areas. In addition, because the available data are limited, the model lacks sufficient adaptability when faced with new types of documents and new application scenarios. Future research will therefore (1) investigate methods for automatically extracting regions of interest from complex documents to reduce the workload of manual annotation and (2) combine large language models with text recognition and correction verification methods to achieve high adaptability across a wider range of scenarios and file types.

Author Contributions

All authors contributed to the study conceptualization and design; methodology, H.W., Y.L. and F.T.; software, H.W., D.L. and R.L. (Ruilong Liu); writing—original draft preparation, H.W., D.L. and R.L. (Ruilong Liu); writing—review and editing, H.W. and Y.L.; visualization, H.W.; project administration, Z.W., Y.L., F.T. and R.L. (Ran Liu); funding acquisition, Z.W., Y.L., R.L. (Ran Liu) and F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant number 2021YFC3340500.

Data Availability Statement

Due to national security issues involving company and personal information, the data cannot be made publicly available.

Acknowledgments

The authors are grateful to other project participants for their cooperation and endeavors.

Conflicts of Interest

The authors report no disclosures.

Abbreviations

The following abbreviations are used in this manuscript:

CUAD	Customs Unstructured Accompanying Document
M2M	Machine-to-Machine
M2H	Machine-to-Human
OCR	Optical Character Recognition
HMLKD	Hybrid Mutual Learning Knowledge Distillation
CDKG	Customs Domain Knowledge Graph
CUSD-RBS	Customs Unstructured Documents based on Real-world Business Scenarios
FER	Field Error Rate
CR	Confidence Rate

References

1. Chakraborty, S.; Harit, G.; Ghosh, S. TransDocAnalyser: A framework for semi-structured offline handwritten documents analysis with an application to legal domain. In Proceedings of the International Conference on Document Analysis and Recognition, San Jose, CA, USA, 21–26 August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 45–62.
2. Srivastava, S.; Priyadarshini, J.; Gopal, S.; Gupta, S.; Dayal, H.S. Optical character recognition on bank cheques using 2D convolution neural network. In Proceedings of the Applications of Artificial Intelligence Techniques in Engineering: SIGMA 2018; Springer: Berlin/Heidelberg, Germany, 2019; Volume 2, pp. 589–596.
3. Pradipta, D.J.; Handayani, P.W.; Shihab, M.R. Evaluation of the customs document lane system effectiveness: A case study in Indonesia. In Proceedings of the 2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT), Surabaya, Indonesia, 9–11 April 2021; pp. 209–214.
4. Basir, A.; Satyadini, A.E.; Barata, A. Modern Customs Risk Management Framework: Improvement towards Institutional Reform. Int. J. Innov. Sci. Res. Technol. 2019, 4, 60–69.
5. Mori, S.; Suen, C.Y.; Yamamoto, K. Historical review of OCR research and development. Proc. IEEE 1992, 80, 1029–1058.
6. Subramani, N.; Matton, A.; Greaves, M.; Lam, A. A survey of deep learning approaches for ocr and document understanding. arXiv 2020, arXiv:2011.13534.
7. Lee, C.Y.; Osindero, S. Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2231–2239.
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 21–25.
9. Tang, X.; Wang, C.; Su, J.; Taylor, C. An elevator button recognition method combining YOLOv5 and OCR. CMC Comput. Mater. Cont. 2023, 75, 117–131.
10. Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11474–11481.
11. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304.
12. Borisyuk, F.; Gordo, A.; Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 71–79.
13. Atienza, R. Vision transformer for fast and efficient scene text recognition. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 319–334.
14. Santamaría, G.; Domínguez, C.; Heras, J.; Mata, E.; Pascual, V. Combining image processing techniques, OCR, and OMR for the digitization of musical books. In Proceedings of the International Workshop on Document Analysis Systems, La Rochelle, France, 21–28 February 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 553–567.
15. Du, Y.; Chen, Z.; Jia, C.; Yin, X.; Zheng, T.; Li, C.; Du, Y.; Jiang, Y.G. SVTR: Scene text recognition with a single visual model. arXiv 2022, arXiv:2205.00159.
16. Semkovych, V.; Shymanskyi, V. Combining OCR methods to improve handwritten text recognition with low system technical requirements. In Proceedings of the International Symposium on Computer Science, Digital Economy and Intelligent Systems, Wuhan, China, 11–13 November 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 693–702.
17. Karthikeyan, S.; de Herrera, A.G.S.; Doctor, F.; Mirza, A. An OCR post-correction approach using deep learning for processing medical reports. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2574–2581.
18. Vo-Nguyen, T.A.; Nguyen, P.; Le, H.S. An efficient method to extract data from bank statements based on image-based table detection. In Proceedings of the 2021 15th International Conference on Advanced Computing and Applications (ACOMP), Ho Chi Minh City, Vietnam, 24–26 November 2021; pp. 186–190.
19. Agrawal, P.; Chaudhary, D.; Madaan, V.; Zabrovskiy, A.; Prodan, R.; Kimovski, D.; Timmerer, C. Automated bank cheque verification using image processing and deep learning methods. Multimed. Tools Appl. 2021, 80, 5319–5350.
20. Han, C.; Wang, B.; Lai, X. Research on the construction of intelligent customs clearance information system for cross-border road cargo between Guangdong and Hong Kong. In Proceedings of the International Conference on AI-Generated Content, Shanghai, China, 25–26 August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 181–190.
21. Kim, J.; Lee, J.; Kim, J.; Yun, J. M2M service platforms: Survey, issues, and enabling technologies. IEEE Commun. Surv. Tutorials 2013, 16, 61–76.
22. Salama, R.; Altrjman, C.; Al-Turjman, F. An overview of the Internet of Things (IoT) and Machine to Machine (M2M) Communications. NEU J. Artif. Intell. Internet Things 2023, 2, 55–61.
23. Cao, Y.; Jiang, T.; Han, Z. A survey of emerging M2M systems: Context, task, and objective. IEEE Internet Things J. 2016, 3, 1246–1258.
24. Barki, A.; Bouabdallah, A.; Gharout, S.; Traore, J. M2M security: Challenges and solutions. IEEE Commun. Surv. Tutorials 2016, 18, 1241–1254.
25. Nguyen, T.T.H.; Jatowt, A.; Coustaty, M.; Doucet, A. Survey of post-OCR processing approaches. ACM Comput. Surv. CSUR 2021, 54, 1–37.
26. Damerau, F.J. A technique for computer detection and correction of spelling errors. Commun. ACM 1964, 7, 171–176.
27. Rijhwani, S.; Anastasopoulos, A.; Neubig, G. OCR post correction for endangered language texts. arXiv 2020, arXiv:2011.05402.
28. Francois, M.; Eglin, V.; Biou, M. Text detection and post-OCR correction in engineering documents. In Proceedings of the International Workshop on Document Analysis Systems, La Rochelle, France, 22–25 May 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 726–740.
29. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
30. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4320–4328.
31. Hao, X.; Ji, Z.; Li, X.; Yin, L.; Liu, L.; Sun, M.; Liu, Q.; Yang, R. Construction and application of a knowledge graph. Remote Sens. 2021, 13, 2511.
32. Abu-Salih, B. Domain-specific knowledge graphs: A survey. J. Netw. Comput. Appl. 2021, 185, 103076.
33. Hubauer, T.; Lamparter, S.; Haase, P.; Herzig, D.M. Use cases of the industrial knowledge graph at siemens. In Proceedings of the ISWC (P&D/Industry/BlueSky), Monterey, CA, USA, 8–12 October 2018.
34. Lin, J.; Zhao, Y.; Huang, W.; Liu, C.; Pu, H. Domain knowledge graph-based research progress of knowledge representation. Neural Comput. Appl. 2021, 33, 681–690.
35. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
36. Hu, W.; Cai, X.; Hou, J.; Yi, S.; Lin, Z. GTC: Guided training of ctc towards efficient and accurate scene text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11005–11012.
37. Li, H.; Wang, P.; Shen, C.; Zhang, G. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8610–8617.
38. Li, C.; Liu, W.; Guo, R.; Yin, X.; Jiang, K.; Du, Y.; Du, Y.; Zhu, L.; Lai, B.; Hu, X.; et al. PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System. arXiv 2022, arXiv:2206.03001.
39. Carrasco, R.C. An open-source OCR evaluation tool. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, Madrid, Spain, 19–20 May 2014; pp. 179–184.
40. Sheng, F.; Chen, Z.; Xu, B. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 781–786.
41. Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048.
42. Lee, J.; Park, S.; Baek, J.; Oh, S.J.; Kim, S.; Lee, H. On recognizing texts of arbitrary shapes with 2D self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 546–547.
43. Mishra, A.; Alahari, K.; Jawahar, C. Scene text recognition using higher order language priors. In Proceedings of the British Machine Vision Conference (BMVC), Glasgow, UK, 25–28 November 2012.
44. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464.

Figure 1. Overall process of extracting information from CUADs.

Figure 2. Part of the CDKG: a few representative fields are shown here because the full mapping is too involved.

Figure 3. OCR network architecture: MobileNetV1-Enhanced with feature extraction and text recognition network SVTR.

Figure 4. The text recognition network uses the loss function composition of the GTC strategy, which contains CTC loss and attention loss.

Figure 5. Mutual learning distillation method with loss function containing each network’s own loss and joint loss.

Figure 6. Code text type correction example.

Figure 7. Character text type correction example.

Figure 8. Numeric text type correction example.

Figure 9. A sample from the CUSD-RBS Dataset.

Figure 10. Image pre-processing example.

Figure 11. Training curve plots for HMLKD-SVTR code training on CUSD-RBS Dataset. The two models in the learning method are denoted by Net-1 and Net-2: (a,b) are the losses of Net-1 and Net-2 on their own, respectively; (c) is the DML loss during distillation; and (d) is the accuracy curve during training.

Figure 12. Training curve plots for HMLKD-SVTR character training on CUSD-RBS Dataset. The two models in the learning method are denoted by Net-1 and Net-2: (a,b) are the losses of Net-1 and Net-2 on their own, respectively; (c) is the DML loss during distillation; and (d) is the accuracy curve during training.

Figure 13. Training curve plots for HMLKD-SVTR numeric training on CUSD-RBS Dataset. The two models in the learning method are denoted by Net-1 and Net-2: (a,b) are the losses of Net-1 and Net-2 on their own, respectively; (c) is the DML loss during distillation; and (d) is the accuracy curve during training.

Table 1. Comparison of OCR models on CUSD-RBS Dataset.

Method	Code Text FER	Code Text Inference Time (s/image)	Character Text FER	Character Text Inference Time (s/image)	Numeric Text FER	Numeric Text Inference Time (s/image)	Parameters (M)
CRNN [11]	100%	0.82	100%	1.04	100%	0.65	8.3
NRTR [40]	85.42%	7.72	85.07%	8.42	90.48%	4.99	31.7
ASTER [41]	97.92%	2.76	85.07%	2.80	95.24%	2.49	27.2
SAR [37]	95.83%	3.60	91.04%	3.77	66.67%	3.25	57.5
SATRN [42]	87.50%	7.40	85.07%	7.86	61.90%	5.22	48.6
SVTR [15]	55.57%	1.13	86.67%	1.51	30.56%	0.83	20.2
HMLKD-SVTR_code	13.89%	1.11	-	-	-	-	20.2
HMLKD-SVTR_char	-	-	28.34%	1.25	-	-	16.9
HMLKD-SVTR_num	-	-	-	-	8.34%	0.83	20.2

Table 2. Training configuration.

Environment	Item	Configuration
Hardware	CPU	AMD Ryzen 7 4800H, 2.90 GHz
Hardware	GPU	NVIDIA GeForce RTX 2060
Hardware	Memory	16 GB RAM
Software	Operating System	Microsoft Windows 10
Software	Development Platform	PyCharm 2022.1.2
Software	Language	Python 3.8

Table 3. Training parameters.

Parameters	Value
Optimizer	Adam
Learning rate decay strategy	Piecewise
Initial learning rate	$5 \times 10^{- 4}$
Batch size	16
$ω_{1}$	0.5
$ω_{2}$	1

Table 4. Comparison of performance with/without min-Dict for HMLKD-SVTR_code on CUSD-RBS Dataset.

HMLKD-SVTR_Code	FER	Parameters (M)	Inference Speed (s/image)
HMLKD with min-Dict	41.70%	16.9	1.10
HMLKD without min-Dict	13.89%	20.2	1.11

Table 5. Comparison of performance with/without min-Dict for HMLKD-SVTR_character on CUSD-RBS Dataset.

HMLKD-SVTR_Character	FER	Parameters (M)	Inference Speed (s/image)
HMLKD with min-Dict	28.34%	16.9	1.25
HMLKD without min-Dict	31.70%	20.2	1.49

Table 6. Comparison of performance with/without min-Dict for HMLKD-SVTR_numeric on CUSD-RBS Dataset.

HMLKD-SVTR_Numeric	FER	Parameters (M)	Inference Speed (s/image)
HMLKD with min-Dict	66.67%	16.8	0.82
HMLKD without min-Dict	8.34%	20.2	0.83

Table 7. Performance comparison of HMLKD method on public datasets.

Method	FER (IIIT5K)	FER (SVT)	Parameters (M) (IIIT5K)	Parameters (M) (SVT)	Inference Speed (s/image) (IIIT5K)	Inference Speed (s/image) (SVT)
Without distillation	51.26%	38.84%	20.24	20.24	0.499	0.148
Distillation without min-Dict	6.25%	1.01%	20.24	20.24	0.483	0.150
Distillation with min-Dict	6.25%	3.13%	16.99	16.96	0.099	0.137

Table 8. Corrected performance of code text type fields on CUSD-RBS Dataset.

Method	FER	CR
HMLKD-SVTR without iCDKG-PostOCR	13.89%	98.36%
HMLKD-SVTR with iCDKG-PostOCR	2.30%	99.17%

Table 9. Corrected performance of character text type fields on CUSD-RBS Dataset.

Method	FER	CR
HMLKD-SVTR without iCDKG-PostOCR	28.34%	100%
HMLKD-SVTR with iCDKG-PostOCR	3.45%	100%

Table 10. Corrected performance of numeric text type fields on CUSD-RBS Dataset.

Method	FER	CR
HMLKD-SVTR without iCDKG-PostOCR	8.34%	99.38%
HMLKD-SVTR with iCDKG-PostOCR	4.00%	99.38%

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
