Structured Element Extraction from Official Documents Based on BERT-CRF and Knowledge Graph-Enhanced Retrieval

Chen, Siyuan; Niu, Liyuan; Li, Jinning; Zhu, Xiaomin; Zhuang, Xuebin; Ye, Yanqing

doi:10.3390/math13172779


by Siyuan Chen ¹,², Liyuan Niu ², Jinning Li ¹,², Xiaomin Zhu ², Xuebin Zhuang ¹ and Yanqing Ye ²,*

¹ School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou 510220, China
² Strategic Assessment and Consultation Institute, Military Academy of Sciences, Beijing 100071, China
* Author to whom correspondence should be addressed.

Mathematics 2025, 13(17), 2779; https://doi.org/10.3390/math13172779

Submission received: 23 July 2025 / Revised: 11 August 2025 / Accepted: 22 August 2025 / Published: 29 August 2025

(This article belongs to the Special Issue Exploring Statistical Learning: Inference, Optimization, and Real-World Applications)


Abstract

The growth of e-government has rendered automated element extraction from official documents a critical bottleneck for administrative efficiency. The core challenge lies in unifying deep semantic understanding with the structured domain knowledge required to interpret complex formats and specialized terminology. To address the limitations of existing methods, we propose a hybrid framework. Our approach leverages a BERT-CRF model for robust sequence labeling, a knowledge graph (KG)-driven retrieval system to ground the model in verifiable facts, and a large language model (LLM) as a reasoning engine to resolve ambiguities and identify complex relationships. Validated on the DovDoc-CN dataset, our framework achieves a macro-average F1 score of 0.850, outperforming the BiLSTM-CRF baseline by 2.41 percentage points, and demonstrates high consistency, with a weighted F1 score of 0.984. The low standard deviation in the validation set further indicates the model’s stable performance across different subsets. These results confirm that our integrated approach provides an efficient and reliable solution for intelligent document processing, effectively handling the format diversity and specialized knowledge characteristic of government documents.

Keywords: document element extraction; BERT-CRF; knowledge graph; hybrid retrieval; large language model

MSC: 68T50; 68T30; 68P20; 68T07

1. Introduction

Official documents serve as a core medium for interdepartmental collaboration, performing essential functions such as policy communication, business approval, and information filing. As government administration modernizes, the volume of governmental information has grown rapidly, increasing the demand for document management and structured processing. Particularly in the context of archive digitization, extracting information from unstructured texts using big data and artificial intelligence technologies has become a crucial problem [1]. In the field of natural language processing (NLP), the task of identifying and classifying key elements within text is formally known as named entity recognition (NER), which serves as a foundational step for subsequent intelligent applications.

To tackle this challenge, various approaches have been explored. However, they face significant limitations when dealing with the complexity of official documents [2]. Traditional rule-based methods are costly to maintain and highly sensitive to the diverse layouts of official documents [3,4]. While statistical learning and early deep learning models like BiLSTM-CRF offer more flexibility, they often struggle to model complex dependencies in text and show poor performance on domain-specific terminology and nested sentence structures [5]. Even state-of-the-art pretrained models such as BERT, despite their powerful contextual understanding, often underperform in specialized contexts due to a lack of explicit domain knowledge integration [6]. This review of the state of the art reveals a critical gap: a need for a framework that can simultaneously handle format variability, understand specialized terminology, and leverage structured domain knowledge.

To bridge this gap, we propose a novel hybrid framework for government document element extraction that tightly integrates the BERT-CRF sequence labeling model with knowledge graph (KG)-driven retrieval. Unlike prior work that applies these techniques in isolation, our design explicitly couples deep contextual semantic modeling (BERT), structural sequence constraints (CRF), and domain knowledge grounding (KG) within a unified pipeline. The main contributions of this paper are threefold:

We introduce a hybrid framework that combines a robust BERT-CRF model for sequence labeling with a knowledge graph (KG)-driven retrieval system and a large language model (LLM) for reasoning, overcoming the challenges of format heterogeneity and specialized term recognition.
Extensive experiments on the DovDoc-CN dataset show that our framework achieves a macro-average F1 score of 0.850, surpassing the BiLSTM-CRF baseline by 2.41 percentage points. The framework also demonstrates high consistency, with a weighted F1 score of 0.984, highlighting its robustness and stability.
This integrated approach provides a scalable and effective solution for automated element extraction across various domain-specific and format-diverse text collections, such as legal, financial, and archival documents, offering strong potential for intelligent document processing in government sectors.

The remainder of this paper is organized as follows. Section 2 provides a detailed review of related work. Section 3 elaborates on our proposed methodology, covering the BERT-CRF model for entity recognition, the construction of the domain knowledge graph, and the hybrid retrieval-enhanced generation process. Section 4 presents the experimental setup, datasets, and a comprehensive analysis of the results. Finally, Section 5 concludes the paper and outlines potential directions for future research.

2. Related Work

The extraction of elements from official documents is a specialized form of information extraction that faces significant challenges, including format heterogeneity, contextual ambiguity, and domain-specific terminology [7,8]. To address these issues, our work builds upon established advancements in two key areas of natural language processing: sequence labeling models, particularly BERT-CRF, and knowledge-enhanced extraction using knowledge graphs (KG).

2.1. Sequence Labeling for Element Extraction with BERT-CRF

Framing element extraction as a named entity recognition (NER) task is a common and effective approach. Early methods relied on manually defined rules and templates [3,4], which performed adequately for documents with stable formats but required costly maintenance when structures changed. To improve flexibility, statistical machine learning models such as conditional random fields (CRFs) were introduced, but their performance depended heavily on hand-crafted features, and they struggled with complex, non-contiguous dependencies in text [4,9].

The advent of deep learning brought models like BiLSTM-CRF, which could automatically learn sequence features, marking a significant improvement [5]. However, these models still faced difficulties with long-range dependencies and in recognizing professional terminology not seen in their training data. A major breakthrough came with the introduction of Bidirectional Encoder Representations from Transformers (BERT) [6], which leverages large-scale pretraining to generate powerful, deep contextual embeddings. The combination of BERT with a CRF layer has become a state-of-the-art baseline for NER tasks. In the BERT-CRF architecture, the BERT encoder provides strong semantic representations for each token, while the CRF layer models the dependencies between labels, ensuring that the output is a valid sequence of tags [10]. Despite its strengths, the standard BERT-CRF model still has a critical limitation in highly specialized domains: its performance is constrained by the knowledge contained within its pretraining corpora, making it less effective in identifying niche or newly emerging professional terms [8].

Additionally, related work by Lv et al. introduced a BERT-BiGRU-CRF model for entity relationship extraction. This model inserts a bidirectional gated recurrent unit (BiGRU) layer between the BERT encoder and the final CRF layer to further model sequential features from BERT’s deep contextual embeddings [11]. While this architecture has demonstrated strong performance on product review datasets, our research investigates a more direct and computationally efficient BERT-CRF architecture for the domain of official documents. Our hypothesis is that the deep contextualization already provided by BERT’s Transformer architecture is sufficient, and that the CRF layer can model the label dependencies directly from these embeddings without an intermediate recurrent layer. This streamlined approach potentially offers a better balance of performance and efficiency for the specific structural and semantic characteristics of official documents.

2.2. Knowledge Graph (KG)-Enhanced Information Extraction

To address the knowledge gap inherent in pretrained language models, knowledge-enhanced approaches have emerged. Knowledge graphs (KGs) are particularly well suited for this, as they provide a structured way to store, retrieve, and reason over domain-specific concepts and their relationships [12,13]. Integrating external knowledge can significantly boost the performance of extraction tasks. For example, frameworks like LightRAG have shown that incorporating domain knowledge into language models improves the recognition of specialized entities [14].

However, the effective integration of KGs is non-trivial. Some approaches struggle with parsing complex document structures, while the construction and maintenance of a comprehensive KG for niche domains can be a significant challenge in itself. The primary opportunity, therefore, lies in using KGs not just as a static repository but as a dynamic source of evidence to guide the extraction process, providing explicit reasoning for entity disambiguation and relation inference where textual context alone is insufficient.

2.3. Positioning Our Work

In summary, the literature shows two parallel streams of progress: (1) the continuous evolution of sequence labeling models, in which combining the deep contextual understanding of BERT with the sequence constraints of CRF has been proven to be a highly effective technical path [15], and (2) the use of knowledge graphs (KGs) to inject domain knowledge into natural language processing (NLP) tasks [16]. However, few studies have systematically combined these two approaches to tackle the dual challenges of layout variability and domain-specific terminology in official document extraction.

Our work is positioned at the intersection of these two streams. We do not just use BERT-CRF as an extractor; we augment it with a dynamic, KG-driven retrieval mechanism. This explicit fusion of deep contextual understanding (from BERT-CRF) with structured domain reasoning (from the KG) is the core novelty of our approach. It creates a unified framework that simultaneously addresses the semantic, structural, and knowledge-driven aspects of the problem, offering a more robust and adaptable solution than either approach could in isolation.

3. Methodology

3.1. Method Framework

This paper proposes a document feature extraction framework (BERT-CRF-KG) that integrates BERT-CRF with knowledge graph-enhanced retrieval. Its overall architecture is shown in Figure 1.

The process begins with entity recognition using deep learning models, followed by the extraction of relationships between entities using large language models. These two steps support the creation of a domain-specific knowledge graph that establishes semantic associations. Finally, a hybrid retrieval mechanism is used to optimize the generation process. The core innovation is the development of a dual-channel retrieval enhancement mechanism, which combines the logical reasoning capabilities of the knowledge graph with the semantic generalization ability of the vector database. This approach addresses three major challenges: format heterogeneity, contextual deficiency, and the recognition of specialized terminology.

To formally define and elucidate the hybrid retrieval mechanism proposed in this study, we provide its formal description in Algorithm 1. The core of this mechanism lies in the parallel processing of structured queries to the knowledge graph and semantic queries to the vector database, allowing for the synergistic acquisition of both relational and textual contexts. This heterogeneous contextual information is subsequently merged and deduplicated, and it is ultimately used to construct an augmented prompt submitted to the large language model (LLM) to generate a final response with both depth and breadth.

Algorithm 1: Hybrid Retrieval Mechanism
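The parallel retrieval and merge steps described for Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation. The in-memory `KG_TRIPLES` and `DOC_CHUNKS` stores, the function names, and the keyword-overlap matching (standing in for embedding similarity) are all assumptions made for the example. Prompt assembly and the LLM call are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the two retrieval back ends.
KG_TRIPLES = [
    ("Issuing Authority", "issues", "Document Number"),
    ("Subject", "maps_to", "Main Text Content"),
]
DOC_CHUNKS = [
    "The issuing authority assigns a document number to each notice.",
    "The main text elaborates on the subject stated in the title.",
]


def query_kg(keywords):
    """Structured channel: return triples whose head or tail matches a keyword."""
    kws = {k.lower() for k in keywords}
    return [
        f"{head} -[{rel}]-> {tail}"
        for head, rel, tail in KG_TRIPLES
        if head.lower() in kws or tail.lower() in kws
    ]


def query_vectors(keywords):
    """Semantic channel (toy): keyword overlap stands in for vector similarity."""
    kws = [k.lower() for k in keywords]
    return [chunk for chunk in DOC_CHUNKS if any(k in chunk.lower() for k in kws)]


def merge_dedupe(*result_lists):
    """Merge contexts in order, dropping exact duplicates."""
    seen, merged = set(), []
    for results in result_lists:
        for item in results:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def hybrid_retrieve(keywords):
    """Query both channels concurrently and return one deduplicated context list."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        kg_future = pool.submit(query_kg, keywords)
        vec_future = pool.submit(query_vectors, keywords)
        return merge_dedupe(kg_future.result(), vec_future.result())


context = hybrid_retrieve(["issuing authority", "document number"])
```

Running the two channels in a thread pool mirrors the "parallel processing" wording of the text. In a real deployment the two calls would be I/O-bound requests to a graph database and a vector store.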

3.1.1. Core Concept Definition

Document entities are the smallest informational units with independent semantic value, encompassing 10 core elements: issuing authority logo, document number, title of main text, primary addressee, primary heading, secondary heading, tertiary heading, issuing authority, date of creation, and the body of the text.

Entity relationships describe the semantic associations between entities, such as the issuance relationship “Issuing Authority → Document Number” and the semantic mapping relationship “Subject → Main Text Content”, as shown in Table 1.

Hybrid retrieval is a dual-channel mechanism that simultaneously leverages knowledge graphs (structured retrieval) and vector databases (semantic retrieval) to acquire contextual information.

3.1.2. System Flow

Entity Extraction and Knowledge Construction: Entity recognition is performed using the BERT-CRF model (Section 3.2.1), followed by relationship extraction with a large language model (Section 3.2.2), culminating in the construction of a dynamically updated document knowledge graph (Section 3.2.3).
Hybrid Retrieval Enhancement: Hierarchical keyword extraction techniques are employed to decompose query intent, and both graph-based relationship retrieval and document semantic retrieval are executed in parallel (Section 3.3.1).
Dynamic Response Generation: Dynamic prompt templates are constructed based on the retrieval results, with structured responses generated by the large language model, complemented by a cache optimization mechanism (Section 3.3.2).

3.2. Document Information Extraction Based on BERT-CRF

To address the problem of extracting structured element information from documents, this paper constructs an information element extraction framework based on BERT-CRF and knowledge graph retrieval enhancement, as illustrated in Figure 1.

3.2.1. Document Entity Recognition

The elements of a document include the issuing authority logo, document number, title, main text, issuing organization, issuing date, copied units, keywords, etc. These elements typically exist in the form of entities within the document text. To effectively recognize these entities, this paper proposes a BERT-CRF model architecture, which combines BERT and CRF, in order to accurately extract various informational entities from the document.

BERT [6] is built on the Transformer encoder, whose self-attention conditions each token on its left and right context jointly rather than reading the text in a single direction. This bidirectional capability allows BERT to capture context-specific relationships both locally and globally, thereby improving the precision of document element recognition. Figure 2a offers a visual representation of the BERT model, detailing its fundamental structure.

CRF [10] is a discriminative sequence labeling model that has found widespread application in named entity recognition (NER) tasks. Unlike traditional models, CRF effectively models the dependencies between labels, which helps to mitigate prediction errors typically arising from label independence assumptions. In the context of document entity recognition, CRF ensures the integrity and correctness of entity labels, preventing common errors such as incorrectly splitting entities, like place names, into multiple segments. By optimizing the conditional dependencies between labels, CRF significantly enhances the overall accuracy of the entity recognition process. This is demonstrated in Figure 2b, which depicts an example of an entity labeling sequence, illustrating its practical application.
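To illustrate how modeling label dependencies prevents split or malformed entities, the sketch below compares per-token argmax decoding with Viterbi decoding under BIO transition constraints. It is a simplified stand-in: a trained CRF learns real-valued transition scores, whereas here the transitions are hard constraints. The label set (`ORG`, `DATE`) and the emission scores are assumptions chosen for the example.

```python
import math

LABELS = ["O", "B-ORG", "I-ORG", "B-DATE", "I-DATE"]  # hypothetical label set
NEG_INF = -math.inf


def transition_score(prev, curr):
    """Hard BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        entity = curr[2:]
        if prev not in (f"B-{entity}", f"I-{entity}"):
            return NEG_INF
    return 0.0


def viterbi(emissions):
    """Return the best label path; emissions is a list of {label: score} dicts."""
    # A sequence may not start with an I- tag.
    best = {
        lab: (NEG_INF if lab.startswith("I-") else emissions[0][lab])
        for lab in LABELS
    }
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for curr in LABELS:
            prev = max(LABELS, key=lambda p: best[p] + transition_score(p, curr))
            new_best[curr] = best[prev] + transition_score(prev, curr) + scores[curr]
            pointers[curr] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda lab: best[lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative emission scores for a three-token sequence.
emissions = [
    {"O": 0.1, "B-ORG": 2.0, "I-ORG": 0.0, "B-DATE": 0.0, "I-DATE": 0.0},
    {"O": 0.0, "B-ORG": 0.0, "I-ORG": 1.0, "B-DATE": 0.0, "I-DATE": 1.2},
    {"O": 1.5, "B-ORG": 0.0, "I-ORG": 0.0, "B-DATE": 0.0, "I-DATE": 0.0},
]

greedy = [max(scores, key=scores.get) for scores in emissions]
constrained = viterbi(emissions)
```

With these scores, independent per-token decoding picks `I-DATE` directly after `B-ORG`, which is not a valid BIO sequence, while the constrained decoder keeps the entity intact as `B-ORG`, `I-ORG`.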

By combining BERT’s language understanding capabilities with the sequence labeling abilities of CRF, the BERT-CRF model improves both accuracy and stability in this NER task. As illustrated in Figure 2c, this combined architecture is particularly effective in capturing the complex dependencies between document elements, thereby improving the precision of their identification. The detailed training process is described in Section 4.4.

3.2.2. Document Entity Relationship Recognition

In the domain of official documents, entity relationship recognition is a critical step in extracting entities and the relationships between them from unstructured text. To execute practical tasks, this study utilizes a BERT-CRF-based model for entity recognition. Once entities are identified, the model further extracts the relationships between entities using large language models (LLM) and domain-specific prompts. These prompts guide the generation model to identify potential entity relationships within the text in a templated manner. For instance, the entity “issuing authority” may have a specific relationship with the “issuance date” in the context of document publication. By extracting such statements from the text and incorporating contextual information, the model can define and recognize more complex relationships. After identifying entities and their relationships, the information is stored in a predefined output format. The relationship data are also formatted and stored in a standardized data structure, facilitating subsequent knowledge graph construction and retrieval.
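A possible shape for the templated prompting and standardized storage step is sketched below. The prompt wording, the JSON output format, and the function names are hypothetical; the paper does not specify them. The sketch only shows how LLM output could be validated against the entities already recognized by BERT-CRF before it is stored.

```python
import json

# Hypothetical prompt template; the authors' actual prompt is not given.
RELATION_PROMPT = (
    "You are given entities extracted from an official document.\n"
    "Entities: {entities}\n"
    "Text: {text}\n"
    'Return a JSON list of objects with the keys "head", "relation", "tail".'
)


def build_relation_prompt(entities, text):
    """Fill the template with the recognized entities and the source text."""
    return RELATION_PROMPT.format(entities=", ".join(entities), text=text)


def parse_relations(llm_output, known_entities):
    """Keep only well-formed triples whose endpoints are recognized entities."""
    try:
        records = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(records, list):
        return []
    triples = []
    for rec in records:
        if not isinstance(rec, dict):
            continue
        head, rel, tail = rec.get("head"), rec.get("relation"), rec.get("tail")
        if not all(isinstance(v, str) and v for v in (head, rel, tail)):
            continue
        if head in known_entities and tail in known_entities:
            triples.append({"head": head, "relation": rel, "tail": tail})
    return triples


entities = {"Issuing Authority", "Issuance Date"}
prompt = build_relation_prompt(sorted(entities), "(document text)")

# Stubbed model response for illustration only; no LLM is called here.
stub_response = json.dumps([
    {"head": "Issuing Authority", "relation": "published_on", "tail": "Issuance Date"},
    {"head": "Unrecognized Entity", "relation": "mentions", "tail": "Issuance Date"},
])
relations = parse_relations(stub_response, entities)
```

Filtering on `known_entities` discards relations whose endpoints were not produced by the entity recognizer, which keeps the stored triples consistent with the nodes later inserted into the graph.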

3.2.3. Knowledge Graph Construction and Refinement

The construction of a knowledge graph in the domain of official documents begins with the foundational step of recognizing entity relationships. By extracting entities and their interrelationships from unstructured text, our framework transforms this information into a structured graph database. In this database, each extracted entity is represented as a node, and the relationships between them are represented as edges, forming the core structure of the knowledge graph. Each node contains key information, such as its type, a description, and the source document ID, ensuring the data’s integrity and traceability.

The fundamental logic for populating the graph is detailed in Algorithm 2. When processing a newly extracted entity or relationship, the system first checks for its existence in the database. If it already exists, its descriptive information is updated (e.g., by merging new details); otherwise, a new node or edge is created and inserted. During this process, we also attach supplementary metadata to relationships, such as weights or keywords, to enhance their analytical depth.

Algorithm 2: Knowledge Graph Construction Algorithm
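The check-then-update-or-insert logic described for Algorithm 2 can be sketched with an in-memory graph. This is an assumption-laden illustration: the class and field names are invented for the example, descriptions are merged by simple list accumulation, and edge weights are accumulated additively, none of which is stated in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    type: str
    descriptions: list = field(default_factory=list)
    sources: set = field(default_factory=set)   # source document IDs


@dataclass
class Edge:
    head: str
    relation: str
    tail: str
    weight: float = 0.0
    keywords: set = field(default_factory=set)
    sources: set = field(default_factory=set)


class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}   # name -> Node
        self.edges = {}   # (head, relation, tail) -> Edge

    def upsert_node(self, name, type_, description, doc_id):
        """Create the node if absent; otherwise merge the new description."""
        node = self.nodes.get(name)
        if node is None:
            node = self.nodes[name] = Node(name, type_)
        if description not in node.descriptions:
            node.descriptions.append(description)
        node.sources.add(doc_id)
        return node

    def upsert_edge(self, head, relation, tail, doc_id, weight=1.0, keywords=()):
        """Create the edge if absent; otherwise accumulate weight and keywords."""
        key = (head, relation, tail)
        edge = self.edges.get(key)
        if edge is None:
            edge = self.edges[key] = Edge(head, relation, tail)
        edge.weight += weight          # additive weighting is an assumption
        edge.keywords.update(keywords)
        edge.sources.add(doc_id)
        return edge


kg = KnowledgeGraph()
kg.upsert_node("Issuing Authority", "entity", "Body that issues the document.", "doc-1")
kg.upsert_node("Issuing Authority", "entity", "Appears in the document header.", "doc-2")
kg.upsert_node("Document Number", "entity", "Identifier assigned on issuance.", "doc-1")
kg.upsert_edge("Issuing Authority", "issues", "Document Number", "doc-1", keywords=["issuance"])
kg.upsert_edge("Issuing Authority", "issues", "Document Number", "doc-2")
```

Recording `sources` on every node and edge is what makes the traceability described in the text possible, since each element can be traced back to the documents that support it.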

Beyond this fundamental construction process, ensuring the long-term quality and consistency of the knowledge graph requires robust strategies for refinement and maintenance. Our framework integrates several key mechanisms to handle the noise, conflicts, and data duplication that can arise from processing numerous documents.

Conflict and Noise Resolution: To address the issue of conflicting entity descriptions arising from multi-document extraction, this framework introduces an automated, LLM-based summarization and fusion strategy. Specifically, all conflicting descriptions are passed to an LLM guided by a dedicated prompt. The model is instructed to “resolve the contradictions and provide a single, coherent summary”, thereby creating a unified and noise-reduced entity description.
Graph Cleaning and Integrity Maintenance: The framework employs a deliberate deletion protocol to maintain graph integrity. When a source document is removed, its associated graph elements are not immediately deleted. Instead, the system checks whether these entities and relationships are still supported by other documents. An element is only expunged from the graph when all its documentary evidence has been removed—a mechanism that effectively prevents the inadvertent loss of valid information.
Data Deduplication: Redundancy is managed at the data ingestion phase. The system utilizes content-based checks to filter out duplicate documents, thereby avoiding redundant storage and processing. In the query response phase, contexts retrieved from the knowledge graph and vector search are also deduplicated to ensure that the context provided to the LLM is both concise and information-dense.
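The deletion protocol and the ingestion-time deduplication can be illustrated together with a small provenance index. The sketch assumes that each graph element stores the set of documents supporting it and that duplicate documents are detected with a SHA-256 hash of their content. The class name and the hashing choice are assumptions, not details from the paper.

```python
import hashlib


class EvidenceIndex:
    """Track which source documents support each graph element."""

    def __init__(self):
        self.support = {}      # element id -> set of supporting doc ids
        self.doc_hashes = {}   # doc id -> content hash

    def ingest(self, doc_id, text, elements):
        """Register a document; return False if its content is a duplicate."""
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in self.doc_hashes.values():
            return False
        self.doc_hashes[doc_id] = digest
        for element in elements:
            self.support.setdefault(element, set()).add(doc_id)
        return True

    def remove_document(self, doc_id):
        """Withdraw doc_id as evidence; expunge only unsupported elements."""
        self.doc_hashes.pop(doc_id, None)
        expunged = []
        for element in list(self.support):
            self.support[element].discard(doc_id)
            if not self.support[element]:
                del self.support[element]
                expunged.append(element)
        return expunged


index = EvidenceIndex()
index.ingest("doc-1", "Notice on archive digitization.", ["node:A", "node:B"])
index.ingest("doc-2", "Follow-up notice.", ["node:B"])
duplicate_accepted = index.ingest("doc-3", "Notice on archive digitization.", ["node:C"])

removed_first = index.remove_document("doc-1")   # node:B is still backed by doc-2
removed_second = index.remove_document("doc-2")  # node:B now loses its last source
```

In this example `node:B` survives the removal of `doc-1` because `doc-2` still supports it, and is expunged only once `doc-2` is removed as well.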

To better illustrate the knowledge graph resulting from this process, we present examples constructed from ten randomly selected government documents, as shown in Figure 3 and Figure 4. These graphs highlight the interconnections between different entities, emphasizing their hierarchical relationships and the diversity of entity attributes.

By constructing and maintaining a high-quality knowledge graph, our system enables efficient and reliable querying based on entity names or types. This structured data storage approach provides strong support for the subsequent retrieval enhancement and generation phases, making the entire process of information extraction and analysis more accurate and efficient.

3.3. Retrieval-Enhanced Generation

The core of this research framework is retrieval-augmented generation (RAG) technology, which, through a two-stage “retrieve-then-generate” paradigm, effectively enhances the accuracy and reliability of large language models (LLMs) in handling knowledge-intensive tasks. The entire process, in response to a user query, involves two primary phases: a retrieval phase to gather relevant evidence and a generation phase to synthesize this evidence into a coherent answer. Our key innovation lies in a sophisticated, two-part prompt engineering strategy that governs both phases. The end-to-end workflow is illustrated in Figure 5.

3.3.1. Keyword-Guided Parallel Retrieval

The retrieval process begins with a prompt-driven query analysis. A specialized prompt instructs the large language model (LLM) to deconstruct the user’s query into two distinct types of keywords: high-level keywords, which represent overarching topics or themes, and low-level keywords, which identify specific entities or details.

These extracted keywords then guide a concurrent retrieval process from two heterogeneous data sources, as depicted in Figure 6. This parallel approach ensures efficiency and comprehensiveness.

Knowledge Graph (KG) Retrieval: This path leverages the keywords to query the graph for structured entities and their interrelationships, providing factual and logical support.
Document Database Retrieval: This path performs a semantic search to acquire detailed textual passages relevant to the query, providing rich descriptive context.

3.3.2. Domain-Specific Prompting for Generation

After the retrieved results are merged and deduplicated, a critical step is constructing the augmented prompt for the final response. The key to effectively fusing the heterogeneous data lies in our domain-specific prompt design. These prompts are tailored to the unique characteristics of official documents and guide the LLM in synthesizing information from different sources. Unlike generic prompts that treat all retrieved context as a single block of text, our hybrid retrieval prompt explicitly categorizes the context into two sections: “knowledge graph (KG)” and “document chunks (DC)”.

This distinction is crucial, as it allows the model to leverage the structured, factual relationships from the knowledge graph alongside the detailed, unstructured descriptions from the document chunks. The prompt also, in its response rules, explicitly instructs the LLM on how to handle potential data conflicts and to cite the source of its information in the final answer. To validate the effectiveness of this design, we compare its features against a generic prompt used for standard vector retrieval in Table 2.

This comparison clearly shows that our hybrid prompt provides a richer, more structured context to the LLM. This enables the model to generate more comprehensive, accurate, and well-organized answers, particularly for complex queries that require both factual reasoning (from the KG) and detailed textual evidence (from documents). Ultimately, this augmented prompt, containing the synthesized context, is submitted to the LLM to generate the final, high-quality response.
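A prompt builder with the two explicitly separated context sections might look like the sketch below. The section headings, citation tags, and the wording of the response rules are hypothetical. The paper states only that the prompt separates "knowledge graph (KG)" from "document chunks (DC)", instructs the model on conflict handling, and asks for source citation.

```python
def build_hybrid_prompt(query, kg_facts, doc_chunks):
    """Assemble an augmented prompt with separate KG and DC context sections."""
    kg_section = "\n".join(
        f"[KG-{i}] {fact}" for i, fact in enumerate(kg_facts, 1)
    )
    dc_section = "\n".join(
        f"[DC-{i}] {chunk}" for i, chunk in enumerate(doc_chunks, 1)
    )
    return (
        "## Knowledge Graph (KG)\n"
        f"{kg_section or '(none)'}\n\n"
        "## Document Chunks (DC)\n"
        f"{dc_section or '(none)'}\n\n"
        "## Response Rules\n"
        # The two rules below are illustrative wording, not the authors' prompt.
        "- If KG and DC disagree, state the conflict explicitly.\n"
        "- Cite the supporting item for each claim, e.g. [KG-1] or [DC-2].\n\n"
        f"## Question\n{query}"
    )


prompt = build_hybrid_prompt(
    "Which body issued the notice?",
    kg_facts=["Issuing Authority -[issues]-> Document Number"],
    doc_chunks=["The issuing authority assigns a document number to each notice."],
)
```

Tagging each item with a stable identifier such as `[KG-1]` gives the model something concrete to cite, which is one simple way to support the source-attribution behaviour the text describes.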

4. Experimental Verification

4.1. Purpose and Significance of Experiment

The aim of this experiment is to evaluate a multi-component system, specifically its efficacy in integrating the BERT-CRF model, large language models (LLMs), knowledge graphs (KGs), and retrieval-augmented generation (RAG) for intelligent question-answering concerning official documents [17]. By synergistically combining these diverse components, the system can precisely extract entities and their relationships from official documents. Simultaneously, it capitalizes on knowledge graphs and advanced retrieval generation methods to enhance its information retrieval and generation capacities.

Our hypothesis is that this multi-component system, working collaboratively, can significantly boost the accuracy and contextual relevance of question-answering within the official document domain.

4.2. Review of Research Questions and Experimental Design

This paper addresses the issue of effectively extracting entities and their relationships in the task of document analysis and proposes the integration of knowledge graphs and enhanced retrieval models to improve the relevance and accuracy of answers.

The experiments in this paper mainly focus on the following aspects:

The entity recognition and relation extraction capabilities after combining the BERT-CRF model with LLMs;
The application performance of the RAG model in hybrid retrieval based on knowledge graphs and document data;
Proposed evaluation metrics: answer matching with reference answers, context inclusion of the answer, whether the answer appears in the context, and the overall score (the evaluation metric system is presented in Table 3).

To this end, this paper designs comparative experiments, a summary of which is as follows:

Assessing how the retrieval-augmented generation (RAG) model performs in contrast to answers produced solely by large language models;
Contrasting the BERT-CRF model with the baseline model (BiLSTM-CRF model) to validate the advantages of the BERT-CRF model in named entity recognition tasks.

4.3. Experimental Environment and Datasets Used

4.3.1. Experimental Environment

The experimental environment, used for all subsequent experiments, is detailed in Table 4.

4.3.2. Data Description and Preprocessing

This experiment utilizes two datasets for model training and evaluation: the Chinese Government Document Dataset and the Guangxi Zhuang Autonomous Region Government Information Disclosure Document Dataset [18,19].

The Chinese Government Document Dataset contains 10 entity categories, which include key entities in official documents, such as “Issuing Authority Mark”, “Issuing Date”, and “Document Number”, among others. This dataset has been cleaned and annotated using the BIO tagging scheme, with each document split into several sentences and each word corresponding to an entity label. For model training and stability evaluation, the dataset underwent five-fold cross-validation during the training phase.

To further gauge the model’s generalizability, the Guangxi Zhuang Autonomous Region Government Information Disclosure Document Dataset served as a separate validation set. This dataset contains 5130 entries, each labeled with four entity types: location, organization, product, and time.

In the data preprocessing phase, sentence segmentation was first performed, using the period symbol (“.”) to divide the text into sentences, ensuring that each sentence served as an independent input unit. Long texts exceeding 300 characters were then removed to reduce the computational overhead. Next, a vocabulary was constructed, mapping all words in the dataset to numerical IDs, and words not appearing in the vocabulary were labeled as <UNK>. Finally, label mapping was carried out, converting each entity category label into a corresponding integer ID, ensuring compatibility with the model. Ultimately, the dataset was divided into an 80% training set and a 20% testing set to guarantee the reliability and representativeness of the model evaluation results.
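The preprocessing steps can be sketched end to end as follows. Several details are assumptions: the splitter accepts both the Chinese full stop "。" and the ASCII period, tokens are treated as single characters, and the split is a seeded random shuffle. The 300-character limit, the `<UNK>` token, and the 80/20 ratio come from the text.

```python
import random
import re

MAX_LEN = 300  # sentences longer than this are dropped, as stated in the text


def split_sentences(text, delimiters="。."):
    """Split on sentence-final periods and drop empty fragments."""
    parts = re.split(f"[{re.escape(delimiters)}]", text)
    return [part.strip() for part in parts if part.strip()]


def filter_long(sentences, max_len=MAX_LEN):
    """Remove sentences that exceed the character limit."""
    return [s for s in sentences if len(s) <= max_len]


def build_vocab(sentences):
    """Map each token (here: each character) to an integer ID; 0 is <UNK>."""
    vocab = {"<UNK>": 0}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to IDs, mapping unseen tokens to <UNK>."""
    return [vocab.get(token, vocab["<UNK>"]) for token in sentence]


def build_label_map(labels):
    """Map each distinct label to an integer ID in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle deterministically and split into training and testing parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) - round(len(items) * test_ratio)
    return items[:cut], items[cut:]


sentences = filter_long(split_sentences("Notice issued。Comply now."))
vocab = build_vocab(sentences)
label_map = build_label_map(["O", "B-DATE", "I-DATE", "O"])
train, test = train_test_split(range(10))
```

Note that a pretrained BERT encoder normally uses its own tokenizer and vocabulary, so a hand-built vocabulary like this one would mainly apply to the BiLSTM-CRF baseline.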

4.4. BERT-CRF Model Parameter Settings and Training Process

4.4.1. Model Architecture

The BERT-CRF model used in this study employs the pretrained bert-base-chinese model for the BERT component, with an output dimension of 768 (i.e., the representation of each token is 768-dimensional). By leveraging powerful contextual awareness, it generates contextual embedding representations for each token, effectively capturing the semantic and structural information in the text. The CRF layer, as the core component of the sequence labeling task, is responsible for modeling the conditional dependencies between labels and optimizing the output of the label sequence by maximizing the conditional probability. The output of this layer consists of a series of labels, totaling 22 labels, covering 10 entity categories (e.g., “Issuing Authority Mark”, “Issuing Date”, etc.), with the modeling through the CRF layer further improving the labeling accuracy of the model.

4.4.2. Training Process

During the model training process, the following parameter settings were applied to ensure training stability and efficiency (Table 5).

The final training results of the BERT-CRF model on the complete dataset are shown in Table 6.

4.4.3. Loss Function and Optimization

Regarding the loss function, the CRF layer calculates the model’s loss. When provided with a token sequence and its corresponding label sequence, the CRF layer optimizes the model’s loss by maximizing the conditional log-likelihood. At the end of each training epoch, the loss is propagated backward through the backpropagation algorithm, gradients are computed, and model parameters are updated. During training, to prevent the issue of gradient explosion, gradient clipping is employed, ensuring the stability of the training process.
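For reference, the quantity a linear-chain CRF layer minimizes is the negative conditional log-likelihood: the log of the partition function minus the score of the gold label path. The sketch below computes it with the forward algorithm in pure Python and checks it against brute-force enumeration. It is a simplified illustration: start and end transitions are omitted, the scores are made-up numbers, and backpropagation and gradient clipping are left to the training framework.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def path_score(emissions, transitions, path):
    """Sum of emission and transition scores along one label path."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores of all label paths."""
    n_labels = len(emissions[0])
    alpha = list(emissions[0])
    for scores in emissions[1:]:
        alpha = [
            logsumexp([alpha[p] + transitions[p][c] for p in range(n_labels)])
            + scores[c]
            for c in range(n_labels)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, gold_path):
    """Negative conditional log-likelihood of the gold label path."""
    return log_partition(emissions, transitions) - path_score(
        emissions, transitions, gold_path
    )


def brute_force_log_partition(emissions, transitions):
    """Enumerate every path; only feasible for tiny examples."""
    n_labels, length = len(emissions[0]), len(emissions)
    return logsumexp([
        path_score(emissions, transitions, path)
        for path in itertools.product(range(n_labels), repeat=length)
    ])


# Illustrative scores: 3 tokens, 2 labels.
emissions = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
transitions = [[0.5, -0.2], [-1.0, 0.8]]
loss = crf_nll(emissions, transitions, gold_path=[0, 1, 1])
```

Because the gold path is one of the paths summed in the partition function, the loss is always non-negative, and it approaches zero as the model concentrates probability on the gold sequence.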

4.4.4. Comparison with the BiLSTM Model

To assess whether the BERT-CRF model outperforms the baseline BiLSTM-CRF model, this study conducted a systematic comparison between the two. The BiLSTM-CRF model integrates a bidirectional long short-term memory (BiLSTM) network with a conditional random field (CRF) layer. In this setup, the BiLSTM functions as an encoder that captures sequential context in both directions, while the CRF layer models the dependencies between labels.

Although BiLSTM is effective in handling short-term dependencies and capturing contextual information to some extent, it has certain limitations when dealing with long-range dependencies, particularly in tasks involving complex entity relationships and diverse label dependencies. Consequently, the performance of the BiLSTM-CRF model is constrained in certain tasks, especially when addressing annotation tasks with long-range dependencies and complex label structures (as shown in the comparison results in Figure 7).

In contrast, the BERT-CRF model leverages the powerful contextual representation capabilities of the BERT model. BERT captures deep contextual information in the text through its self-attention mechanism, enabling it to handle long-range dependencies and complex entity relationships more effectively, thus generating high-quality contextual embeddings. This allows the BERT-CRF model to exhibit higher accuracy and stronger generalization capabilities in named entity recognition (NER) tasks. Specifically, the BERT-CRF model significantly outperforms the BiLSTM-CRF model in terms of the F1 score, precision, and recall across multiple cross-validation folds (see Figure 8 and Table 7 and Table 8). Notably, BERT-CRF demonstrates significant performance improvements, particularly in handling complex entity label relationships. This indicates that the BERT-CRF model has a clear advantage in modeling long-range dependencies and complex entity relationships, thereby providing more precise annotation results in practical applications.

Indicator Calculation Notes: The Macro F1 Std and Weighted F1 Std in the cross-validation results quantify model stability by measuring the dispersion of the metrics across five folds, derived from the Macro F1 and Weighted F1 scores, respectively.

Macro F1 Std: This metric reflects the stability of equally category-weighted F1 scores, calculated using the Macro-Avg. (i.e., Macro F1) values across five folds. For BERT-CRF, the five-fold Macro F1 values are $[0.830, 0.870, 0.800, 0.880, 0.870]$. Applying the formula

$\text{Macro F1 Std} = \sqrt{\frac{1}{5} \sum_{i = 1}^{5} {(M_{i} - \bar{M})}^{2}}$

where $\bar{M}$ denotes the mean of the five-fold Macro F1 scores and $M_{i}$ represents the score of the $i$-th fold, yields $0.030$ for BERT-CRF and $0.094$ for BiLSTM-CRF. This confirms the superior cross-fold stability of BERT-CRF.

Weighted F1 Std: This metric reflects the stability of sample size-weighted F1 scores, calculated using the Weighted Avg. (i.e., Weighted F1) values across five folds. For BERT-CRF, the five-fold Weighted F1 values are $[0.980, 0.990, 0.980, 0.990, 0.980]$. Applying the formula

$\text{Weighted F1 Std} = \sqrt{\frac{1}{5} \sum_{i = 1}^{5} {(W_{i} - \bar{W})}^{2}}$

where $\bar{W}$ denotes the mean of the five-fold Weighted F1 scores and $W_{i}$ represents the score of the $i$-th fold, yields $0.005$ for BERT-CRF and $0.014$ for BiLSTM-CRF. This verifies that BERT-CRF is less sensitive to sample distribution variations, exhibiting stronger robustness.
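As a quick check of the two formulas, the BERT-CRF fold values quoted above can be passed to the population standard deviation (division by 5, not 4) from Python's standard library. The variable and function names below are illustrative.

```python
import statistics

# Five-fold BERT-CRF scores as quoted in the text.
bert_crf_macro_f1 = [0.830, 0.870, 0.800, 0.880, 0.870]
bert_crf_weighted_f1 = [0.980, 0.990, 0.980, 0.990, 0.980]


def fold_std(scores):
    """Population standard deviation, matching the 1/5 factor in the formulas."""
    return statistics.pstdev(scores)


macro_std = fold_std(bert_crf_macro_f1)
weighted_std = fold_std(bert_crf_weighted_f1)
print(round(macro_std, 3))     # 0.03
print(round(weighted_std, 3))  # 0.005
```

Note that the sample estimator `statistics.stdev`, which divides by n − 1, gives about 0.034 for the same Macro F1 values, so the choice of estimator should be stated whenever a dispersion figure is reported.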

4.4.5. Validation of Model Generalization Ability

To further validate the generalization ability of the BERT-CRF model, this study evaluates both the BERT-CRF and BiLSTM-CRF models on a completely new dataset, the Government Information Disclosure Documents Dataset of Guangxi Zhuang Autonomous Region. This evaluation assesses the models’ adaptability to different entity categories and their effectiveness in real-world applications. The generalization results of the BERT-CRF model are presented in Table 9.

The experimental results show that the BERT-CRF model performs stably on the new dataset, reaching a micro-average F1 score of 0.866 (Table 9) and handling the various entity categories effectively. Notably, the model maintains strong performance even on entities characterized by complex label structures and long-range dependencies, underscoring its robust contextual understanding and broad adaptability.

In contrast, the BiLSTM-CRF model performs relatively weakly on the new dataset. Although it captures certain sequential dependencies, its effectiveness declines noticeably when confronted with diverse and complex entity categories, particularly those involving long-range dependencies and intricate label structures, where it falls significantly behind the BERT-CRF model.

4.4.6. Summary

This section presented a systematic analysis of our model’s architecture, including an ablation study to verify the contributions of its core components and a comparison against the BiLSTM-CRF baseline. To isolate the influence of the CRF layer, we performed an ablation test comparing our full BERT-CRF model with a BERT-Only variant (i.e., BERT with a linear classification layer). The experimental results on the DovDoc-CN dataset, shown in Figure 9, are clear: the BERT-CRF model outperforms the BERT-Only model across the precision, recall, and F1 scores. This demonstrates that, while BERT provides powerful contextual embeddings, the CRF layer is crucial for modeling label dependencies and ensuring the structural integrity of the final entity predictions. Notably, this performance gain is achieved with a negligible increase in training time. Furthermore, when compared to the BiLSTM-CRF baseline on the same dataset, the BERT-CRF model consistently achieves higher scores, confirming the advantage of BERT’s deep contextual representations for this task. Future research may explore extending and optimizing the model across different tasks and domains to enhance its performance and practical applicability.

4.5. Performance Evaluation and Comparative Analysis of Retrieval-Augmented Generation (RAG) Models

4.5.1. Experimental Design

To comprehensively assess the performance of different retrieval-augmented generation (RAG) architectures, this study conducts a comparative experiment involving three distinct systems.

Our Hybrid RAG System: The model proposed in this study, which integrates retrieval capabilities from both knowledge graphs and unstructured documents, coupled with an enhanced generation mechanism.
Vector-Only RAG System: A baseline RAG model that relies solely on conventional dense vector retrieval from unstructured documents.
Pure LLM (No RAG): A standard large language model generating answers without any external knowledge retrieval, serving as the foundational baseline.

The evaluation utilizes a curated set of questions grounded in a specific knowledge base to compare the quality of the answers from each system. This design allows not only a demonstration of the value of RAG over a pure LLM but also the isolation and quantification of the specific advantages of our hybrid retrieval approach compared to a standard vector-only RAG implementation.
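The evaluation metric system (Table 3) names a keyword-overlap rate (Jaccard) and a linearly weighted 0–100 score with weights 0.4:0.3:0.3. The sketch below shows one way such a score could be computed. It is an assumption-laden illustration, not the authors' scoring code: the function names, the order in which the weights are applied to the components, the expectation that each component lies in [0, 1], and the example values are all hypothetical.

```python
from typing import Iterable, Sequence


def jaccard(answer_keywords: Iterable[str],
            reference_keywords: Iterable[str]) -> float:
    """Keyword overlap rate: |A intersect B| / |A union B|."""
    a, b = set(answer_keywords), set(reference_keywords)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def composite_score(components: Sequence[float],
                    weights: Sequence[float] = (0.4, 0.3, 0.3)) -> float:
    """Linear weighting of components in [0, 1], scaled to 0-100.

    Which component receives which weight is an assumption of this sketch.
    """
    if len(components) != len(weights):
        raise ValueError("components and weights must have the same length")
    return 100.0 * sum(w * c for w, c in zip(weights, components))


# Made-up example values, for illustration only.
overlap = jaccard(
    ["credit", "guarantee", "bonds"],
    ["credit", "guarantee", "bonds", "tax"],
)
score = composite_score([0.9, overlap, 1.0])
print(overlap)          # 0.75
print(round(score, 1))  # about 88.5
```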

4.5.2. Quantitative Analysis

The quantitative evaluation results reveal a distinct performance hierarchy among the three systems, as detailed in Table 10. The proposed Hybrid RAG system achieves an average score of 87.1, significantly outperforming both the Vector-Only RAG (53.5) and the Pure LLM (60.0).

Specifically, the score of our model represents a substantial 62.8% improvement over the conventional Vector-Only RAG, highlighting the effectiveness of integrating knowledge graphs for enhanced retrieval precision and completeness. Notably, the Vector-Only RAG system scored lower than the Pure LLM. This result suggests that a simplistic retrieval mechanism can introduce irrelevant or noisy context, which may in turn degrade the quality of the generated answer. This finding underscores the critical importance of the retrieval quality within the RAG pipeline.
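The percentages quoted here are relative improvements over each baseline's average score. The short check below recomputes them from the three reported averages; the function name is illustrative.

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


hybrid, vector_only, pure_llm = 87.1, 53.5, 60.0  # average scores from Table 10

vs_vector = relative_improvement(hybrid, vector_only)
vs_llm = relative_improvement(hybrid, pure_llm)
print(round(vs_vector, 1))  # 62.8
print(round(vs_llm, 1))     # 45.2
```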

The radar plot in Figure 10 visualizes these findings, comparing the three systems across key metrics such as relevance and completeness.

4.5.3. Qualitative Analysis: A Case Study

To illustrate how this numerical superiority translates into tangible answer quality, an expanded case study is presented in Table 11. The example compares the responses from all three systems for a detailed policy-related query.

The case study in Table 11 clearly illustrates the performance differences. The response from the Pure LLM is generic and contains speculative elements, lacking specific details from the source document. The Vector-Only RAG system’s answer, while identifying some relevant keywords, presents them as a disorganized list and fails to capture the structured, multifaceted nature of the policy. In stark contrast, our Hybrid RAG system precisely retrieves and synthesizes evidence, generating a detailed, clearly structured, and factually accurate response that is directly verifiable through the provided references.

4.5.4. Summary of Findings

The experimental results confirm that the proposed Hybrid RAG model holds significant advantages over both the conventional Vector-Only RAG and Pure LLM approaches. The quantitative improvement, evidenced by a 62.8% higher score than Vector-Only RAG and a 45.2% higher score than the Pure LLM, is qualitatively substantiated by the case study.

These findings indicate that the effectiveness of RAG systems is not uniform and is highly dependent on the sophistication of the retrieval component. A simplistic vector-based retrieval may fail to extract meaningful, structured information and can, in some cases, even degrade the performance below the baseline of a non-augmented LLM. By integrating knowledge graph-based structured data with unstructured text retrieval, our hybrid approach avoids the generic nature of pure LLMs and the superficiality of vector-only RAG. It consistently generates verifiable and well-supported answers, rendering it particularly suitable for high-stakes applications such as intelligent question-answering and policy analysis.

4.6. Latency and Efficiency of Hybrid Retrieval

To evaluate the practical applicability of our system, particularly its performance in near-real-time question-answering scenarios, we analyzed the latency and computational efficiency of the hybrid retrieval mechanism.

4.6.1. Theoretical Complexity Analysis

The end-to-end latency of a hybrid query is composed of three primary stages.

Keyword Extraction: This stage relies on an LLM API call, with its latency primarily determined by the external service’s response time (approximately 500–2000 ms) rather than the local data volume. A caching mechanism is implemented to reduce the overhead of repeated calls for similar queries.
Parallelized Retrieval Module: As the core of the framework, knowledge graph retrieval and vector retrieval are executed concurrently. The overall latency of this stage is governed by the slower of the two paths (approximately 50–500 ms), effectively masking the latency of the faster one.
LLM Response Generation: This stage constitutes the main performance bottleneck, where the latency is proportional to the length of the input prompt and the generated response (approximately 1000–5000 ms).
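The three stages above suggest a simple latency model: the sequential stages add up, while the parallel retrieval stage costs only as much as its slower branch. The sketch below illustrates this with `asyncio.gather`. The retrieval coroutines are placeholders that sleep instead of querying a real knowledge graph or vector store, and all names and timing values are hypothetical, chosen within the ranges quoted above.

```python
import asyncio


def estimate_latency_ms(keyword_ms: float, kg_ms: float,
                        vector_ms: float, generation_ms: float) -> float:
    """Sequential stages add up; the parallel stage costs its slower branch."""
    return keyword_ms + max(kg_ms, vector_ms) + generation_ms


async def kg_search(keywords):
    await asyncio.sleep(0.01)  # placeholder for a knowledge graph query
    return [f"[KG] {k}" for k in keywords]


async def vector_search(keywords):
    await asyncio.sleep(0.02)  # placeholder for a dense vector search
    return [f"[DC] {k}" for k in keywords]


async def hybrid_retrieve(keywords):
    # Both retrieval paths run concurrently; gather waits for the slower one.
    kg_hits, dc_hits = await asyncio.gather(
        kg_search(keywords), vector_search(keywords)
    )
    return {"KG": kg_hits, "DC": dc_hits}


results = asyncio.run(hybrid_retrieve(["SME financing"]))
print(results)

# Illustrative per-stage timings in milliseconds.
print(estimate_latency_ms(1000, 80, 300, 2500))  # 3800
```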

4.6.2. Performance Optimization Strategies

The framework integrates several optimization measures to enhance operational efficiency.

Multi-Layer Caching: A robust caching mechanism is applied to both the keyword extraction results and final LLM responses, significantly reducing the latency for recurring or semantically similar queries.
Asynchronous Parallel Execution: The parallelization of core retrieval tasks minimizes wait times during the data acquisition phase.
Dynamic Context Truncation: Retrieved contexts are dynamically truncated based on a preset maximum token limit, which, while preserving information relevance, effectively reduces the LLM inference costs and time latency.
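Two of these measures, caching and context truncation, can be sketched with the standard library alone. The code below is a simplified illustration under stated assumptions: `extract_keywords` stands in for the LLM API call, `functools.lru_cache` matches only identical query strings (matching semantically similar queries would need an additional similarity lookup), and token counts are approximated by whitespace splitting, which would have to be replaced by the model's tokenizer for Chinese text.

```python
from functools import lru_cache

llm_calls = 0  # counts how often the (simulated) LLM is actually invoked


@lru_cache(maxsize=1024)
def extract_keywords(query: str) -> tuple:
    """Hypothetical stand-in for an LLM keyword-extraction call."""
    global llm_calls
    llm_calls += 1
    return tuple(w.lower() for w in query.split() if len(w) > 3)


def truncate_context(ranked_chunks, max_tokens: int):
    """Keep the highest-ranked chunks that fit within the token budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        n_tokens = len(chunk.split())  # crude whitespace token count
        if used + n_tokens > max_tokens:
            break
        kept.append(chunk)
        used += n_tokens
    return kept


extract_keywords("SME financing support measures")
extract_keywords("SME financing support measures")  # served from the cache
print(llm_calls)  # 1

print(truncate_context(["a b c", "d e", "f g h i"], max_tokens=5))
# ['a b c', 'd e']
```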

4.6.3. Empirical Latency Evaluation

Under baseline testing conditions (i.e., without a cache hit), the system’s end-to-end average response time is approximately 3000–4000 ms. With caching enabled, this latency can be substantially reduced, indicating the system’s viability for many interactive applications.

5. Conclusions

With the acceleration of e-government development, document element extraction technology is playing an increasingly vital role in improving administrative efficiency, supporting policy decision-making, and optimizing information retrieval. While the fusion of deep learning models like BERT-CRF with knowledge graphs has shown promise in other domains, its application to the distinct challenges of government document analysis has been limited. To the best of our knowledge, this paper is the first to systematically propose and validate the application of a BERT-CRF model, enhanced by a knowledge graph, for element extraction specifically within the domain of official government documents. In particular, we focus on the DovDoc-CN dataset, which contains a rich set of government document samples, providing new data support for this area of research. By combining the deep linguistic understanding of the BERT-CRF model with the structured, domain-specific knowledge provided by a purpose-built knowledge graph, our proposed system can efficiently and accurately extract key information from official documents, significantly enhancing both the processing accuracy and efficiency.

When compared to general-purpose models such as BiLSTM-CRF, our domain-specific approach shows significant advantages in handling format heterogeneity and specialized terminology in government documents. General-purpose models often struggle to capture the complex relationships and domain knowledge inherent in government documents, while our model, by integrating a knowledge graph, has enhanced capabilities in contextual modeling and professional term recognition. The experimental results validate the superiority of our proposed approach. Compared to the baseline BiLSTM-CRF model, a representative general-purpose sequence labeling method, our method achieves a macro-average F1 score of 0.850 (±0.034), representing a significant improvement of 0.0241. Moreover, the integration of domain knowledge profoundly enhances the model stability. Across five independent evaluations, the system demonstrates a standard deviation of 0.0055 in accuracy, which is substantially lower than the 0.0217 fluctuation observed in the more generalized baseline model. The weighted average F1 score reaches 0.984, further highlighting the advantages of our method in terms of both stability and precision. This is especially evident in processing heterogeneous and cross-domain documents, where the knowledge graph provides crucial contextual grounding that general-purpose models inherently lack, thereby granting the system greater adaptability and robustness.

The intelligent system developed in this study not only reduces the manual labor costs but also significantly enhances the efficiency of document management, thereby providing strong technical support for the digital transformation of government operations.

Building on the demonstrated success of integrating domain-specific knowledge, future research will further optimize the model architecture by incorporating more hybrid approaches that combine deep learning with expert knowledge bases. This aims to improve the accuracy and applicability of document element extraction in more complex scenarios. In particular, enhancing the generalization ability of the model when faced with increasingly diverse and structurally complex document types will remain a key direction for further work. Additionally, with the development of multimodal technologies, expanding the application scope of document element extraction will be a crucial step toward advancing intelligent government management.

Author Contributions

Conceptualization, Y.Y.; methodology, S.C. and X.Z. (Xiaomin Zhu); software, S.C.; validation, S.C., L.N., J.L. and X.Z. (Xuebin Zhuang); formal analysis, S.C. and X.Z. (Xuebin Zhuang); investigation, S.C.; data curation, S.C.; writing—original draft preparation, S.C. and X.Z. (Xiaomin Zhu); writing—review and editing, L.N., J.L., X.Z. (Xiaomin Zhu) and X.Z. (Xuebin Zhuang); visualization, S.C.; supervision, Y.Y., X.Z. (Xiaomin Zhu) and X.Z. (Xuebin Zhuang); project administration, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Mahadevkar, S.V.; Patil, S.; Kotecha, K.; Soong, L.W.; Choudhury, T. Exploring AI-driven approaches for unstructured document analysis and future horizons. J. Big Data 2024, 11, 92.
Li, W.; Zhou, W.; Lu, B.; Gao, H.; Bian, Y.; Zhang, H.; Na, C.; Xu, W. Official Document Knowledge Graph: Construction and Application. J. Chin. Comput. Syst. 2024, 45, 1281–1291. (In Chinese)
Huang, S.; Wang, B.; Zhu, J. Information Extraction from Financial Announcements Based on Document Structure and Deep Learning. Comput. Eng. Des. 2020, 41, 115–121. (In Chinese)
Zhuang, C.; Zhou, Y.; Ge, J.; Li, Z. Information Extraction from Chinese Judgment Documents. In Proceedings of the 14th Web Information Systems and Applications Conference (WISA), Liuzhou, China, 11–12 November 2017.
Zhang, X. Research on Key Technologies of Knowledge Graph-Based Question Answering. Master’s Thesis, Harbin Institute of Technology, Harbin, China, 2020. (In Chinese)
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019.
Chen, J. Common Issues in the Format of Official Documents Issued by Administrative Organs. Off. Bus. 2023, 1, 4–6. (In Chinese)
Li, X. A Corpus-Based Study on the Linguistic Features of Report-Type Official Documents. Master’s Thesis, Central China Normal University, Wuhan, China, 2019. (In Chinese)
Chan, Z.; Chen, X.; Wang, Y.; Li, J.; Zhang, Z.; Gai, K.; Zhao, D.; Yan, R. Stick to the Facts: Learning Towards a Fidelity-Oriented E-Commerce Product Description Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4959–4968.
Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991.
Lv, J.; Zhou, N.; Du, J.; Xue, Z. BERT-BIGRU-CRF: A Novel Entity Relationship Extraction Model. In Proceedings of the 2020 IEEE International Conference on Knowledge Graph (ICKG), Nanjing, China, 9–11 August 2020; pp. 157–164.
Wu, S. Research on Intelligent Classification Technology of Government Documents Based on Deep Learning. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2020. (In Chinese)
Shen, Z.; Li, Y.; Ding, Q.; Shao, W.; Ma, H. Research on Scientific Policy Text Classification Based on BERT Model. Digit. Libr. Forum 2022, 1, 10–16. (In Chinese)
Guo, Z.; Xia, L.; Yu, Y.; Ao, T.; Huang, C. LightRAG: Simple and Fast Retrieval-Augmented Generation. arXiv 2025, arXiv:2410.05779.
Hu, S.; Zhang, H.; Hu, X.; Du, J. Chinese Named Entity Recognition based on BERT-CRF Model. In Proceedings of the 2022 IEEE/ACIS 22nd International Conference on Computer and Information Science (ICIS), Zhuhai, China, 26–28 June 2022; pp. 105–108.
Kau, A.; He, X.; Astudillo, A.; Nambissan, A.; Yin, H.; Aryani, A. Combining Knowledge Graphs and Large Language Models. arXiv 2024, arXiv:2407.06564.
Linders, J.; Tomczak, J.M. Knowledge Graph-extended Retrieval Augmented Generation for Question Answering. arXiv 2025, arXiv:2504.08893.
Wu, Y.; Xu, R.; Li, Z.; Li, B.; Wang, Y.; Zhang, S. GovDoc-CN: A Multi-Modal Dataset of Chinese Governmental Documents. GitHub Repository. 2025. Available online: https://github.com/RuilinXu/GovDoc-CN (accessed on 21 August 2025).
She, H.; Huang, H.; Yu, Z.; Qin, X.; Lu, S. Chinese Knowledge Graph Dataset. GitHub Repository. 2025. Available online: https://github.com/Echo-she/chinese-knowledge-graph (accessed on 21 August 2025).
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.

Figure 1. System architecture diagram.

Figure 2. Overall diagram of BERT, CRF, and BERT-CRF model architectures. (a) Schematic diagram of the BERT model. (b) Example of an entity annotation sequence for the Chinese text “交通运输部函〔2018〕785号” (Ministry of Transport Letter, No. 785 (2018)). (c) BERT-CRF model architecture diagram.

Figure 3. Dynamic knowledge graph constructed based on entity relationships. In this graph, nodes are colored according to their entity type (e.g., organization, person, location, event). The thickness of an edge represents the strength or frequency of the relationship, with thicker lines indicating stronger or more frequent connections, and thinner lines indicating weaker or less frequent ones.

Figure 4. Entity relationship network example, centered on the “Financial Market Department of the People’s Bank of China” (中国人民银行金融市场司). This graph illustrates how the central financial authority connects with various policy measures and economic entities during specific periods, such as the pandemic. The connecting lines represent the relationships (e.g., issuance, support, regulation) between these entities. Legend for Chinese terms: ‘发行人’ (Issuers), ‘承销机构’ (Underwriting Institutions), ‘双创金融债券’ (“Dual Innovation” Financial Bonds), ‘金融债券’ (Financial Bonds), ‘疫情防控期间’ (During the Pandemic Prevention and Control Period), ‘绿色通道’ (Green Channel, an expedited process), ‘绿色债券’ (Green Bonds), ‘疫情较重地区’ (Severely Pandemic-Affected Areas), ‘延长债券额度有效期’ (Extending the Validity of Bond Quotas), and ‘远程招标方式发行债券’ (Issuing Bonds via Remote Bidding).

Figure 5. Prompt engineering diagram illustrating the overall workflow from document input to final answer generation. The ellipses (…) in the example text indicate that the content has been abbreviated for illustrative purposes.

Figure 6. Enhanced retrieval process diagram, showing keyword extraction followed by parallel retrieval from both global (KG) and local (document) models.

Figure 7. F1 score comparison by label.

Figure 8. Schematic illustration of the K-fold cross-validation process.

Figure 9. Comparison of performance indicators.

Figure 10. Radar plot comparing evaluation metrics across models.

Table 1. Document entity relationship table.

Source Entity	Relationship Type	Target Entity	Relationship Description
Issuing Authority	Issuance Relationship	Document Number	The authority issues the document number through an official process.
Title of Main Text	Subordination Relationship	Main Text	The title summarizes the core content of the main text.
Primary Addressee	Pointing Relationship	Main Text	The main addressee of the document is related to the execution requirements in the main text.
Primary Heading	Hierarchical Relationship	Secondary Heading	The hierarchical progression of headings, breaking down the content.
Secondary Heading	Hierarchical Relationship	Tertiary Heading	Further refinement and classification of content.
Date of Creation	Temporal Relationship	Main Text	Date is related to the timeliness of policies in the main text.
Issuing Authority Logo	Representation Relationship	Issuing Authority	Logo as the official symbol of organizational identity.

Table 2. Comparison of hybrid prompt vs. generic prompt.

Feature	Our Hybrid Prompt	Generic Prompt
Information Sources	Combines structured results from KG and unstructured results from vector search.	Contains only a single source of information (e.g., vector search results).
Data Organization	Explicitly separates context into “knowledge graph (KG)” and “document chunks (DC)”.	Treats all context as a single, undifferentiated block.
Context Management	Designed to handle different formats for different information types.	A single format for all context.
Citation Requirement	Requires the LLM to specify the source (KG or DC) for each piece of information.	Simple citation or no requirement to differentiate sources.
Structural Guidance	Explicitly instructs the LLM to organize the answer into sections.	Fewer requirements for structuring the answer.

Table 3. Evaluation metric system.

Dimension	Specific Metric	Measurement Method
Entity Recognition	Precision/Recall/F1	Strict Boundary Matching
QA Quality	Answer Matching Degree	Keyword Overlap Rate (Jaccard)
Context Relevance	Answer Containment Rate	Boolean Determination
Comprehensive Performance	Weighted Score (0–100)	Linear Weighting (0.4:0.3:0.3)

Table 4. Experimental environment.

Category	Details
Operating System	Windows
Virtual Environment	Anaconda
Hardware Configuration	NVIDIA GTX 1080 Ti
Programming Language	Python 3.7
Deep Learning Framework	TensorFlow 2.11.0; PyTorch 1.10.0
CUDA	11.2

Table 5. Training configuration.

Item	Configuration Description
Batch Size	16, balancing training efficiency and hardware resource limitations
Optimizer	Adam [20]
Initial Learning Rate	3 × 10⁻⁵, experimentally tuned to effectively prevent gradient explosion or vanishing
Training Epochs	5, to ensure that the model fully learns the data features
Learning Rate Strategy	The adaptive adjustment mechanism built into the Adam optimizer, accelerating convergence
Training Method	The training samples are input in batches using a data loader and optimized through the loss function in PyTorch 1.10.0 with CUDA 11.2 support
Training Devices	GPU supported by CUDA, used to accelerate training and handle large-scale data

Table 6. Final training results of the BERT-CRF model.

Entity Label	Precision	Recall	F1 Score	Support
O	1.00	1.00	1.00	25,994
B_Issuing Authority Logo	0.92	0.85	0.88	466
I_Issuing Authority Logo	0.97	0.99	0.98	3798
B_Issuing Authority	0.99	0.98	0.99	573
I_Issuing Authority	1.00	0.99	0.99	3529
B_Issuing Date	0.89	1.00	0.94	123
I_Issuing Date	0.88	1.00	0.94	1099
B_Document Number	0.86	0.75	0.80	449
I_Document Number	0.96	0.99	0.97	5380
B_Main Text Title	0.50	0.52	0.51	427
I_Main Text Title	0.98	0.97	0.97	7409
B_Primary Addressee	0.00	0.00	0.00	18
I_Primary Addressee	0.97	0.28	0.43	107
B_Main Text	0.98	0.97	0.98	3652
I_Main Text	1.00	1.00	1.00	448,626
B_Heading Level 1	0.99	0.99	0.99	1998
I_Heading Level 1	0.99	0.99	0.99	34,568
B_Heading Level 2	0.98	0.98	0.98	1821
I_Heading Level 2	0.99	0.98	0.98	40,292
B_Heading Level 3	0.96	0.98	0.97	862
I_Heading Level 3	0.96	1.00	0.98	23,876
B_Completion Date	1.00	0.95	0.97	343
I_Completion Date	0.98	0.96	0.97	3050
Accuracy	–	–	0.99	608,460
Macro-Average	0.90	0.87	0.88	608,460
Weighted Average	0.99	0.99	0.99	608,460

Table 7. BERT-CRF cross-validation results.

Fold	Accuracy	Macro-Avg.	Weighted Avg.
1	0.980	0.830	0.980
2	0.990	0.870	0.990
3	0.980	0.800	0.980
4	0.990	0.880	0.990
5	0.980	0.870	0.980
Avg.	0.984	0.850	0.984

Table 8. BiLSTM-CRF cross-validation results.

Fold	Accuracy	Macro-Avg.	Weighted Avg.
1	0.410	0.660	0.700
2	0.690	0.690	0.690
3	0.440	0.680	0.690
4	0.710	0.450	0.710
5	0.700	0.510	0.670
Avg.	0.590	0.598	0.692

Table 9. Generalization results of the BERT-CRF model on the Guangxi Government Information Disclosure Dataset.

Entity Label	Precision	Recall	F1 Score	Support
B-BODY	0.890	0.905	0.898	3079
B-DISEASES	0.858	0.863	0.861	1311
B-DRUG	0.942	0.881	0.911	480
B-EXAMINATIONS	0.864	0.846	0.855	345
B-TEST	0.826	0.731	0.777	572
B-TREATMENT	0.869	0.821	0.844	162
I-BODY	0.757	0.861	0.806	2435
I-DISEASES	0.962	0.863	0.910	6858
I-DRUG	0.962	0.881	0.919	1433
I-EXAMINATIONS	0.853	0.875	0.864	833
I-TEST	0.800	0.742	0.770	344
I-TREATMENT	0.962	0.913	0.937	1870
Micro-Avg.	0.875	0.857	0.866	20,992
Macro-Avg.	0.881	0.847	0.869	20,992
Weighted Avg.	0.878	0.857	0.866	20,992

Table 10. Overall performance score comparison.

System	Average Score
Our Hybrid RAG System	87.1
Pure LLM (No RAG)	60.0
Vector-Only RAG System	53.5
Improvement of Hybrid RAG
vs. Pure LLM	+45.2%
vs. Vector-Only RAG	+62.8%

Table 11. Case study: comparison of answers from three different systems.

Query	“What are the specific support measures for SME financing in the ‘Several Opinions of the State Council on Promoting the Development of SMEs’?”
Answer with Our Hybrid RAG System	Based on the provided document, the ‘Several Opinions of the State Council on Promoting the Development of SMEs’ outlines several specific support measures for financing: Broaden Financing Channels: Encourages the development of direct financing for SMEs, supports them to list on the main board and GEM, and allows the issuance of SME collective bonds. Establish a Credit Guarantee System: Pushes for the establishment of a credit guarantee system for SMEs, with central and local governments providing financial support to guarantee institutions to increase their capacity. Incentivize Financial Institutions: Proposes implementing differentiated regulatory policies, such as lowering the risk weight for SME loans and increasing the tolerance for non-performing loans, to encourage commercial banks to lend to SMEs. Tax Incentives: Provides tax exemptions and reductions for qualified guarantee institutions to lower the financing costs for SMEs. References: [DC] Section 3, Item 12: “Broaden direct financing channels for SMEs…” [KG] (SME Financing) –>[supported by]–> (Credit Guarantee System)
Answer with Vector-Only RAG	The document mentions support for SME financing. Measures include using capital markets for financing and providing credit support. The government will support guarantee institutions. Commercial banks are encouraged to lend to SMEs. It also mentions tax policies for these institutions.
Answer without RAG (Pure LLM)	The ‘Several Opinions of the State Council on Promoting the Development of SMEs’ generally encourages financial support for small and medium-sized enterprises. This includes asking banks to increase lending, innovating financial products, and developing multi-level capital markets. It emphasizes solving financing difficulties but does not provide highly specific measures, as these are often detailed in subsequent documents by individual ministries.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
