A Subject-Guided Two-Stage Joint Entity and Relation Extraction Method for Cultural Relic Knowledge Graphs

Song, Yanchao; Yu, Xia; Zhang, Liqian; Zhang, Quanping; Bai, Yunli

doi:10.3390/app16115584


by Yanchao Song ¹, Xia Yu ¹, Liqian Zhang ¹, Quanping Zhang ¹ and Yunli Bai ¹,²,*

¹ College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

² Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot 010018, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5584; https://doi.org/10.3390/app16115584

Submission received: 10 May 2026 / Revised: 30 May 2026 / Accepted: 1 June 2026 / Published: 3 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

To address the challenge of fragmented, unstructured knowledge in the cultural relic domain, where existing entity and relation extraction models suffer from boundary confusion and feature degradation for long entities and overlapping triples, this paper proposes a subject-guided two-stage joint entity and relation extraction model tailored to cultural relic texts, and constructs a Cultural Relic Knowledge Graph System. Building on CasRel’s cascaded labeling framework, we design a Multi-Head Self-Attention Decoder Enhanced by Relative Position Encoding (MHSA-RPE) to explicitly model inter-entity positional relations and alleviate boundary confusion. We further propose a Boundary–Global Dual-Branch Subject Fusion Module (BGDSFM) to encode local boundary and global contextual features in parallel, alleviating feature degradation from simple average pooling. Experiments on DuIE2.0 and our self-built Palace Museum Cultural Relic Entity–Relation Dataset (PM-CRER) show that the proposed model achieves F1-scores of 79.4% and 75.9% respectively. It outperforms mainstream baselines, surpassing its prototype CasRel by 3.6 percentage points on PM-CRER and the latest cascaded state-of-the-art CECRel by 2.6 percentage points on DuIE2.0. Based on this model, a Chinese Cultural Relic Knowledge Graph System supporting the multimodal display of cultural relic images is constructed, providing technical references for the digital protection, dissemination and utilization of cultural relic knowledge.

Keywords:

cultural relic knowledge graph; entity and relation extraction; two-stage joint extraction; relative position encoding; boundary–global dual-branch subject fusion

1. Introduction

Over thousands of years, Chinese civilization has accumulated a vast and diverse collection of cultural relics. These relics not only possess extremely high artistic and historical value but also contain rich cultural semantics and complex knowledge associations. However, the recording and dissemination of cultural relic information has long relied on unstructured carriers such as web texts and archival documents, resulting in the fragmented distribution of a large amount of cultural relic knowledge. Traditional management methods dominated by keyword-based retrieval and manual cataloging struggle to uncover the deep semantic associations between cultural relic entities, which restricts the systematic protection and dissemination of cultural relic knowledge [1]. Therefore, how Natural Language Processing (NLP) technology can be used to automatically and accurately extract structured knowledge from massive cultural relic texts and build a knowledge graph for the cultural relic domain has become a key research topic in the digitalization and intellectualization of cultural heritage [2].

Entity and Relation Extraction (ERE), as the core upstream task of knowledge graph construction, aims to identify entity boundaries in unstructured texts and predict semantic relations between entity pairs, ultimately forming structured triples in the form of (subject, relation, object). In recent years, deep-learning-based joint extraction models have made remarkable progress in this task. Ringwald et al. [3] systematically surveyed the evolution of relation extraction technology since the emergence of the Transformer. Currently, deep-learning-based relation extraction models widely adopt Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs), while BERT-based methods still achieve state-of-the-art performance (Hu et al. [4]; Diaz-Garcia et al. [5]; Rao et al. [6]). In general, joint extraction models [7] can be divided into three main architectural paradigms: (1) Pipeline architecture, which decomposes the extraction task into sequentially executed subtasks; (2) Single-stage linking method, which directly generates triples from texts in one step; (3) Cascaded labeling framework, which decodes entities and relations step by step.

Among the above paradigms, the CasRel model proposed by Wei et al. [8] pioneered the cascaded binary tagging framework. It decomposes the complex joint extraction task into two stages, “subject recognition” and “relation–object tagging”, which alleviates the overlapping triple problem that plagued early models and has made it a landmark work in this field. Building on CasRel, subsequent studies have explored multiple directions: PRGC (Zheng et al. [9]) introduced a latent relation prediction mechanism that reduces computational complexity and restricts entity extraction to relation-specific subsets, but its performance degrades significantly in long-tail relation scenarios; TPLinker (Wang et al. [10]) reformulated joint extraction as a token-pair linking problem, enabling overlapping triples to be handled within a single-stage framework, yet its annotation space complexity grows quadratically with sequence length; OneRel (Shang et al. [11]) further unified the extraction process into a fine-grained triple classification framework, but suffers from training instability; CECRel (Tong et al. [12]) improved the extraction performance of the cascaded paradigm through contrastive learning and feature enhancement mechanisms, while its boundary recognition capability in long-entity scenarios still needs improvement. Beyond the cascaded framework, PREBI (Liu et al. [13]) explicitly enhanced entity boundary recognition, but did not consider the semantic differences within entities; DERP (Xiao et al. [14]) introduced a dynamic entity pair generation strategy to improve long-tail relation extraction performance, but increased the inference complexity of the model. In addition, Cao et al. [15] proposed a rotary position-enhanced word pair detection paradigm for Chinese document-level nominal compound relation extraction, verifying the effectiveness of relative position information for Chinese entity and relation extraction, but this method has not been applied to the cascaded extraction framework.

Despite these significant advances, existing models still face several key challenges when applied to domain-specific texts such as cultural relic descriptions. First, the attention mechanism in most joint extraction models adopts absolute position encoding, which cannot explicitly model the relative positional dependencies between entities. In the common scenario of cultural relic texts where “a single cultural relic corresponds to multiple overlapping attributes (e.g., excavation site, age, material)”, this deficiency easily leads to boundary confusion and entity misclassification. Second, the feature aggregation mechanism used in cascaded models usually only performs simple average or max pooling operations on tokens within the subject span. This not only erases fine-grained semantic differences within entities but also loses critical boundary position information. This problem is particularly prominent when processing long entity names common in the cultural relic domain (e.g., “Silver Gilt Filigree Bracelet with Double Dragons Playing with a Pearl Pattern”), and easily leads to feature degradation.

To address the above issues, this paper proposes a subject-guided two-stage joint entity and relation extraction model, and builds a full-process intelligent Cultural Relic Knowledge Graph System based on this model. The model inherits the advantages of the cascaded labeling framework, while introducing two targeted improvements tailored to the characteristics of cultural relic texts:

(1) Multi-Head Self-Attention Decoder Enhanced by Relative Position Encoding (MHSA-RPE): It replaces the absolute position encoding in the original attention mechanism with a sinusoidal (sine–cosine) relative position encoding scheme. This enables attention scores to capture both the semantic similarity and the relative distance between tokens, thereby improving the model’s perception of entity boundary positions and alleviating the boundary confusion problem in overlapping triple scenarios of cultural relic texts.
(2) Boundary–Global Dual-Branch Subject Fusion Module (BGDSFM): To counter the information loss caused by average pooling, a parallel dual-branch structure is designed. The boundary branch extracts features of the first and last tokens of the subject to strengthen boundary position information, while the global branch aggregates the average semantics of all tokens within the subject span to maintain internal integrity. The two branches are fused by element-wise addition to generate more discriminative subject-specific representations.

The organization of the rest of this paper is as follows: Section 2 elaborates on the full-process system framework and the network structure of the proposed model; Section 3 introduces the experimental datasets, evaluation metrics and experimental setup; Section 4 presents and analyzes the results of model comparison, ablation, overlapping triple, error analysis and complexity experiments; Section 5 describes the implementation and functional demonstration of the Knowledge Graph System; and Section 6 summarizes the full paper and outlines future work.

2. Method

2.1. Overall System Full-Process Framework

This paper constructs a full-process intelligent construction system for cultural relic domain knowledge graphs, adopting a “data-driven → model extraction → knowledge construction → visualization application” full-process architecture, as shown in Figure 1. The system completely covers the entire life cycle from raw data acquisition to the final practical application of knowledge graphs, and provides a solution to the problem that cultural relic knowledge scattered in web texts is difficult to utilize in a structured manner.

The system process starts from the knowledge data source layer, which simultaneously integrates semi-structured data crawled from web pages and unstructured data from text sources. After the data preprocessing and annotation stage, a high-quality “Cultural Relic Information Corpus” is constructed through data cleaning, denoising and standardized annotation, providing a solid data foundation for model training.

The core execution layer is the subject-guided two-stage joint entity and relation extraction model. Different from the complex parameter sharing method of traditional joint models, this model adopts a progressive design of “first extracting subjects, then extracting corresponding relations and objects”. The first stage accurately locates all subject entities in the text, and the second stage combines subject features with global context to generate complete entity–relation–object triples.

The extracted structured triples are directly input into the database storage layer. The Neo4j graph database is used to construct the cultural relic knowledge graph, realizing the structured storage of association relations between entities. On this basis, the system builds an upper functional presentation layer, providing services such as knowledge display, data query, graph visualization and human–computer interaction, making the complex knowledge in the cultural relic domain intuitively presented.
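As a rough illustration of how one extracted triple could be written into a graph store, the sketch below builds a parameterized Cypher `MERGE` statement. The node label `Entity`, the relationship type `RELATION`, the property names and the example triple are all hypothetical; the paper does not describe the actual graph schema, and no database connection is made here.

```python
def triple_to_cypher(subject, relation, obj):
    """Build a parameterized Cypher statement that upserts one triple.

    The schema (label `Entity`, relationship type `RELATION` with a
    `type` property) is an assumption made for illustration only.
    """
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        "MERGE (s)-[:RELATION {type: $relation}]->(o)"
    )
    params = {"subject": subject, "relation": relation, "object": obj}
    return query, params


# Made-up triple, used only to show the shape of the output.
query, params = triple_to_cypher("example relic", "dating from", "example period")
```

Using `MERGE` rather than `CREATE` keeps repeated triples from producing duplicate nodes or edges, which matters when the same relic appears in several source texts.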

Compared with traditional single extraction models, this architecture deeply integrates complex NLP tasks with knowledge engineering through full-process collaborative design. It not only improves the extraction accuracy in overlapping triple scenarios, but also forms a set of practical intelligent management solutions for cultural relic knowledge from raw data to visualization applications, providing a technical foundation for addressing the problem that fragmented and unstructured knowledge in the cultural relic domain is difficult to utilize. The core of the system is the subject-guided two-stage joint entity and relation extraction model proposed in this paper, whose specific architecture and core improvements will be elaborated in subsequent subsections.

2.2. Proposed Subject-Guided Two-Stage Model

This paper proposes a subject-guided two-stage joint entity and relation extraction model, whose overall architecture is shown in Figure 2. The model adopts a progressive design of “first globally extracting subjects, then extracting corresponding relations and objects based on subjects”, decomposing the complex joint extraction task into two highly interpretable subtasks. The combination of global feature sharing and subject-wise independent decoding helps alleviate the problems of low extraction accuracy and frequent omission in overlapping triple extraction. As illustrated in Figure 2, the model executes a progressive bottom-up processing flow corresponding to this design: (1) Input text is sequentially encoded by BERT and BiLSTM to generate a global–local integrated feature sequence $h_n$. (2) First stage: $h_n$ is fed into the MHSA-RPE enhanced Subject Extraction Positional Attention Layer to predict subject boundaries (the dashed box denotes a detected subject span $s_1$). (3) Subject features are enhanced by the BGDSFM to generate discriminative subject-specific vectors $v_{sub}^{k}$. (4) Second stage: $v_{sub}^{k}$ and the original global feature $h_n$ are jointly input into the Relational Object Extraction Positional Attention Layer to predict all corresponding relation–object pairs.

The first stage of the model is the subject extraction module. The input text first passes through the BERT Encoder to extract global contextual semantic features, obtaining initial contextual character-level representations. We then connect a BiLSTM layer to further capture local character-level dependencies in Chinese texts, compensating for BERT’s limitations in fine-grained sequence modeling, and finally generating a feature sequence $h_n$ that integrates global and local semantics. $h_n$ is input into the improved Subject Extraction Positional Attention Layer, which enhances the feature discrimination of subject boundaries and outputs the Subject Tagger Sequence. This layer retains all other components of CasRel’s attention decoder and only replaces the absolute position encoding with our proposed MHSA-RPE to strengthen positional semantics, generating subject attention features. The set of boundary coordinates of all subjects is then obtained with the start–end pointer labeling method (Figure 3a), which uses two independent binary classifiers to predict the probability of each character being the start and end position of a subject, respectively, thereby locating the boundary range of each subject entity.
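The start–end pointer labeling step can be sketched as follows. The 0.5 threshold and the rule of pairing each start with the nearest end at or after it are assumptions (a common decoding heuristic for pointer taggers); the paper does not spell out its matching rule, and the probabilities below are toy values.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each predicted start position with the nearest predicted end.

    start_probs / end_probs hold, for every character, the output of the
    two independent binary classifiers. Returned spans are inclusive
    (start, end) index pairs.
    """
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        later_ends = [e for e in ends if e >= s]
        if later_ends:  # a start without any following end is dropped
            spans.append((s, later_ends[0]))
    return spans


# Toy probabilities for a six-character sentence.
start_probs = [0.9, 0.1, 0.0, 0.8, 0.1, 0.2]
end_probs = [0.1, 0.7, 0.0, 0.1, 0.1, 0.9]
spans = decode_spans(start_probs, end_probs)
```

Because the start and end taggers are independent, several subjects in one sentence can be recovered in a single pass.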

The second stage of the model is the relation–object extraction module. The subject-corresponding features extracted in the first stage are input into the BGDSFM designed in this paper to generate more discriminative subject-specific semantic vectors $v_{sub}^{k}$. $v_{sub}^{k}$ and the original global feature $h_n$ are jointly input into the improved Relational Object Extraction Positional Attention Layer, which shares the same MHSA-RPE design and retains all other CasRel components, guiding the model to focus on the contextual regions related to the current subject and filtering out irrelevant noise. Finally, the Relational Object Taggers Sequence is output. Its labeling method is shown in Figure 3b: a separate pair of start–end pointers is set for each predefined relation, and the start and end positions of objects under that relation are predicted simultaneously, supporting the extraction scenario where one subject corresponds to multiple relations and objects.
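A minimal sketch of how relation-specific object taggers could be turned into triples for one subject is given below. The helper names, the threshold, the nearest-end pairing rule and the toy tokens are all assumptions for illustration, not the authors' implementation.

```python
def nearest_end_spans(start_probs, end_probs, threshold=0.5):
    """Pair every predicted start with the nearest predicted end (inclusive)."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    return [(s, min(e for e in ends if e >= s))
            for s in starts if any(e >= s for e in ends)]


def decode_triples(tokens, subject_span, relation_probs, threshold=0.5):
    """Assemble (subject, relation, object) triples for one subject.

    relation_probs maps each predefined relation to its own pair of
    (start_probs, end_probs) object taggers.
    """
    s_start, s_end = subject_span
    subject = "".join(tokens[s_start:s_end + 1])
    triples = []
    for relation, (start_probs, end_probs) in relation_probs.items():
        for o_start, o_end in nearest_end_spans(start_probs, end_probs, threshold):
            triples.append((subject, relation, "".join(tokens[o_start:o_end + 1])))
    return triples


tokens = ["A", "B", "C", "D", "E", "F"]  # placeholder characters
relation_probs = {
    "dating from": ([0, 0, 0.9, 0, 0, 0], [0, 0, 0, 0.8, 0, 0]),
    "made from": ([0, 0, 0, 0, 0.7, 0], [0, 0, 0, 0, 0, 0.9]),
    "excavated at": ([0] * 6, [0] * 6),  # relation absent: no object tagged
}
triples = decode_triples(tokens, (0, 1), relation_probs)
```

A relation whose taggers stay below the threshold everywhere simply contributes no triple, which is how one subject can carry any number of relations.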

In response to the two major limitations identified in Section 1 (boundary confusion under absolute position encoding and feature degradation under simple pooling), this paper proposes two targeted improvements: first, introducing the MHSA-RPE to strengthen entity boundary position perception; second, designing the BGDSFM to optimize the subject feature aggregation method.

The subsequent subsections will sequentially introduce the basic implementation of the BERT encoder and BiLSTM layer, as well as the detailed design and training objective functions of the above two core improved modules.

2.3. BERT Encoder

This study adopts BERT-base as the backbone semantic encoder. Based on the bidirectional Transformer architecture, it can effectively capture the complex semantic dependencies of domain-specific proper nouns and long descriptive sentences in cultural relic texts, providing high-quality initial feature support for subsequent entity and relation extraction. After the input text is processed into a token sequence by the BERT tokenizer [16], it is converted into a continuous vector representation through the embedding layer and fed into the multi-layer Transformer encoder [17] for global contextual semantic modeling. The input and encoding process in this study is shown in Figure 4. The finally output hidden state vector of the last layer will serve as the input to the subsequent BiLSTM layer for further extraction of fine-grained local features of the sequence.

2.4. BiLSTM

Although BERT possesses powerful long-distance semantic modeling capabilities, it still has deficiencies in capturing fine-grained character-level dependencies in Chinese texts. Therefore, this paper introduces the Bidirectional Long Short-Term Memory network (BiLSTM) after the BERT encoder to further model the contextual dependencies of long descriptive sentences in cultural relic texts [18]. BiLSTM performs encoding from both the start and end directions of the sequence through two independent forward and backward LSTM networks [19], concatenates and outputs the bidirectional hidden states at the same time step, and finally generates a feature sequence $h_n \in \mathbb{R}^{l \times 2h}$ that integrates global context and local character-level information, where $l$ denotes the sequence length and $h$ denotes the hidden layer dimension of the unidirectional LSTM. This feature sequence will be simultaneously input into the subsequent Subject Extraction Positional Attention Layer and Relational Object Extraction Positional Attention Layer, providing a unified feature foundation for entity boundary localization and relation semantic modeling.

2.5. Multi-Head Self-Attention Decoder with Relative Position Encoding

The attention mechanism of the original CasRel model adopts absolute position encoding, which cannot explicitly model the relative positional relationships between entities [20]. In overlapping triple scenarios, boundary confusion and misclassification readily occur when multiple entity boundaries are adjacent, which is one of the main problems in the long descriptive sentences and multi-entity coexistence scenarios of cultural relic texts. Recently, Veisi et al. [21] proposed an additive relative position encoding method with context-aware bias, which verifies the effectiveness of relative position information for Transformer sequence modeling. To address the above issues, this paper retains the overall decoding framework and start–end pointer labeling logic of the CasRel model unchanged and only replaces the original multi-head self-attention with absolute position encoding by multi-head self-attention enhanced with relative position encoding. Its core structure is shown in Figure 5.

For the input feature sequence $H$, the query matrix $Q$, key matrix $K$ and value matrix $V$ are first generated through linear transformation:

$$Q = H W_q, \quad K = H W_k, \quad V = H W_v \tag{1}$$

where $W_q$, $W_k$, $W_v$ are learnable projection matrices.

In this work, we adopt a sine–cosine formulation to generate relative position vectors. This method requires no additional training parameters and can generalize to sequence lengths not present in the training set, making it more suitable for processing descriptive sentences with large length differences in cultural relic texts. For any position interval $m$, the calculation formulas for the relative position vector $R_m$ and the reverse interval $R_{-m}$ are:

$$R_m = \begin{bmatrix} \sin(c_0 m) \\ \cos(c_0 m) \\ \vdots \\ \sin(c_{\frac{d_k}{2}-1} m) \\ \cos(c_{\frac{d_k}{2}-1} m) \end{bmatrix}, \quad R_{-m} = \begin{bmatrix} -\sin(c_0 m) \\ -\cos(c_0 m) \\ \vdots \\ -\sin(c_{\frac{d_k}{2}-1} m) \\ -\cos(c_{\frac{d_k}{2}-1} m) \end{bmatrix} \tag{2}$$

where $c_i = 1 / 10000^{2i/d_k}$ is the preset frequency coefficient and $d_k$ is the feature dimension of a single attention head.
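The relative position vectors of Equation (2) can be sketched in a few lines of standard-library Python. The function names are hypothetical, `d_k` is assumed to be even, and the reverse vector is taken literally from Equation (2), that is, as the component-wise negation of $R_m$.

```python
import math


def frequencies(d_k):
    """Frequency coefficients c_i = 1 / 10000^(2i/d_k), i = 0 .. d_k/2 - 1."""
    return [1.0 / (10000 ** (2 * i / d_k)) for i in range(d_k // 2)]


def relative_position_vector(m, d_k):
    """R_m: interleaved sin/cos of c_i * m, giving a vector of length d_k."""
    vec = []
    for c in frequencies(d_k):
        vec.append(math.sin(c * m))
        vec.append(math.cos(c * m))
    return vec


def reverse_position_vector(m, d_k):
    """R_{-m} as written in Equation (2): every component of R_m negated."""
    return [-x for x in relative_position_vector(m, d_k)]


r_zero = relative_position_vector(0, 4)
r_two = relative_position_vector(2, 4)
r_minus_two = reverse_position_vector(2, 4)
```

No parameters are learned here, so the same functions can be evaluated for any interval `m`, including intervals longer than those seen during training.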

Relative position encoding introduces relative distance information between tokens, such that attention weights depend not only on semantic similarity but also on the spatial positions of tokens. For the entity boundary recognition task, the relative position features of adjacent tokens can provide boundary discrimination signals. When two entity boundaries are adjacent, relative position encoding can explicitly distinguish the attention patterns of “intra-entity tokens” and “entity boundary tokens”, helping to reduce boundary confusion.

The single-head attention score with relative position encoding adds bidirectional relative position terms and learnable bias terms to the original dot-product term of CasRel. It is calculated as:

$$A_{t,j} = Q_t K_j^{T} + Q_t R_{t-j}^{T} + K_j R_{j-t}^{T} + u K_j^{T} + v R_{t-j}^{T} \tag{3}$$

where $A_{t,j}$ is the unnormalized attention score of the $t$-th token with respect to the $j$-th token, $R_{t-j}$ and $R_{j-t}$ are the bidirectional relative position vectors, and $u$ and $v$ are the learnable content bias and position bias, respectively.

Compared with the attention mechanism of CasRel that can only model semantic similarity, the improved score can simultaneously capture the semantic association and positional distance between tokens, helping to distinguish the boundaries of overlapping entities.
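To make the five terms of Equation (3) concrete, the sketch below evaluates the score for a single token pair. The relative position vectors are passed in as arguments rather than computed, and all numeric values are toy inputs chosen for easy hand checking.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_score(q_t, k_j, r_fwd, r_bwd, u, v):
    """Unnormalized score A_{t,j} from Equation (3).

    r_fwd stands for R_{t-j} and r_bwd for R_{j-t}; u and v are the
    content bias and position bias vectors.
    """
    content = dot(q_t, k_j)           # Q_t K_j^T
    query_position = dot(q_t, r_fwd)  # Q_t R_{t-j}^T
    key_position = dot(k_j, r_bwd)    # K_j R_{j-t}^T
    content_bias = dot(u, k_j)        # u K_j^T
    position_bias = dot(v, r_fwd)     # v R_{t-j}^T
    return content + query_position + key_position + content_bias + position_bias


score = attention_score(
    q_t=[1.0, 0.0], k_j=[0.5, 2.0],
    r_fwd=[0.0, 1.0], r_bwd=[0.0, -1.0],
    u=[1.0, 1.0], v=[0.5, 0.5],
)
```

Only the first term depends purely on content; the remaining four let the score shift with the offset between the two tokens, which is what the boundary-sensitive behaviour described above relies on.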

After scaling the attention scores and normalizing them with Softmax, they are weighted with the value matrix $V$ to obtain the single-head output. Then, the outputs of multiple heads are concatenated and linearly transformed to obtain the final attention layer output:

$$\mathrm{head}_i = \mathrm{Softmax}\left(\frac{A_i}{\sqrt{d_k}}\right) V \tag{4}$$

$$\mathrm{MHSA\text{-}RPE}(H) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h) W_o \tag{5}$$

where $h$ is the number of attention heads and $W_o$ is the output fusion matrix.
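Equations (4) and (5) can be illustrated with a small pure-Python sketch. It covers the scaled Softmax, the weighting of the value matrix and the concatenation of heads; the final projection by $W_o$ is left out for brevity, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_output(scores, values, d_k):
    """Equation (4): Softmax(A / sqrt(d_k)) V for one head.

    scores is the l x l score matrix A_i, values the l x d_k matrix V.
    """
    scale = math.sqrt(d_k)
    width = len(values[0])
    output = []
    for row in scores:
        weights = softmax([a / scale for a in row])
        output.append([sum(w * v[c] for w, v in zip(weights, values))
                       for c in range(width)])
    return output


def concat_heads(head_outputs):
    """Concat step of Equation (5): join per-token features of all heads."""
    length = len(head_outputs[0])
    return [sum((head[t] for head in head_outputs), []) for t in range(length)]


head = single_head_output([[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], d_k=2)
merged = concat_heads([head, head])
```

With all scores equal, each token attends uniformly, so every output row is the average of the value rows.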

In the subject extraction stage, this attention layer takes the global feature $h_n$ output by BiLSTM as input and enhances the feature discrimination of adjacent entity boundaries through relative position encoding, which helps reduce the missed detection and false detection of overlapping subjects. In the relation–object extraction stage, the attention layer fuses the global feature $h_n$ and the subject-specific semantic vector $v_{sub}^{k}$, which can model the relative positional relationships between the subject and objects corresponding to different relations, helping to alleviate boundary confusion when one subject corresponds to multiple overlapping objects. This design is particularly suitable for the common overlapping triple scenario in cultural relic texts where “one cultural relic corresponds to multiple ages, excavation sites and materials”.

2.6. Boundary–Global Dual-Branch Subject Fusion Module

In the original CasRel model, subject features are obtained only by averaging all token features in the subject-corresponding interval of the global feature sequence $h_n$. This method has two core defects: First, simple averaging erases fine-grained semantic differences within subjects and easily loses the critical position information of entity boundaries. Second, it cannot distinguish the subject region from irrelevant context noise, which is prone to feature degradation in scenarios with long sentences and multiple overlapping entities common in cultural relic texts, leading to a significant drop in the accuracy of subsequent relation–object extraction.

To address the above issues, we propose the BGDSFM. It generates subject-specific feature vectors $v_{sub}^{k}$ that integrate both boundary discrimination and global semantic integrity by encoding boundary local features and global contextual features in parallel and then fusing them. Its structure is shown in Figure 6.

This module adopts a dual-branch parallel encoding structure, which extracts subject features from the local boundary and global dimensions respectively, and finally realizes feature fusion through element-wise addition.

In the subject boundary feature branch, only the first and last token features $w_i$ and $w_j$ at the subject-corresponding positions are extracted. The two are concatenated and then encoded by a linear layer to obtain the boundary local feature. The calculation formula is:

$$s_{bound} = \mathrm{Linear}(\mathrm{Concat}(w_i, w_j)) \tag{6}$$

This branch captures the start and end boundary position information of the subject, improves the feature discrimination of entity boundaries, and to a certain extent compensates for the deficiency of boundary feature loss caused by average pooling in CasRel.

In the global context feature branch, average pooling is performed on all token features in the subject-corresponding interval of the global feature sequence $h_n$, and the result is then encoded by a linear layer to obtain the global feature. The calculation formula is:

$$s_{global} = \mathrm{Linear}(\mathrm{AvgPool}(h_n[s_{start} : s_{end}])) \tag{7}$$

This branch aggregates the global context information of the subject region, retains the overall semantic representation inside the subject, and avoids information loss caused by using only the first and last features.

Finally, the dual-branch features are fused by element-wise addition. This operation does not increase the feature dimension and keeps the model lightweight, yielding the final subject-specific feature vector:

$$v_{sub}^{k} = s_{bound} + s_{global} \tag{8}$$

The dual-branch fusion mechanism alleviates the information loss problem of single average pooling through complementary feature extraction. The boundary branch provides strong discriminative signals for entity boundaries, while the global branch preserves the overall semantic information of entities. Their element-wise addition fusion endows the subject features with both boundary accuracy and semantic integrity. For long entity names, the boundary branch can accurately locate the start and end positions of entities, and the global branch can aggregate the semantic information within entities, thereby effectively alleviating the feature degradation problem.
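The dual-branch fusion of Equations (6)–(8) can be sketched with plain lists. The weights below are hand-picked toy values, the span end is treated as inclusive (an assumption about the slice in Equation (7)), and the function names are hypothetical.

```python
def linear(x, weight, bias):
    """Linear layer y = W x + b, with `weight` given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def bgdsfm(h, start, end, w_bound, b_bound, w_global, b_global):
    """Fuse boundary and global features of the subject span h[start..end]."""
    boundary_in = h[start] + h[end]                        # Concat(w_i, w_j)
    s_bound = linear(boundary_in, w_bound, b_bound)        # Equation (6)
    span = h[start:end + 1]
    pooled = [sum(col) / len(span) for col in zip(*span)]  # AvgPool over span
    s_global = linear(pooled, w_global, b_global)          # Equation (7)
    return [a + b for a, b in zip(s_bound, s_global)]      # Equation (8)


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # toy feature sequence
v_sub = bgdsfm(
    h, start=1, end=3,
    w_bound=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]], b_bound=[0.0, 0.0],
    w_global=[[1.0, 0.0], [0.0, 1.0]], b_global=[0.0, 0.0],
)
```

Both branches project to the same output size, which is what allows element-wise addition instead of concatenation and keeps the fused vector the same width as either branch.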

The fused subject-specific feature vector $v_{sub}^{k}$ will be jointly input into the Relational Object Extraction Positional Attention Layer with the global feature $h_n$, guiding the model to focus on the context region related to the current subject.

2.7. Training Objective

This model inherits the log-likelihood objective function based on the probability chain rule from CasRel, realizing joint supervised optimization for the whole process of “subject-recognition–relation-classification–object-matching”. This objective function can simultaneously constrain the parameter updates of our proposed MHSA-RPE decoder and BGDSFM, ensuring that each module collaborates with one another during training.

From the perspective of joint probability decomposition logic: first, the joint probability $p((s, r, o) \mid x_i)$ of subject $s$, relation $r$ and object $o$ is decomposed into the product of “subject recognition probability” and “relation–object joint probability given the subject”; it is further decomposed into three parts: “subject recognition probability”, “object probability when the relation exists” and “negative sample probability when the relation does not exist”, as shown in Formulas (9)–(11):

$$\prod_{i=1}^{|D|} \left[ \prod_{(s, r, o) \in T_i} p((s, r, o) \mid x_i) \right] \tag{9}$$

$$= \prod_{i=1}^{|D|} \left[ \prod_{s \in T_i} p(s \mid x_i) \prod_{(r, o) \in T_i | s} p((r, o) \mid s, x_i) \right] \tag{10}$$

$$= \prod_{i=1}^{|D|} \left[ \prod_{s \in T_i} p(s \mid x_i) \prod_{r \in T_i | s} p_r(o \mid s, x_i) \prod_{r \in R \setminus T_i | s} p_r(o_{\emptyset} \mid s, x_i) \right] \tag{11}$$

where $|D|$ represents the total number of texts in dataset $D$, $x_i$ is the $i$-th text; $T_i$ denotes the set of all ground-truth triples $(s, r, o)$ in the $i$-th text; $T_i | s$ denotes the set of relations in triples headed by subject $s$ in the $i$-th text; $R \setminus T_i | s$ denotes the set of remaining relations after removing the relations in $T_i | s$ from the set of all possible relations $R$; and $o_{\emptyset}$ represents the object element that has no relation with subject $s$.

Based on the above probability decomposition, the optimizable objective function $J(\theta)$ is obtained through logarithmic transformation:

$$J(\theta) = \sum_{i=1}^{|D|} \left[ \sum_{s \in T_i} \log p(s \mid x_i) + \sum_{r \in T_i | s} \log p_r(o \mid s, x_i) + \sum_{r \in R \setminus T_i | s} \log p_r(o_{\emptyset} \mid s, x_i) \right] \tag{12}$$

By maximizing this log-likelihood objective function, joint training of subject recognition, relation classification and object matching is realized. Meanwhile, the two core improved modules proposed by us can also be fully optimized, ultimately improving the overall extraction accuracy of the model in overlapping triple scenarios of cultural relic texts.
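The sketch below shows one way each term of Equation (12) could be evaluated for a single text. It assumes every probability is realized by a binary start or end tagger, so that each log term is a sum of per-position Bernoulli log-likelihoods; this per-position form follows the usual binary tagging formulation and is not written out in the text above. Function names and numbers are illustrative.

```python
import math


def tagger_log_likelihood(probs, labels, eps=1e-12):
    """Sum over positions of log(p) where the label is 1, log(1 - p) otherwise."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += math.log(p + eps) if y == 1 else math.log(1.0 - p + eps)
    return total


def sample_objective(subject_taggers, object_taggers):
    """Log-likelihood of one text.

    Each argument is a list of (probs, labels) pairs: subject start/end
    taggers, then object start/end taggers for every relation. A relation
    the subject does not take part in contributes all-zero labels, which
    plays the role of the "null object" term in Equation (12).
    """
    return sum(tagger_log_likelihood(p, y)
               for p, y in subject_taggers + object_taggers)


subject_taggers = [([0.9, 0.1], [1, 0])]  # toy subject start tagger
object_taggers = [([0.2, 0.2], [0, 0])]   # toy tagger for an absent relation
objective = sample_objective(subject_taggers, object_taggers)
```

In training, the negative of this quantity would be minimized, which is equivalent to summing binary cross-entropy losses over all taggers.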

3. Experimental Setup

3.1. Datasets

All experiments in this work are conducted on two datasets: our self-built Palace Museum Cultural Relic Entity–Relation Dataset (PM-CRER) and the public general-domain dataset DuIE2.0 [22], which are used to verify the model’s specialized extraction performance in the vertical cultural relic domain and its cross-domain generalization ability in general scenarios, respectively.

The self-built PM-CRER dataset is constructed from public data on the official website of the Palace Museum, from which a total of 7870 original cultural relic records were crawled. In the data preprocessing stage, HTML tags are stripped and redundant whitespace characters are cleaned using regular expressions. Invalid texts with fewer than 10 characters are filtered out, traditional–simplified variant characters are normalized, full-width/half-width punctuation is standardized, and webpage garbled characters and special characters are corrected. Duplicate samples are removed based on text content similarity, and finally 7638 high-quality cultural relic description texts are obtained. The quantity distribution of various cultural relic entities is shown in Figure 7, covering a total of 23 core cultural relic types. Among them, paintings, ceramics, calligraphy, and jade and stone wares account for the highest proportion of data volume, which is highly consistent with the actual distribution characteristics of cultural relics in the collection of the Palace Museum.
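Part of the preprocessing pipeline described above can be sketched with the Python standard library. The regular expressions, the use of NFKC normalization to fold full-width characters to half-width, the 0.95 similarity threshold and the sample strings are all assumptions; traditional–simplified conversion and garbled-character repair are not covered here.

```python
import re
import unicodedata
from difflib import SequenceMatcher

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher
SPACE_RE = re.compile(r"\s+")    # any run of whitespace


def clean_record(raw):
    """Strip tags, fold full-width forms, and collapse whitespace."""
    text = TAG_RE.sub("", raw)
    text = unicodedata.normalize("NFKC", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(records, min_chars=10, dup_threshold=0.95):
    """Clean records, drop short texts, and remove near-duplicates."""
    kept = []
    for raw in records:
        text = clean_record(raw)
        if len(text) < min_chars:
            continue
        if any(SequenceMatcher(None, text, prev).ratio() >= dup_threshold
               for prev in kept):
            continue
        kept.append(text)
    return kept


# Made-up records, used only to exercise each step.
records = [
    "<p>Blue-and-white  porcelain vase</p>",
    "<div>Blue-and-white porcelain vase </div>",
    "<b>shard</b>",
    "Height： ３０ cm, measured at rim",
]
cleaned = preprocess(records)
```

The pairwise similarity check is quadratic in the number of kept texts, which is acceptable for a corpus of a few thousand records but would need indexing or hashing at larger scale.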

The division of entity and relation types is refined with reference to the industry standards Specification for Archives of Cultural Relics Collections (WW/T 0020-2008) [23] issued by the National Cultural Heritage Administration and Census of State-owned Movable Cultural Relics—Classification Standard for Cultural Relics (Trial) [24]. Meanwhile, it follows the guiding principles of Artificial Intelligence—Technical Framework for Knowledge Graph (GB/T 42131-2022) [25] and is further refined in light of the semantic characteristics of cultural relic texts to ensure the completeness and domain adaptability of the division. Finally, 37 types of entities are defined. These comprise 23 types of cultural relic entities: ceramics; paintings; calligraphy; inscriptions; bronzeware; enamel; lacquerware; sculpture; gold, silver and tin wares; jade and stone wares; seals; textiles and embroidery; stationery; furniture; clocks and instruments; glassware; bamboo, wood, ivory, horn and gourd; court and religious relics; jewelry; military and ceremonial wares; music and opera relics; daily utensils; and foreign cultural relics. They also comprise 14 types of attribute entities: cultural relic number, time, location, size, material, appearance, function, person, version, font, decoration, inscription, theme, and craftsmanship. Meanwhile, 14 core relations are defined: excavated at, numbered as, dating from, size of, type of, characterized by, used for, created by, calligraphy of, decorated with, engraved with, adopting, made from, and texture of, which together cover the core dimensions of cultural relic knowledge.

The statistical results of the relation distribution in the PM-CRER dataset are shown in Table 1. It can be seen that the three relations of “Numbered as”, “Size of” and “Dating from” account for the highest proportions of samples, reaching 17.8%, 17.7% and 16.4% respectively, which is consistent with the semantic characteristics of cultural relic description texts. The relation distribution exhibits a certain degree of imbalance. The sample sizes of the “Type of” and “Made from” relations are the smallest, each accounting for 0.7%. This is because the type information of most cultural relics in the Palace Museum collection is already included in their names, while the “Made from” relation is only applicable to cultural relics of specific materials.

The PM-CRER dataset is mainly sourced from the official website of the Palace Museum, so there is a certain institutional bias in the distribution of cultural relic types. Imperial cultural relics and cultural relics of the Ming and Qing dynasties account for a relatively high proportion, while the sample sizes of folk cultural relics and early ancient cultural relics are relatively small. In addition, the dataset only contains Chinese cultural relic description texts and does not cover cultural relic materials in other languages. In subsequent work, we will expand the sources of the dataset to reduce this bias.

Data annotation is carried out on the open-source doccano annotation platform [26] by three annotators who received unified training on the annotation specification and have basic knowledge of cultural relics. The annotation process adopts a “double independent annotation + cross-validation” mechanism [27]. For the consistency test, 10% (764 items) of all annotated samples are randomly selected, and Cohen’s Kappa coefficient is used to measure agreement: the Kappa value is 0.89 for entity boundary and type annotation and 0.83 for relation-type annotation. Both values exceed 0.80, a level generally regarded as excellent agreement, indicating reliable annotation quality. Annotation discrepancies are adjudicated by two domain experts with more than 5 years of research experience in Palace Museum cultural relics. The final cultural relic corpus contains a total of 42,386 subject–relation–object triples, and Table 2 shows a typical annotation example of cultural relic text.
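For readers unfamiliar with the agreement measure used above, the following sketch computes Cohen's Kappa for two annotators who labelled the same items. The function name and the toy labels are illustrative assumptions; the paper does not describe its own implementation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only; these are not items from PM-CRER.
annotator_1 = ["time", "size", "material", "time", "size", "person"]
annotator_2 = ["time", "size", "material", "time", "person", "person"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percentage agreement when label frequencies are unbalanced.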

DuIE2.0 is a Chinese entity and relation extraction benchmark dataset released by the Language and Intelligence Technology Competition in 2020, and it is also the largest public Chinese extraction dataset in the industry at present. This dataset covers multiple general fields such as news, encyclopedias and films, containing more than 430,000 annotated triples and 210,000 Chinese sentences. It has rich relation types and a large number of overlapping triple samples, and is widely used to verify the general performance of entity and relation extraction models.

In this paper, the PM-CRER dataset is randomly divided into training set, validation set and test set in a ratio of 7:1:2, ensuring no overlapping samples among the three; the DuIE2.0 dataset adopts the standard division method officially released. The division statistics of the two datasets are shown in Table 3.
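A random 7:1:2 split without overlap can be sketched as below. The function name, the fixed seed and the rounding rule (integer division, with the remainder assigned to the test set) are assumptions; the set sizes actually used are those reported in Table 3.

```python
import random


def split_dataset(samples, ratios=(7, 1, 2), seed=42):
    """Shuffle once, then cut into disjoint train / validation / test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    held_out = items[n_train + n_val:]  # remainder after rounding goes to the test set
    return train, val, held_out
```

Because the three parts are slices of one shuffled list, no sample can appear in more than one of them.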

3.2. Evaluation Metrics

This study adopts Precision (P), Recall (R) and F1-score as the core evaluation metrics for the entity–relation triple-extraction task. To evaluate model performance more comprehensively, we also report several commonly used binary classification metrics: Geometric Mean (GMean), Sensitivity, Specificity, Jaccard Coefficient and Area Under the Receiver Operating Characteristic Curve (AUROC). The metrics are calculated as follows:

P = \frac{TP}{TP + FP} \times 100\%

(13)

R = \mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100\%

(14)

F1 = 2 \times \frac{P \times R}{P + R} \times 100\%

(15)

\mathrm{Specificity} = \frac{TN}{TN + FP} \times 100\%

(16)

\mathrm{GMean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}} \times 100\%

(17)

\mathrm{Jaccard} = \frac{TP}{TP + FP + FN} \times 100\%

(18)

\mathrm{AUROC} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} I(s_{i} > s_{j})

(19)

where TP denotes the number of correctly predicted relation triples, FP denotes the number of incorrectly predicted triples, FN denotes the number of undetected true triples, and TN denotes the number of correctly predicted negative samples. M is the number of positive samples, N is the number of negative samples, $s_{i}$ is the prediction score of the i-th positive sample, $s_{j}$ is the prediction score of the j-th negative sample, and $I(\cdot)$ is the indicator function, which takes the value of 1 when the condition is satisfied and 0 otherwise.
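The triple-level metrics and the pairwise form of AUROC in Equations (13)–(15), (18) and (19) can be sketched as follows. The function names are illustrative, and the sketch assumes that a predicted triple counts as correct only when subject, relation and object all match a gold triple exactly; the paper does not spell out its matching rule. Values are returned as fractions and can be multiplied by 100 to obtain percentages.

```python
def triple_prf(predicted, gold):
    """Precision, recall, F1 and Jaccard over sets of (subject, relation, object)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # triples predicted and present in the gold set
    fp = len(predicted - gold)   # triples predicted but not in the gold set
    fn = len(gold - predicted)   # gold triples the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"P": precision, "R": recall, "F1": f1, "Jaccard": jaccard}


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(1 for s_i in pos_scores for s_j in neg_scores if s_i > s_j)
    return wins / (len(pos_scores) * len(neg_scores))
```

As in Equation (19), tied scores contribute 0 here; some implementations count a tie as 0.5 instead.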

3.3. Setup

The development, training and inference processes of all models in this study are completed in a unified software and hardware environment. At the software level, Python 3.11.8 is used as the development language, the core model logic is implemented based on the PyTorch 2.6.0 deep learning framework, and CUDA 11.8 is used for GPU parallel acceleration computing. At the hardware level, all experiments are run on a Linux server cluster equipped with multi-core CPUs and high-performance GPUs. The detailed configuration of the experimental environment is shown in Table 4.

The subject-guided two-stage joint entity and relation extraction model proposed in this paper takes the BERT-base pre-trained model open-sourced by the Hugging Face platform as the basic semantic encoder and uses its bidirectional context modeling capability to extract the initial character-level semantic features of cultural relic texts. The model training adopts the AdamW optimizer to realize adaptive iterative update of parameters and introduces an Early Stopping strategy. Training is terminated early when the validation set F1-score does not improve for 10 consecutive rounds, effectively avoiding model overfitting. The core hyperparameter configuration involved in the experiment is shown in Table 5.
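The early stopping rule described above can be sketched independently of any training framework. The class name and the `min_delta` parameter are illustrative assumptions, and the sketch assumes that one "round" corresponds to one validation pass per epoch.

```python
class EarlyStopping:
    """Stop training when validation F1 has not improved for `patience` evaluations."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta          # hypothetical margin; 0 means any gain counts
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_f1):
        """Record one validation result and return True if training should stop."""
        if val_f1 > self.best_score + self.min_delta:
            self.best_score = val_f1
            self.best_epoch = epoch
            self.bad_epochs = 0             # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once after each validation pass, and the checkpoint saved at `best_epoch` would be the one used for testing.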

All baseline models are implemented based on the official code publicly available in their original papers, and the optimal hyperparameter configurations recommended in the original papers are adopted. For models without official code available, we strictly reproduce them according to the descriptions in the original papers, and conduct training and testing in the same experimental environment. To ensure the fairness and comparability of the results, all models use the same dataset splits and preprocessing pipelines.

3.4. Baseline Models

To verify the performance of the subject-guided two-stage joint entity and relation extraction model proposed in this paper in the entity and relation extraction task, we selected 11 representative baseline methods covering different technical paradigms to construct a comprehensive comparative experimental system.

(1): Bert-LSTM [28]: A classic pipeline baseline, which adopts the “BERT semantic encoding + BiLSTM sequence feature extraction” architecture and serves as a general performance benchmark for all deep learning extraction models.
(2): CasRel [8]: The direct prototype of the model in this paper, which first proposed the cascaded binary labeling framework, effectively solved the overlapping triple problem, and has become an important milestone in the field.
(3): TPLinker [10]: A representative single-stage linking model, which transforms the extraction task into a token-pair linking problem and realizes true end-to-end joint extraction.
(4): PRGC [9]: A core improvement of CasRel, which proposes a three-stage architecture of “latent-relation-prediction–subject-extraction–object-extraction” and significantly reduces computational complexity.
(5): PREBI [13]: An enhanced model specifically for entity boundary recognition, which improves entity extraction accuracy by explicitly introducing boundary information.
(6): DERP [14]: A dual-head entity and relation prediction framework, which divides the entity recognition process into two stages: head entity recognition and tail entity recognition, and introduces a triple prediction module to improve the accuracy and completeness of extraction.
(7): SPN4RE [29]: A single-stage model based on semantic-aware pointer network, which optimizes the extraction effect in complex scenarios by introducing semantic prior knowledge.
(8): CECRel [12]: The latest cascaded state-of-the-art (SOTA) model, which improves extraction performance in general domains through context enhancement and entity-relation interaction mechanisms.
(9): LLaMA-3-8B-Instruct [30]: An instruction-tuned large language model that performs entity and relation extraction in a zero-shot setting.
(10): Qwen-2-7B-Instruct [31]: An open-source large language model with strong Chinese language capabilities, which performs entity and relation extraction in a zero-shot setting.
(11): GLM-4-9B-Chat [32]: A general-purpose large language model developed by Tsinghua University, performing entity and relation extraction in a zero-shot setting.

The above 11 comparative models comprehensively cover the core technical routes in the field of entity and relation extraction, including pipeline architecture, cascaded joint extraction, single-stage linking-based extraction, boundary-/semantic-enhanced methods, and the latest large language model (LLM)-driven extraction methods. By comparing against these models, the performance of the model proposed in this paper can be comprehensively and objectively evaluated from three dimensions: the effectiveness of the overall architecture, the pertinence of core improvements, and the adaptability to specific scenarios in the cultural relic domain such as small samples and blurred boundaries.

4. Experimental Results and Analysis

4.1. Model Comparative Experiment

This paper conducts comparative experiments on the general dataset DuIE2.0 and the self-built cultural relic domain dataset PM-CRER. All experiments on the PM-CRER dataset are independently repeated 10 times, and the mean results ± standard deviation are reported. The entity and relation extraction performance of different models is shown in Table 6 and Table 7. All baseline models adopt the optimal hyperparameter configurations recommended in their original papers, and the experimental environment is completely consistent with that of the proposed model to ensure the fairness and comparability of the results.

The experimental results show that the subject-guided two-stage joint entity and relation extraction model proposed in this paper achieves the best performance on all evaluation metrics across both datasets, verifying its cross-domain generalization capability and specialized adaptability to the cultural relic domain. On the general dataset DuIE2.0, the F1-score of the proposed model reaches 79.4%, which is 2.6 percentage points higher than the current state-of-the-art (SOTA) cascaded model CECRel and 5.6 percentage points higher than the latest LLM baseline GLM-4-9B-Chat. On the domain-specific PM-CRER dataset, the advantage of the proposed model is more pronounced: it achieves an F1-score of 75.9 ± 0.32%, which is 3.6 percentage points higher than its direct prototype CasRel, 1.2 percentage points higher than the best end-to-end model SPN4RE, and 10.0 percentage points higher than GLM-4-9B-Chat.

The experimental results indicate that Bert-LSTM performs the worst on both datasets. Its step-by-step processing approach suffers from inherent error propagation problems and cannot effectively handle overlapping triples, which verifies the overall superiority of the joint extraction architecture. On the DuIE2.0 dataset with sufficient data volume, the performance gaps between various models are relatively small. However, on the PM-CRER dataset with few samples and complex entity boundaries, the model performance shows obvious differentiation. Through the relative position encoding and dual-branch subject fusion mechanism, the proposed model exhibits good adaptability to the typical scenarios of long entities and multiple overlapping triples in the cultural relic domain.

The performance of large language models under the zero-shot setting is lower than that of specially designed joint extraction models. This is because LLMs lack specialized training for the entity and relation extraction task and are prone to hallucinations in low-resource domain scenarios. Nevertheless, the F1-score of GLM-4-9B-Chat on PM-CRER reaches 65.9%, indicating that large language models have potential in this task. In the future, their performance can be further improved through instruction tuning or Retrieval-Augmented Generation (RAG) technology.

A paired-samples t-test was conducted on the F1-scores from 10 independent experiments to analyze the statistical significance of the difference between the proposed model and the top five mainstream baseline models (SPN4RE, DERP, PREBI, CECRel, CasRel) on the PM-CRER dataset. The average F1-score of the proposed model on this dataset is 75.9 ± 0.32%, while that of the second-best model SPN4RE is 74.7 ± 0.41%; the paired t-test yields t(9) = 8.92, p < 0.001. For the DERP model with an average F1-score of 74.4 ± 0.39%, the test yields t(9) = 10.17, p < 0.001. For the PREBI model with an average F1-score of 73.3 ± 0.43%, the test yields t(9) = 12.54, p < 0.001. For the CECRel model with an average F1-score of 73.8 ± 0.40%, the test yields t(9) = 11.89, p < 0.001. For CasRel, the direct prototype of the proposed model, with an average F1-score of 72.3 ± 0.47%, the test yields t(9) = 15.23, p < 0.001. These results show that the improvement of the proposed model over each of these baselines is statistically significant and unlikely to be due to chance.
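The paired t statistic reported above can be computed from per-run scores as in the following sketch. The function name is illustrative, and the sketch assumes that run k of the proposed model is paired with run k of the baseline (for example, by sharing a random seed), which the paper does not state explicitly.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired samples with at least two runs each")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

With 10 runs per model the statistic has 9 degrees of freedom, which matches the t(9) notation used above. The p-value then follows from the t distribution; `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call.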

4.2. Ablation Experiment

To quantitatively verify the contribution of each key module proposed in this paper to the performance of entity and relation extraction, ablation experiments are designed using the control variable method [33]. By sequentially removing the BiLSTM Layer, MHSA-RPE and BGDSFM from the model, the changes in the F1-score of different model variants on the DuIE2.0 and PM-CRER datasets are compared. The experimental results are shown in Table 8. All experiments are independently repeated 10 times, and the mean results ± standard deviation are reported.

The experimental results show that removing any module leads to a decline in model performance, indicating that all modules contribute to the model’s performance. MHSA-RPE and BGDSFM have the largest impact: removing them lowers the F1-score on the PM-CRER dataset by 1.6 and 1.5 percentage points respectively, and both drops are larger than the corresponding drops on the DuIE2.0 dataset.

This result is consistent with expectations. MHSA-RPE alleviates the boundary confusion caused by absolute position encoding by explicitly modeling the relative positional relationships between entities, and BGDSFM alleviates the feature degradation caused by simple pooling by fusing boundary and global features. Both modules target the core pain points of cultural relic domain texts, such as blurred entity boundaries, numerous long-name entities and common overlapping triples, so their gains are more pronounced on the PM-CRER dataset. The BiLSTM layer, which mainly assists in capturing long-distance sequence dependencies, contributes relatively little because BERT itself already has strong long-distance semantic modeling capabilities. The ablation results verify the effectiveness and domain adaptability of the two core improvements proposed in this paper.

4.3. Overlapping Triples Experiment

To deeply verify the extraction capability of the proposed model in different types of overlapping triple scenarios, we compare the F1-score performance of our model and its direct prototype CasRel in four typical scenarios. The experimental results are shown in Table 9 and Figure 8. Among them, Normal represents the non-overlapping triple scenario, SEO represents the single-entity overlapping scenario, EPO represents the entity pair overlapping scenario, and SOO represents the subject–object overlapping scenario [34].
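To make the four scenarios concrete, the sketch below assigns overlap types to the triples of one sentence. It is an illustration under stated assumptions, not the evaluation script used in the paper: the function name is hypothetical, entities are compared as surface strings rather than character spans, and SOO is approximated by one entity string containing the other.

```python
from itertools import combinations


def overlap_types(triples):
    """Return the set of overlap types among (subject, relation, object) triples."""
    triples = list(dict.fromkeys(triples))   # drop exact duplicates, keep order
    types = set()
    # SOO: subject and object of the same triple overlap (string containment here).
    for subj, _, obj in triples:
        if subj in obj or obj in subj:
            types.add("SOO")
    for (s1, _, o1), (s2, _, o2) in combinations(triples, 2):
        pair1, pair2 = {s1, o1}, {s2, o2}
        if pair1 == pair2:
            types.add("EPO")   # same entity pair linked by more than one relation
        elif pair1 & pair2:
            types.add("SEO")   # exactly one entity shared between the two triples
    return types or {"Normal"}
```

A sentence can belong to several scenarios at once, which is why the function returns a set rather than a single label.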

The experimental results show that our model outperforms CasRel in all four scenarios, with the largest gains in the most complex overlapping scenarios. In the Normal non-overlapping scenario, our model improves over CasRel by 4.3 and 2.8 percentage points on the DuIE2.0 and PM-CRER datasets respectively, verifying the overall superiority of the model’s basic architecture. In the SEO single-entity overlapping scenario, the improvements are 1.9 and 3.3 percentage points on the two datasets respectively. In the EPO entity pair overlapping scenario, the improvements are larger, reaching 5.6 and 4.8 percentage points respectively. In the most complex SOO subject–object overlapping scenario, the improvement reaches its peak, with increases of 6.9 and 7.9 percentage points on the DuIE2.0 and PM-CRER datasets respectively.

4.4. Error Analysis

To deeply analyze the error types and underlying causes of the model, we randomly selected 1000 incorrectly extracted triples from the PM-CRER test set. These errors are mainly categorized into boundary confusion errors, relation misclassification errors, object omission errors, subject omission errors and other types. Boundary confusion errors refer to incorrect entity boundary recognition, such as misidentifying “silver gilt filigree bracelet with double dragons playing with a pearl pattern” as “silver gilt filigree bracelet”. Relation misclassification errors refer to correct entity boundary recognition but incorrect prediction of relation types, such as mispredicting “decorated with” as “engraved with”. Object omission errors refer to correct recognition of subjects and relations but failure to detect the corresponding objects. Subject omission errors refer to failure to detect subject entities in the text. Other errors include entity-type errors, multi-label prediction errors, etc. The statistical distribution of various errors is shown in Table 10.

It can be seen that boundary confusion errors are the most dominant error type, accounting for 38.5%. This is mainly attributed to the prevalence of long entity names and nested entities in cultural relic texts, such as “Qing Qianlong period famille rose hollow revolving vase”, which contains multiple internal semantic units and is prone to boundary recognition errors. Relation misclassification errors account for 29.7% of total errors and mainly occur between semantically similar relations, such as “decorated with” and “engraved with”, “adopting” and “made from”. Object omission errors account for 19.7%, primarily because some objects are described implicitly in the text or are located far from their corresponding subjects.

We compared the performance of the CasRel model and the proposed model on typical error cases. For example, for the text “Qing hua chan zhi lian wen mei ping, Yongle period of Ming Dynasty, height 32.5 cm, mouth diameter 5.2 cm, foot diameter 11.5 cm. The body of the vase is painted with interlocking lotus pattern, and the shoulder is painted with ruyi cloud head pattern”, the CasRel model incorrectly identified “Qing hua chan zhi lian wen mei ping” as “Qing hua chan zhi lian wen” and omitted the triple “(Qing hua chan zhi lian wen mei ping, decorated with, ruyi cloud head pattern)”. In contrast, the proposed model accurately recognized the entity boundaries through relative position encoding and generated more discriminative subject features via the BGDSFM, successfully extracting all correct triples.

To intuitively verify the enhancement effect of relative position encoding on entity boundary recognition capability, we visualized the average weight distribution of multi-head self-attention in the subject extraction stage for the above typical case, and the results are shown in Figure 9. It can be clearly observed that the attention weights of the CasRel model are relatively uniformly distributed in the “Qing hua chan zhi lian wen” region, but a significant weight cliff appears between the real entity boundaries “wen” and “mei”, with the attention weights at the boundaries being only 0.02 and 0.01 respectively. This prevents the model from perceiving the semantic continuity between the subsequent “mei ping” part and the preceding text, ultimately truncating the entity into “Qing hua chan zhi lian wen”. In contrast, by introducing relative position encoding, the proposed model increases the self-attention weights of the entity’s start token “Qing” and end token “ping” to 0.42 and 0.39 respectively, forming a strong boundary anchoring signal. Meanwhile, the attention weights of adjacent tokens within the entity exhibit a smooth gradient change, which can completely cover the entire semantic range of the long entity “Qing hua chan zhi lian wen mei ping”.

4.5. Cross-Domain Validation Experiments

To evaluate the cross-domain generalization ability of the proposed model, we trained the model on the DuIE2.0 dataset and then tested it on the PM-CRER test set under the zero-shot cross-domain setting. We also conducted tests after fine-tuning the model with the PM-CRER training set in the few-shot cross-domain scenario. The experimental results are presented in Table 11.

The results demonstrate that the model possesses favorable cross-domain generalization capability. In the zero-shot setting, it achieves an F1-score of 62.3% on PM-CRER, outperforming LLaMA-3-8B-Instruct under the same zero-shot condition. After fine-tuning with 10% of PM-CRER samples, the F1-score rises to 71.5%. When fine-tuned with 50% of the samples, the performance is close to that of the model trained on the full PM-CRER dataset. This indicates that the proposed model can effectively transfer semantic knowledge learned from the general domain and rapidly adapt to the cultural relic extraction task with only a small number of domain-specific samples.

4.6. Complexity Analysis Experiment

To evaluate the engineering practicality and deployment feasibility of the proposed model, complexity analysis experiments are conducted only on the general standard dataset DuIE2.0. The computational complexity of a model is determined solely by its network architecture and is independent of the training data size and domain. DuIE2.0, as an industry-wide benchmark, provides test results with broad comparability [35], whereas the self-built PM-CRER is a small-sample dataset on which timing measurements would fluctuate widely and therefore have limited reference value. The complexity comparison results between the proposed model and the baseline model CasRel are shown in Figure 10.

The experimental results show that the overall complexity increase of the proposed model is within an acceptable range. The number of parameters increases by only 6.3%, benefiting from the lightweight design of both MHSA-RPE and BGDSFM, which do not introduce a large number of trainable parameters. The training time and single-sample inference time increase by 27.4% and 24.6% respectively, mainly due to the additional computation of relative position encoding and the dual-branch feature fusion process. The runtime memory overhead increases only slightly, with CPU memory usage increasing by 2.7% and GPU memory usage by 6.2%, so the hardware resource requirements are basically the same as those of the baseline model.

To further evaluate the model’s scalability and online inference capability in real-world scenarios, we conduct supplementary performance comparisons under various test conditions. The results are presented in Table 12.

The experimental results show that under the standard test condition with a batch size of 64 and a sequence length of 128, the peak throughput of the proposed model reaches 37.9 samples per second, a drop of 19.7% compared with CasRel. Even so, it can still meet the requirements of most offline batch-processing scenarios. For single-sample inference, the average end-to-end latency is 12.3 ms and the 99th percentile latency is 18.9 ms, which satisfies the demands of latency-sensitive online services. When the sequence length is extended to 256, the variation rates of throughput and latency remain roughly the same as those under standard conditions, demonstrating the model’s favorable scalability for long sequences. In terms of concurrent request processing, the model achieves a QPS of 178 under the latency limit of 50 ms, representing a 17.2% decrease against CasRel, while still supporting medium-scale online services.
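As a rough consistency check on these figures, the sketch below relates a relative increase in per-sample time to the implied relative drop in throughput. It assumes that throughput is inversely proportional to per-sample time, which is a simplification that the paper does not state, since the throughput and single-sample latency measurements use different batch settings. The function names are illustrative, and the measured values in Table 12 remain authoritative.

```python
def throughput_drop(time_increase):
    """Relative throughput drop implied by a relative increase in per-sample time."""
    if time_increase <= -1.0:
        raise ValueError("time_increase must be greater than -1")
    return 1.0 - 1.0 / (1.0 + time_increase)


def implied_baseline(value, relative_drop):
    """Baseline value implied by a measured value and its relative drop."""
    if not 0.0 <= relative_drop < 1.0:
        raise ValueError("relative_drop must be in [0, 1)")
    return value / (1.0 - relative_drop)


# A 24.6% longer inference time implies a throughput drop of about 19.7%.
drop = throughput_drop(0.246)
# Baseline figures implied by the reported percentages (approximate, because the
# percentages themselves are rounded).
casrel_throughput = implied_baseline(37.9, 0.197)   # samples per second
casrel_qps = implied_baseline(178, 0.172)           # queries per second
```

Under this assumption the reported 24.6% increase in inference time and the 19.7% drop in throughput describe the same slowdown from two directions.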

Overall, the proposed model achieves a comprehensive improvement in entity and relation extraction performance at the cost of a moderate increase in complexity, possesses good engineering deployability, and can be directly deployed in existing hardware environments.

5. Chinese Cultural Relic Knowledge Graph System

Based on the subject-guided two-stage joint entity and relation extraction model proposed earlier, this paper constructs a complete Chinese Cultural Relic Knowledge Graph System. Although this study takes text-driven knowledge graph construction as its core, high-definition image URLs corresponding to cultural relic entities are also obtained by web crawlers in the data collection stage and stored in the Neo4j graph database as the image attribute of entities. This design gives the system multimodal display capability without changing the core semantic structure of the knowledge graph. When users view the details of cultural relic entities or browse entity lists and visualized graphs, they can see the real appearance of the cultural relics, which improves the experience and efficiency of knowledge acquisition. This section describes the overall architecture of the system, the implementation of its core functional modules and the interface demonstration.

5.1. Overall System Architecture

This system adopts the Browser/Server (B/S) architecture and is based on the design idea of front-end and back-end separation, which is divided into three layers: data layer, service layer and front-end presentation layer. The data layer is built upon the Neo4j native graph database, which stores 37 entity types (23 cultural relic entity types and 14 attribute entity types), 14 types of semantic relations and corresponding image attribute data, and supports efficient graph query and traversal operations [36]. The service layer provides data access interfaces through the Neo4j HTTP API; implements core business logic such as data query, statistical analysis and graph operation; and integrates the DeepSeek large model API to provide the system with natural language question answering and automatic knowledge triple-extraction capabilities [37]. The front-end presentation layer is developed based on native HTML5, CSS3 and JavaScript. It introduces Vis-Network to realize interactive visualization of the knowledge graph, uses Chart.js to draw data statistical charts, and provides a unified icon system through Font Awesome, finally building a responsive and cross-platform user interface [38]. The overall architecture of the system is shown in Figure 11.
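The way a service layer might query the data layer through the Neo4j HTTP API can be sketched as follows. The sketch only builds the request and does not send it. The node label `Relic`, the property names `name` and `image`, the credentials and the function name are hypothetical, since the paper does not publish its graph schema, and the endpoint path shown is the transactional one used by Neo4j 4.x and later, which may differ from the version the authors deployed.

```python
import base64
import json
import urllib.request

# Hypothetical schema: relic nodes labelled `Relic` with `name` and `image` properties.
QUERY = (
    "MATCH (r:Relic)-[rel]->(a) "
    "WHERE r.name = $name "
    "RETURN type(rel) AS relation, a.name AS value, r.image AS image "
    "LIMIT $limit"
)


def build_cypher_request(base_url, database, user, password, statement, parameters=None):
    """Build a POST request for the Neo4j transactional HTTP endpoint (not sent here)."""
    payload = {"statements": [{"statement": statement, "parameters": parameters or {}}]}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/db/{database}/tx/commit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_cypher_request(
    "http://localhost:7474", "neo4j", "neo4j", "secret",   # placeholder connection values
    QUERY, {"name": "Qing hua chan zhi lian wen mei ping", "limit": 25},
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Passing the entity name as a query parameter rather than concatenating it into the Cypher string avoids injection problems and lets the database reuse the query plan.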

5.2. Implementation of Core Functional Modules

Drawing on the methodological framework of multi-dimensional intelligent reorganization and utilization of knowledge [39], the system contains six core functional modules, covering the full-process requirements of cultural relic knowledge from overview to in-depth exploration. The data dashboard module provides real-time statistics and visual presentation of global knowledge graph data, including core indicators such as the total number of nodes, total number of relations, entity-type distribution and relation-type distribution, and provides a preview function for selected cultural relic entities. The entity browser module supports filtering by entity type and fuzzy keyword search, and displays cultural relic entity information in a card-based layout. Clicking on an entity opens its complete details, including images, attributes and association relations; the module also supports one-click positioning of the entity in the graph and automatic expansion of its first-level association relations. The entity browser and detail viewing interface are shown in Figure 12.

The relation explorer module provides an interactive visualization function for the cultural relic semantic relation network. Users can select specific relation types to load subgraphs, and can optimize the graph layout by adjusting physical parameters such as gravity strength, spring length and damping coefficient, supporting graph export and full-screen viewing.

The custom query module provides a Cypher query workbench for professional users, supporting free writing of query statements, and provides four result views: graph, table, text and raw JSON. Meanwhile, common query templates are preset to lower the usage threshold. The custom Cypher query workbench interface is shown in Figure 13.

The AI Chat Module integrates large model technology, supports natural language questioning, and can automatically call local graph data to answer questions. When local data is insufficient, the large model supplements the answer and automatically extracts knowledge triples from it to help users understand cultural relic knowledge. The intelligent question-answering assistant and knowledge triple-extraction interface are shown in Figure 14. The system also provides configuration management functions that allow users to customize database connection parameters, switch between light and dark themes and manage query history.

5.3. Chapter Summary

This section describes the implementation and demonstration of the Chinese Cultural Relic Knowledge Graph System. Based on a three-layer architecture design, the system integrates graph database, visualization technology and large model technology, and realizes core functions such as data dashboard, entity browser, relation explorer, custom query and AI chat. By storing images as entity attributes, the system retains the advantages of text knowledge graphs while gaining multimodal display capability, providing platform support for the visualized dissemination and intelligent query of cultural relic knowledge.

6. Conclusions

Aiming at the core pain point of fragmented and unstructured knowledge that is difficult to utilize in the cultural relic domain, this paper proposes a subject-guided two-stage joint entity and relation extraction model tailored to the characteristics of cultural relic texts. Built upon the cascaded labeling framework of CasRel, we introduce incremental improvements and develop the Multi-Head Self-Attention Decoder Enhanced by Relative Position Encoding (MHSA-RPE) and the Boundary–Global Dual-Branch Subject Fusion Module (BGDSFM). From the perspectives of position encoding and feature aggregation, the two modules specifically optimize the extraction performance for long entities and overlapping triples in cultural relic texts.

Comprehensive comparative experiments are conducted on the general dataset DuIE2.0 and the self-built cultural relic dataset PM-CRER. The results reveal that the proposed model achieves state-of-the-art performance on both datasets, with F1-scores of 79.4% and 75.9% respectively. Ablation experiments verify the necessity of all core modules. Specifically, removing MHSA-RPE and BGDSFM leads to F1-score drops of 1.6 and 1.5 percentage points respectively on PM-CRER. Experiments on overlapping triples show that the performance gain rises along with the increase in scenario complexity, reaching the maximum in the most complex subject–object overlapping scenario. Error analysis identifies boundary confusion as the predominant error type, accounting for 38.5% of total errors. Cross-domain validation proves the model has satisfactory generalization ability, and it can achieve promising adaptation with merely 10% of domain samples. Complexity analysis demonstrates that the computational overhead is acceptable: the total parameter count only increases by 6.3%, which enables practical engineering deployment.

Based on the proposed extraction model, this paper constructs a complete Chinese Cultural Relic Knowledge Graph System. Adopting a three-layer Browser/Server architecture, the system integrates graph database technology, visualization technology and large language model technology, and realizes six core functions including intelligent question answering and multimodal display. It provides technical references for the digital management and application of cultural relic knowledge.

This work also has several limitations. First, our innovations are targeted improvements and domain adaptations of existing components rather than a completely novel extraction paradigm. Second, the PM-CRER dataset is collected primarily from the Palace Museum, which introduces institutional bias. Third, large language models are evaluated as baselines only under the zero-shot setting; more effective LLM adaptation approaches such as instruction tuning are not explored in the experiments.

Future work will proceed in three directions: dataset construction, model optimization and multimodal fusion. For dataset construction, the coverage of PM-CRER will be extended to other museums, such as the National Museum of China and the Shanghai Museum, to enrich the cultural relic types and relation types and to improve the representativeness and completeness of the dataset. For model optimization, we will apply knowledge distillation to obtain a lightweight version of the proposed model, reducing inference latency while preserving extraction accuracy so as to meet the deployment requirements of mobile terminals and edge devices. For multimodal fusion, vision–language pre-trained models such as CLIP will be introduced to jointly extract knowledge from cultural relic images and texts, capture visual features such as decorations, shapes and colors, and construct a more comprehensive multimodal cultural relic knowledge graph.

Author Contributions

Conceptualization, Y.S. and Y.B.; methodology, Y.S.; software, Y.S.; validation, Y.S., X.Y., L.Z. and Q.Z.; formal analysis, Y.S.; investigation, Y.S., X.Y. and L.Z.; resources, Y.B.; data curation, X.Y. and Q.Z.; writing—original draft preparation, Y.S.; writing—review and editing, Y.B. and X.Y.; visualization, Y.S.; supervision, Y.B.; project administration, Y.B.; funding acquisition, Y.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Inner Mongolia Autonomous Region Science and Technology Plan Project (grant numbers 2025KJHZ0005 and 2022YFHH0102), the Basic Business Funding Project for Universities Directly under Inner Mongolia Autonomous Region (grant number BR220145), the Natural Science Foundation of Inner Mongolia of China (grant numbers 2025MS06007 and 2025MS06013), the Grassland Animal Husbandry Disciplinary Cluster at Inner Mongolia Agricultural University, and the Interdisciplinary Research Fund of Inner Mongolia Agricultural University (grant number BR231506). The APC was funded by the corresponding author.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The DuIE2.0 dataset used for baseline comparison in this study is publicly available at https://aistudio.baidu.com/datasetdetail/180082 (accessed on 2 May 2026). The self-built PM-CRER dataset generated during this study, which contains 7638 annotated cultural relic entity–relation triples, is publicly available at https://tianchi.aliyun.com/dataset/226201 (accessed on 2 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall System Framework for Cultural Relics Knowledge Graph Construction.

Figure 2. Architecture of subject-guided two-stage joint entity and relation extraction model.

Figure 3. Start–end pointer tagging mechanism for subjects and multi-relational objects. (a) Subject Tagger Sequence; (b) Relational Object Taggers Sequence.

Figure 4. Schematic diagram of BERT input embeddings and overall network structure.

Figure 5. Architecture of Multi-Head Self-Attention Decoder with Relative Position Encoding.

Figure 6. Architecture diagram of dual-branch subject feature fusion module.

Figure 7. Quantity distribution of various cultural relic entities in the self-built dataset.

Figure 8. Performance comparison of different models under different overlapping triple scenarios.

Figure 9. Attention Distribution Comparison on the Long Entity.

Figure 10. Computational complexity comparison between our model and CasRel.

Figure 11. Overall architecture of the Chinese Cultural Relic Knowledge Graph System.

Figure 12. Interface of the entity browser and detail view.

Figure 13. Interface of the custom Cypher query workbench.

Figure 14. Interface of the intelligent Q&A assistant and knowledge triple extraction.

Table 1. Statistics of relation distribution in PM-CRER Dataset.

Relation Type	Number of Triples	Proportion (%)	Relation Type	Number of Triples	Proportion (%)
Excavated at	519	1.2	Created by	1952	4.6
Numbered as	7534	17.8	Calligraphy of	809	1.9
Dating from	6972	16.4	Decorated with	3174	7.5
Size of	7509	17.7	Engraved with	1280	3.0
Type of	303	0.7	Adopting	2164	5.1
Characterized by	3819	9.0	Made from	311	0.7
Used for	351	0.8	Texture of	5689	13.4

Table 2. Annotation example of cultural relic entity and relation extraction.

Text	Subject	Subject_Type	Relation	Object	Object_Type
Silver bracelet, No. 00071656-2/19, Qing Dynasty. The bracelet is 2 cm wide and about 9 cm in diameter. It is silver-gilt, adopts the filigree technique, and its main decoration is two dragons playing with a pearl. The craftsmanship is exquisite. The dragon heads form the bracelet opening; to put the bracelet on, the movable bolt in the dragon’s mouth must be pulled out. The bracelet has a delicate and luxurious style and closely resembles, in craftsmanship and style, another inscribed Guangdong-made bracelet in the Palace Museum collection.	Silver bracelet	Gold, Silver & Tin Wares	Numbered as	00071656-2/19	Cultural Relic Number
	Silver bracelet	Gold, Silver & Tin Wares	Dating from	Qing Dynasty	Time
	Silver bracelet	Gold, Silver & Tin Wares	Size of	2 cm wide, about 9 cm in diameter	Size
	Silver bracelet	Gold, Silver & Tin Wares	Texture of	Silver-gilt	Material
	Silver bracelet	Gold, Silver & Tin Wares	Decorated with	Two dragons playing with a pearl	Decoration
	Silver bracelet	Gold, Silver & Tin Wares	Adopting	Filigree	Craftsmanship
	Silver bracelet	Gold, Silver & Tin Wares	Characterized by	Dragon heads	Appearance
	Silver bracelet	Gold, Silver & Tin Wares	Characterized by	Delicate and luxurious	Appearance

Table 3. Basic statistics of datasets.

Dataset Split	DuIE2.0	PM-CRER
Training Set	171293	5343
Validation Set	20674	765
Test Set	50583	1530

Table 4. Experimental environment setup.

Experimental Environment	Configuration
Operating System	Linux 3.10.0
GPU	Tesla V100S-PCIE-32GB (NVIDIA Corporation, Santa Clara, CA, USA)
CPU	Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Intel Corporation, Santa Clara, CA, USA)
Programming Language	Python 3.11.8
Deep Learning Framework	PyTorch 2.6.0 + cu118

Table 5. Core hyperparameter configuration of the model.

Parameter	Value	Meaning
maxlen	128	Truncation length of the text
batch_size	64	Batch size of training data
hidden_size	768	Number of neurons in hidden layer
transformer_layers	12	Number of Transformer layers in BERT
attention_heads	8	Number of multi-head self-attention heads
epochs	100	Maximum number of training epochs
learning_rate	6 × 10⁻⁵	Step size for parameter updates
weight_decay	1 × 10⁻⁴	Weight decay coefficient
threshold	0.6	Threshold for valid results
dropout	0.1	Dropout probability
warmup_steps	1000	Number of learning rate warmup steps
early_stopping_patience	10	Early stopping patience
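For reference, the hyperparameters in Table 5 can be collected in a single configuration object. The sketch below is not the authors' code: the class name `TrainConfig` and the helper `warmup_factor` are hypothetical, and the linear shape of the warmup is an assumption, since the table only gives the number of warmup steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 5; field names follow the table."""
    maxlen: int = 128                  # truncation length of the text
    batch_size: int = 64
    hidden_size: int = 768
    transformer_layers: int = 12       # Transformer layers in BERT
    attention_heads: int = 8
    epochs: int = 100
    learning_rate: float = 6e-5
    weight_decay: float = 1e-4
    threshold: float = 0.6             # threshold for valid results
    dropout: float = 0.1
    warmup_steps: int = 1000
    early_stopping_patience: int = 10


def warmup_factor(step: int, cfg: TrainConfig) -> float:
    """Learning-rate multiplier during warmup.

    Assumption: the rate grows linearly from 0 to its full value over
    `warmup_steps` and stays constant afterwards.
    """
    return min(1.0, step / cfg.warmup_steps)


cfg = TrainConfig()
for step in (0, 500, 1000, 5000):
    print(step, cfg.learning_rate * warmup_factor(step, cfg))
```

A frozen dataclass keeps the settings immutable during a run, which makes it harder to change a value by accident between training and evaluation.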

Table 6. Performance comparison of different models on DuIE2.0 dataset.

Model	P (%)	R (%)	F1 (%)	Jaccard (%)	GMean (%)	AUROC (%)
BERT-LSTM	73.4	71.5	72.4	56.7	72.1	84.2
CasRel	75.6	76.7	76.1	61.5	75.8	87.3
TPLinker	78.8	71.0	74.7	59.6	74.3	86.1
PRGC	75.3	76.4	75.8	61.1	75.5	87.0
PREBI	78.1	74.9	76.5	62.0	76.2	87.6
DERP	79.3	75.1	77.1	62.7	76.8	88.1
SPN4RE	78.8	76.7	77.7	63.5	77.4	88.5
CECRel	79.7	74.1	76.8	62.3	76.5	87.9
LLaMA-3-8B-Instruct (Zero-shot)	72.1	68.3	70.1	53.9	69.7	82.7
Qwen-2-7B-Instruct (Zero-shot)	74.3	70.5	72.3	56.5	72.0	84.1
GLM-4-9B-Chat (Zero-shot)	75.8	72.0	73.8	58.4	73.5	85.3
Our Model	80.5	78.4	79.4	65.8	79.1	89.7

Table 7. Performance comparison of different models on PM-CRER dataset.

Model	P (%)	R (%)	F1 (%)	Jaccard (%)	GMean (%)	AUROC (%)
BERT-LSTM	66.5 ± 0.67	63.3 ± 0.79	64.8 ± 0.52	48.0 ± 0.69	64.2 ± 0.58	79.5 ± 0.45
CasRel	74.6 ± 0.51	70.2 ± 0.58	72.3 ± 0.47	56.6 ± 0.55	71.9 ± 0.42	84.7 ± 0.36
TPLinker	69.8 ± 0.62	71.2 ± 0.57	70.5 ± 0.56	54.4 ± 0.61	70.1 ± 0.49	83.2 ± 0.38
PRGC	73.1 ± 0.56	70.5 ± 0.52	71.8 ± 0.45	56.0 ± 0.58	71.4 ± 0.40	84.3 ± 0.34
PREBI	74.9 ± 0.48	71.8 ± 0.55	73.3 ± 0.43	57.7 ± 0.50	72.9 ± 0.44	85.4 ± 0.30
DERP	75.2 ± 0.47	73.7 ± 0.50	74.4 ± 0.39	59.2 ± 0.48	74.0 ± 0.37	86.2 ± 0.28
SPN4RE	74.3 ± 0.49	75.1 ± 0.46	74.7 ± 0.41	59.6 ± 0.47	74.3 ± 0.38	86.5 ± 0.27
CECRel	76.4 ± 0.45	71.4 ± 0.54	73.8 ± 0.40	58.3 ± 0.49	73.4 ± 0.39	85.8 ± 0.31
LLaMA-3-8B-Instruct (Zero-shot)	62.5 ± 0.78	58.7 ± 0.85	60.5 ± 0.67	43.4 ± 0.76	59.8 ± 0.63	76.3 ± 0.49
Qwen-2-7B-Instruct (Zero-shot)	65.2 ± 0.75	61.4 ± 0.78	63.2 ± 0.67	46.2 ± 0.73	62.5 ± 0.62	78.1 ± 0.47
GLM-4-9B-Chat (Zero-shot)	67.8 ± 0.71	64.1 ± 0.80	65.9 ± 0.64	49.1 ± 0.69	65.2 ± 0.60	80.2 ± 0.45
Our Model	77.2 ± 0.38	74.7 ± 0.41	75.9 ± 0.32	61.1 ± 0.35	75.5 ± 0.29	87.3 ± 0.26
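The relation between the P, R, F1 and Jaccard columns of Tables 6 and 7 can be illustrated with a small sketch. It assumes exact-match scoring over sets of (subject, relation, object) triples and takes Jaccard as the size of the intersection divided by the size of the union; both are assumptions about the evaluation protocol, and the function names are hypothetical. The toy triples are taken from Table 2, with one deliberately truncated object to mimic a boundary error.

```python
Triple = tuple[str, str, str]  # (subject, relation, object)


def triple_metrics(pred: set[Triple], gold: set[Triple]) -> dict[str, float]:
    """Exact-match precision, recall, F1 and Jaccard over triple sets."""
    tp = len(pred & gold)                      # triples predicted exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    union = len(pred | gold)
    jaccard = tp / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * p * r / (p + r) if p + r else 0.0


gold = {
    ("Silver bracelet", "Dating from", "Qing Dynasty"),
    ("Silver bracelet", "Texture of", "Silver-gilt"),
    ("Silver bracelet", "Decorated with", "Two dragons playing with a pearl"),
}
pred = {
    ("Silver bracelet", "Dating from", "Qing Dynasty"),
    ("Silver bracelet", "Texture of", "Silver-gilt"),
    ("Silver bracelet", "Decorated with", "Two dragons"),  # truncated object span
}
print(triple_metrics(pred, gold))
print(round(f1_from_pr(80.5, 78.4), 1))  # P and R of "Our Model" in Table 6
```

Under exact matching, a single truncated span counts as both a false positive and a false negative, which is why boundary errors on long entities are costly for all four metrics.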

Table 8. Ablation experimental results of key modules.

Model Variant	DuIE2.0 F1 (%)	PM-CRER F1 (%)
Our Model	79.4 ± 0.23	75.9 ± 0.32
- BiLSTM Layer	78.9 ± 0.28	75.2 ± 0.36
- MHSA-RPE	77.8 ± 0.34	74.3 ± 0.41
- BGDSFM	78.1 ± 0.31	74.4 ± 0.33

Table 9. F1-score comparison of different models under different overlapping triple scenarios.

Dataset	Model	Normal	SEO	EPO	SOO
DuIE2.0	CasRel	73.6	77.8	75.5	58.7
DuIE2.0	Our Model	77.9	79.7	81.1	65.6
PM-CRER	CasRel	80.5	73.8	69.7	61.3
PM-CRER	Our Model	83.3	77.1	74.5	69.2
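The per-scenario gains over CasRel implied by Table 9 can be recomputed directly from the table. The snippet below only restates the published F1-scores; the dictionary layout and the function name `gains` are illustrative.

```python
SCENARIOS = ("Normal", "SEO", "EPO", "SOO")

# F1-scores (%) copied from Table 9.
TABLE_9 = {
    "DuIE2.0": {"CasRel": (73.6, 77.8, 75.5, 58.7), "Our Model": (77.9, 79.7, 81.1, 65.6)},
    "PM-CRER": {"CasRel": (80.5, 73.8, 69.7, 61.3), "Our Model": (83.3, 77.1, 74.5, 69.2)},
}


def gains(dataset: str) -> dict[str, float]:
    """F1 gain of the proposed model over CasRel, in percentage points."""
    base = TABLE_9[dataset]["CasRel"]
    ours = TABLE_9[dataset]["Our Model"]
    return {s: round(o - b, 1) for s, b, o in zip(SCENARIOS, base, ours)}


for name in TABLE_9:
    g = gains(name)
    print(name, g, "largest gain:", max(g, key=g.get))
```

On PM-CRER the gain increases at every step from Normal to SOO. On DuIE2.0 the largest gain is also in SOO, although the SEO gain is smaller than the Normal gain.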

Table 10. Statistics of model error-type distribution.

Error Type	Number of Errors	Proportion (%)
Boundary Confusion Error	385	38.5
Relation Misclassification Error	297	29.7
Object Omission Error	197	19.7
Subject Omission Error	85	8.5
Other Errors	36	3.6

Table 11. Cross-domain validation experimental results.

Experimental Setup	PM-CRER F1 (%)
Trained directly on PM-CRER	75.9
Trained on DuIE2.0, zero-shot test	62.3
Trained on DuIE2.0, fine-tuned on PM-CRER (10% samples)	71.5
Trained on DuIE2.0, fine-tuned on PM-CRER (50% samples)	74.2
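One way to read Table 11 is to ask how much of the gap between zero-shot transfer and direct in-domain training is closed by fine-tuning. The "gap recovered" ratio below is an illustrative derived quantity that the paper does not report, and the function name is hypothetical; only the four F1-scores come from the table.

```python
IN_DOMAIN_F1 = 75.9   # trained directly on PM-CRER
ZERO_SHOT_F1 = 62.3   # trained on DuIE2.0, no PM-CRER samples
FINE_TUNED_F1 = {"10%": 71.5, "50%": 74.2}  # DuIE2.0 training, then PM-CRER fine-tuning


def gap_recovered(f1: float) -> float:
    """Share (%) of the zero-shot to in-domain F1 gap closed by fine-tuning."""
    return round((f1 - ZERO_SHOT_F1) / (IN_DOMAIN_F1 - ZERO_SHOT_F1) * 100, 1)


for share, f1 in FINE_TUNED_F1.items():
    print(f"{share} of PM-CRER samples: {gap_recovered(f1)}% of the gap recovered")
```

With these numbers, 10% of the domain samples close roughly two thirds of the 13.6-point gap, and 50% close most of it.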

Table 12. Performance comparison of models in real-world scenarios.

Performance Metric	Test Condition	CasRel	Our Model	Change Rate (%)
Peak Throughput (samples/s)	batch_size = 64, sequence length = 128	47.2	37.9	−19.7
Single-Sample Inference Time (ms)	batch_size = 1, sequence length = 128	10.2	12.3	+20.6
99th Percentile Latency (ms)	batch_size = 1, sequence length = 128	15.7	18.9	+20.4
Throughput (samples/s)	batch_size = 16, sequence length = 256	62.3	49.8	−20.1
Average Latency (ms)	batch_size = 1, sequence length = 256	18.5	22.1	+19.5
Concurrent Request Processing Capacity (QPS)	Latency ≤ 50 ms	215	178	−17.2
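The Change Rate column in Table 12 appears to be the relative change with respect to CasRel, (ours − CasRel) / CasRel × 100. The short check below recomputes it from the two measurement columns under that assumption; the function name and the abbreviated row labels are illustrative.

```python
def change_rate(casrel: float, ours: float) -> float:
    """Relative change of the proposed model with respect to CasRel, in percent."""
    return round((ours - casrel) / casrel * 100, 1)


# (metric, CasRel, Our Model, reported change rate) copied from Table 12.
ROWS = [
    ("Peak throughput, bs=64, len=128 (samples/s)", 47.2, 37.9, -19.7),
    ("Single-sample inference time, len=128 (ms)", 10.2, 12.3, 20.6),
    ("99th percentile latency, len=128 (ms)", 15.7, 18.9, 20.4),
    ("Throughput, bs=16, len=256 (samples/s)", 62.3, 49.8, -20.1),
    ("Average latency, len=256 (ms)", 18.5, 22.1, 19.5),
    ("Concurrent capacity, latency <= 50 ms (QPS)", 215, 178, -17.2),
]

for metric, casrel, ours, reported in ROWS:
    print(f"{metric}: computed {change_rate(casrel, ours):+.1f}%, reported {reported:+.1f}%")
```

For throughput and QPS a negative change is a slowdown, whereas for latency a positive change is a slowdown, so every row points in the same direction: an overhead of roughly 17 to 21 percent relative to CasRel.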

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
