3.1. Datasets
All experiments in this work are conducted on two datasets: our self-built Palace Museum Cultural Relic Entity–Relation Dataset (PM-CRER) and the public general-domain dataset DuIE2.0 [22]. PM-CRER is used to verify the model’s specialized extraction performance in the vertical cultural relic domain, and DuIE2.0 is used to verify its cross-domain generalization ability in general scenarios.
The PM-CRER dataset is built from public data on the official website of the Palace Museum, from which 7870 original cultural relic records were crawled. During preprocessing, HTML tags and redundant whitespace are removed with regular expressions, texts with fewer than 10 characters are filtered out as invalid, traditional and simplified variant characters are normalized, full-width and half-width punctuation is standardized, and garbled or special characters from the web pages are corrected. Duplicate samples are then removed based on text content similarity, leaving 7638 high-quality cultural relic description texts. The distribution of cultural relic entities by type is shown in Figure 7, which covers 23 core cultural relic types. Paintings, ceramics, calligraphy, and jade and stone wares account for the largest shares of the data, which is consistent with the actual distribution of cultural relics in the Palace Museum collection.
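The preprocessing steps described above can be illustrated with a minimal sketch. The function names, the use of Unicode NFKC folding for full-width forms, the `difflib`-based similarity measure, and the 0.95 threshold are all assumptions made for illustration; the source does not specify them. Traditional–simplified normalization is omitted because it requires an external conversion library.

```python
import html
import re
import unicodedata
from difflib import SequenceMatcher

TAG_RE = re.compile(r"<[^>]+>")   # matches any HTML tag
WS_RE = re.compile(r"\s+")        # matches runs of whitespace


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, fold full-width forms, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    # Assumption: full-width characters are folded to half-width via NFKC.
    text = unicodedata.normalize("NFKC", text)
    return WS_RE.sub(" ", text).strip()


def deduplicate(texts, threshold=0.95):
    """Keep a text only if it is not near-identical to one already kept.

    Quadratic in the number of texts; adequate for a few thousand records.
    The threshold is a hypothetical value, not taken from the paper.
    """
    kept = []
    for text in texts:
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept


def preprocess(records, min_chars=10, threshold=0.95):
    """Clean every record, drop texts shorter than min_chars, then deduplicate."""
    cleaned = (clean_text(r) for r in records)
    long_enough = [t for t in cleaned if len(t) >= min_chars]
    return deduplicate(long_enough, threshold)
```

Under these assumptions, two records that differ only in spacing collapse into one, and a one-character record is discarded by the length filter.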
The entity and relation types are defined with reference to the industry standard Specification for Archives of Cultural Relics Collections (WW/T 0020-2008) [23], issued by the National Cultural Heritage Administration, and to the Census of State-owned Movable Cultural Relics—Classification Standard for Cultural Relics (Trial) [24]. The division also follows the guiding principles of Artificial Intelligence—Technical Framework for Knowledge Graph (GB/T 42131-2022) [25] and is further refined according to the semantic characteristics of cultural relic texts to ensure completeness and domain adaptability. In total, 37 entity types are defined. These comprise 23 types of cultural relic entities (ceramics; paintings; calligraphy; inscriptions; bronzeware; enamel; lacquerware; sculpture; gold, silver and tin wares; jade and stone wares; seals; textiles and embroidery; stationery; furniture; clocks and instruments; glassware; bamboo, wood, ivory, horn and gourd; court and religious relics; jewelry; military and ceremonial wares; music and opera relics; daily utensils; and foreign cultural relics) and 14 types of attribute entities (cultural relic number, time, location, size, material, appearance, function, person, version, font, decoration, inscription, theme, and craftsmanship). In addition, 14 core relations are defined: excavated at, numbered as, dating from, size of, type of, characterized by, used for, created by, calligraphy of, decorated with, engraved with, adopting, made from, and texture of. Together, these relations cover the core dimensions of cultural relic knowledge.
The relation distribution in the PM-CRER dataset is shown in Table 1. The three relations “Numbered as”, “Size of” and “Dating from” account for the highest proportions of samples, at 17.8%, 17.7% and 16.4% respectively, which is consistent with the semantic characteristics of cultural relic description texts. The relation distribution is imbalanced: the “Type of” and “Made from” relations have the fewest samples, each accounting for 0.7%. This is because the type information of most cultural relics in the Palace Museum collection is already included in their names, while the “Made from” relation applies only to cultural relics of specific materials.
The PM-CRER dataset is mainly sourced from the official website of the Palace Museum, so there is a certain institutional bias in the distribution of cultural relic types. Imperial cultural relics and cultural relics of the Ming and Qing dynasties account for a relatively high proportion, while the sample sizes of folk cultural relics and early ancient cultural relics are relatively small. In addition, the dataset only contains Chinese cultural relic description texts and does not cover cultural relic materials in other languages. In subsequent work, we will expand the sources of the dataset to reduce this bias.
Data annotation is carried out on the open-source doccano annotation platform [26] by three annotators who received unified training on the annotation specification and have basic knowledge of cultural relics. The annotation process adopts a “double independent annotation + cross-validation” mechanism [27]. For the consistency test, 10% of all annotated samples (764 items) are randomly selected, and Cohen’s Kappa coefficient is used to measure annotation consistency: the Kappa value is 0.89 for entity boundary and type annotation and 0.83 for relation-type annotation, both reaching the “excellent” consistency level and indicating reliable annotation quality. Annotation discrepancies are adjudicated by two domain experts with more than 5 years of research experience in Palace Museum cultural relics. The final cultural relic corpus contains a total of 42,386 entity–relation–object triples, and Table 2 shows a typical annotation example of cultural relic text.
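For reference, Cohen’s Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance, as (p_o − p_e) / (1 − p_e). The following is a minimal sketch of that computation; the function name and the toy labels are hypothetical, and how annotated spans were aligned before comparison is not stated in the source.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four items from two annotators.
annotator_1 = ["time", "size", "time", "person"]
annotator_2 = ["time", "size", "person", "person"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items (p_o = 0.75) with chance agreement p_e = 0.3125, so the coefficient is 7/11, roughly 0.64.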
DuIE2.0 is a Chinese entity and relation extraction benchmark dataset released for the Language and Intelligence Technology Competition in 2020, and it is currently the largest public Chinese extraction dataset in the industry. It covers multiple general domains such as news, encyclopedias and films, and contains more than 430,000 annotated triples and 210,000 Chinese sentences. With its rich relation types and large number of overlapping triple samples, it is widely used to verify the general performance of entity and relation extraction models.
In this paper, the PM-CRER dataset is randomly divided into training, validation and test sets in a 7:1:2 ratio, with no overlapping samples among the three; the DuIE2.0 dataset uses its officially released standard split. The split statistics of the two datasets are shown in Table 3.
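A random, non-overlapping 7:1:2 split of this kind can be sketched as follows. The function name, the fixed seed, and the rule of assigning rounding remainders to the test set are assumptions for illustration, so the resulting set sizes may differ slightly from the counts reported in Table 3.

```python
import random


def split_dataset(samples, ratios=(7, 1, 2), seed=42):
    """Shuffle once, then cut into disjoint train/validation/test slices."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumed seed)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder, including rounding leftovers
    return train, val, test


# Example with integer IDs standing in for the 7638 description texts.
train, val, test = split_dataset(range(7638))
```

Because the three sets are contiguous slices of a single shuffled list, each sample lands in exactly one of them.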
3.2. Evaluation Metrics
This study adopts Precision (P), Recall (R) and F1-score as the core evaluation metrics for the entity–relation triple-extraction task. To evaluate model performance more comprehensively, we also adopt several commonly used binary classification metrics: Geometric Mean (GMean), Sensitivity, Specificity, Jaccard Coefficient and Area Under the Receiver Operating Characteristic Curve (AUROC). The calculation formulas of each metric are as follows:
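$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{GMean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}}$$

$$\mathrm{Jaccard} = \frac{TP}{TP + FP + FN}$$

$$\mathrm{AUROC} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \mathbb{I}\left(s_i^{+} > s_j^{-}\right)$$

where TP denotes the number of correctly predicted relation triples, FP denotes the number of incorrectly predicted triples, FN denotes the number of undetected true triples, and TN denotes the number of correctly predicted negative samples. M is the number of positive samples, N is the number of negative samples, $s_i^{+}$ is the prediction score of the i-th positive sample, $s_j^{-}$ is the prediction score of the j-th negative sample, and $\mathbb{I}(\cdot)$ is the indicator function, which takes the value of 1 when the condition is satisfied and 0 otherwise.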
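As an illustration of how these quantities could be computed, the sketch below derives P, R, F1 and the Jaccard Coefficient from sets of gold and predicted triples, GMean from the four counts, and AUROC by pairwise comparison of positive and negative scores. It assumes the standard definitions of these metrics and exact-match comparison of triples; the function names and the example triples are hypothetical.

```python
import math


def triple_metrics(gold, pred):
    """Precision, recall, F1 and Jaccard from sets of (subject, relation, object)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)   # correctly predicted triples
    fp = len(pred - gold)   # predicted triples that are not in the gold set
    fn = len(gold - pred)   # gold triples the model missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": p, "recall": r, "f1": f1, "jaccard": jaccard}


def gmean(tp, fp, fn, tn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical gold and predicted triples for a single description text.
gold = {("vase", "dating from", "Qing"), ("vase", "size of", "30 cm"),
        ("vase", "decorated with", "dragon")}
pred = {("vase", "dating from", "Qing"), ("vase", "size of", "30 cm"),
        ("vase", "made from", "jade")}
scores = triple_metrics(gold, pred)
```

Here two of the three predictions are correct and one gold triple is missed, so precision, recall and F1 are all 2/3 and the Jaccard Coefficient is 0.5. The pairwise AUROC is quadratic in the number of samples, which is acceptable for illustration but not for large test sets.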
3.3. Setup
All models in this study are developed, trained and evaluated in a single, unified software and hardware environment. At the software level, Python 3.11.8 is used as the development language, the core model logic is implemented in the PyTorch 2.6.0 deep learning framework, and CUDA 11.8 is used for GPU-accelerated parallel computing. At the hardware level, all experiments are run on a Linux server cluster equipped with multi-core CPUs and high-performance GPUs. The detailed configuration of the experimental environment is shown in Table 4.
The subject-guided two-stage joint entity and relation extraction model proposed in this paper uses the BERT-base pre-trained model, as distributed on the Hugging Face platform, as its basic semantic encoder, relying on its bidirectional context modeling to extract the initial character-level semantic features of cultural relic texts. The model is trained with the AdamW optimizer and an early stopping strategy: training is terminated when the validation F1-score has not improved for 10 consecutive epochs, which helps mitigate overfitting. The core hyperparameter configuration used in the experiments is shown in Table 5.
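The early stopping rule can be sketched independently of the training framework. The class below is a minimal illustration, not the authors’ implementation: the class and attribute names are hypothetical, and it assumes that a validation F1-score is computed once per training round and that only a strict improvement resets the counter.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive rounds without F1 improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_f1 = float("-inf")
        self.bad_rounds = 0

    def step(self, val_f1):
        """Record one round's validation F1; return True if training should stop."""
        if val_f1 > self.best_f1:
            self.best_f1 = val_f1   # new best: a checkpoint would be saved here
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1    # ties count as no improvement (assumption)
        return self.bad_rounds >= self.patience


# Usage with a made-up F1 history and a short patience for illustration.
stopper = EarlyStopping(patience=3)
history = [0.50, 0.60, 0.59, 0.60, 0.58, 0.70]
stopped_at = None
for epoch, f1 in enumerate(history):
    if stopper.step(f1):
        stopped_at = epoch
        break
```

With a patience of 3, this run stops at index 4, before the later improvement to 0.70 is ever seen; a larger patience, such as the 10 rounds used in this paper, makes that kind of premature stop less likely.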
All baseline models are implemented from the official code released with their original papers, using the optimal hyperparameter configurations recommended there. For models without official code, we reproduce them strictly according to the descriptions in the original papers and train and test them in the same experimental environment. To ensure the fairness and comparability of the results, all models use the same dataset splits and preprocessing pipelines.
3.4. Baseline Models
To verify the performance of the proposed subject-guided two-stage joint entity and relation extraction model, we select 11 representative baseline methods covering different technical paradigms for a comprehensive comparison.
- (1) Bert-LSTM [28]: A classic pipeline baseline, which adopts the “BERT semantic encoding + BiLSTM sequence feature extraction” architecture and serves as a general performance benchmark for all deep learning extraction models.
- (2) CasRel [8]: The direct prototype of the model in this paper, which first proposed the cascaded binary labeling framework, effectively solved the overlapping triple problem, and has become an important milestone in the field.
- (3) TPLinker [10]: A representative single-stage linking model, which transforms the extraction task into a token-pair linking problem and realizes true end-to-end joint extraction.
- (4) PRGC [9]: A core improvement of CasRel, which proposes a three-stage architecture of “latent-relation-prediction–subject-extraction–object-extraction” and significantly reduces computational complexity.
- (5) PREBI [13]: An enhanced model specifically for entity boundary recognition, which improves entity extraction accuracy by explicitly introducing boundary information.
- (6) DERP [14]: A dual-head entity and relation prediction framework, which divides the entity recognition process into two stages, head entity recognition and tail entity recognition, and introduces a triple prediction module to improve the accuracy and completeness of extraction.
- (7) SPN4RE [29]: A single-stage model based on semantic-aware pointer network, which optimizes the extraction effect in complex scenarios by introducing semantic prior knowledge.
- (8) CECRel [12]: The latest cascaded state-of-the-art (SOTA) model, which improves extraction performance in general domains through context enhancement and entity-relation interaction mechanisms.
- (9) LLaMA-3-8B-Instruct [30]: An instruction-tuned large language model that performs entity and relation extraction in a zero-shot setting.
- (10) Qwen-2-7B-Instruct [31]: An open-source large language model with strong Chinese language capabilities, which performs entity and relation extraction in a zero-shot setting.
- (11) GLM-4-9B-Chat [32]: A general-purpose large language model developed by Tsinghua University, which performs entity and relation extraction in a zero-shot setting.
These 11 baselines cover the core technical routes in entity and relation extraction: pipeline architectures, cascaded joint extraction, single-stage linking-based extraction, boundary- and semantic-enhanced methods, and the latest large language model (LLM)-driven extraction methods. Comparison against them allows the proposed model to be evaluated from three perspectives: the effectiveness of the overall architecture, the contribution of the core improvements, and the adaptability to scenarios specific to the cultural relic domain, such as small samples and blurred entity boundaries.