Full-Abstract Biomedical Relation Extraction with Keyword-Attentive Domain Knowledge Infusion

Abstract: Relation extraction (RE) is an essential task in natural language processing. Given a context, RE aims to classify an entity-mention pair into a set of pre-defined relations. In the biomedical field, building an efficient and accurate RE system is critical for the construction of a domain knowledge base to support upper-level applications. Recent advances have witnessed a focus shift from sentence- to document-level RE problems, which are more challenging due to the need for inter- and intra-sentence semantic reasoning. This type of distant dependency is difficult to understand and capture for a learning algorithm. To address the challenge, prior efforts either attempted to improve the cross-sentence text representation or infuse domain or local knowledge into the model. Both strategies demonstrated efficacy on various datasets. In this paper, a keyword-attentive knowledge infusion strategy is proposed and integrated into BioBERT. A domain keyword collection mechanism is developed to discover the most relation-suggestive word tokens for bio-entities in a given context. By manipulating the attention masks, the model can be guided to focus on the semantic interaction between bio-entities linked by the keywords. We validated the proposed method on the BioCreative V Chemical Disease Relation dataset with an F1 of 75.6%, outperforming the state-of-the-art by 5.6%.

Table 2 (abstract sample, excerpt): 20633755|a|Suxamethonium caused prolonged apnea in patients in whom pseudocholinesterase enzyme gets deactivated by organophosphorus (OP) poisons. Here, we present a similar incident in a severely depressed patient who received electroconvulsive therapy (ECT). Prolonged apnea, in our case, ensued because the information about a suicidal attempt by OP compound was concealed from the treating team.


Introduction
Relation extraction (RE) is a primitive task in natural language processing (NLP). In the context of supervised learning, RE refers to the classification of an entity pair to a set of known relations [1] in a given document or sentence. RE is widely used in biomedical text mining and is usually performed after named entity recognition (NER), jointly discovering and extracting patterns and knowledge from unstructured textual data. Powered by the latest NER and RE algorithms, computers can quickly and accurately identify biomedical entity mentions and the relations between them to build a domain-specific knowledge base to support upper-level applications.
Traditional learning-based methods for RE can be divided into two categories, including feature-based and kernel-based methods [1], which either rely on hand-crafted features or elaborately-designed kernels to perform classification. These methods usually incur error propagation through the learning pipeline, which largely limits the model performance. The rise of deep learning-based models has accelerated the development of a broad spectrum of learning tasks, and RE has also benefited from deep neural models.
One line of efforts takes advantage of pretrained language models, such as Bidirectional Encoder Representations from Transformers (BERT) [2], Embeddings from Language Models (ELMo) [3], and XLNet [4], which can be fine-tuned for the RE task and present superior performance. On the other hand, graph neural networks (GNNs) [5][6][7][8] have also been extensively investigated in RE due to their intuitive modeling and semantic interpretation capability. In the biomedical field, human annotation is costly because annotators need domain knowledge that is hard to come by. Distant supervised learning [9] was, thus, developed to alleviate the problem and speed up annotation.
Recent RE advances have seen a shift from sentence-level RE to document-level RE. The latter is more challenging due to the need for inter- and intra-sentence reasoning. In other words, the relation of an entity pair could span multiple sentences, creating a long-range semantic dependency that is hard to detect. Prior efforts attempted to tackle this challenge in two ways: (1) encoding cross-sentence text representations to facilitate distant semantic reasoning [5][6][7][10] and (2) infusing domain or local knowledge into the model to guide training and inference [11][12][13]. Our study belongs to the latter category. The main hypothesis of this study is that keyword-based domain knowledge can benefit the learning task of document-level RE in the biomedical field.
Our goal is to investigate the role of keywords in RE and their function in performance boosting. To verify this hypothesis, we propose a keyword-attentive knowledge infusion strategy that can be integrated into the BERT neural architecture. The strategy is driven by a custom process of domain keyword collection that aims to discover the most informative tokens that are highly relation-suggestive for bio-entity pairs in a given context. Through the keyword attention masks, the model is guided to focus on the semantic interaction between the bio-entities linked by the keywords. We adopt BioBERT, which has been pretrained on over a million PubMed articles. BioBERT is fine-tuned with the addition of a keyword attention layer for relation classification. Thus, the proposed method is named Kw-BioBERT. Our main contributions are as follows.
• We employ a BERT-based keyword-attentive neural architecture, named Kw-BioBERT, for document-level biomedical RE.
• A novel domain keyword collection mechanism is proposed to effectively capture relation-suggestive keywords for knowledge infusion.
• The proposed method is validated on the BioCreative V Chemical Disease Relation (CDR) dataset. The results show that the proposed method outperformed the SOTA by 5.6% in F1 and, thus, can serve as a credible baseline for the CDR dataset.
The rest of this paper is structured as follows. Section 2 covers the prior efforts relevant to this study. Section 3 describes the CDR dataset and the design details of the proposed method. Section 4 provides the implementation details, experimental settings, and results. Section 5 summarizes the work with the limitations and future directions.

Related Work
Recent advances in RE have witnessed a wide spectrum of methods and models. In this section, a review of the closely relevant efforts is provided.

Knowledge Infusion in RE
Knowledge infusion is a common strategy [14,15] to handle low-resource learning tasks with limited supervision. In RE, knowledge infusion has also been found effective. Roy et al. [11] employed the Drug Abuse Ontology (DAO) [16] to determine entity mentions and relations. Similar efforts have appeared in Sousa et al. [12]. In addition to the domain knowledge, local semantic knowledge can also be infused to guide the training. Yu et al. [13] added a position-enhanced module to the BERT neural architecture to encode relative locations between entities. Our proposed method infuses domain knowledge in two ways: (1) through the biomedical knowledge carried by the pre-trained BioBERT language model and (2) through a keyword-attentive layer that guides the training to focus on the entity interaction via informative keywords, which, to the best of our knowledge, has not been seen in prior studies.
We focus on reviewing the second line of work since it is more relevant to our study. Shi et al. [27] proposed a strategy that directly utilizes BERT for RE by only changing the input format to include a document and the entity mentions separated by a [sep] token. Su and Vijay-Shanker [28] proposed a novel fine-tuning process that utilizes all of the outputs from the last transformer layer in BERT, leading to a performance gain. In addition to the base BERT, two of its variants, BioBERT [29] and SciBERT [30], have emerged with a stronger embedding capability for scientific publications and have gained popularity in the RE task [31][32][33][34]. In this work, BioBERT was chosen since it has been pre-trained on over a million PubMed articles, making it very competitive in biomedical NLP tasks.

Document-Level RE
Document-level RE has recently gained increasing interest in the NLP community since documents provide richer semantic information than sentences. Several datasets that focus on document-level RE have been developed, such as CDR [35,36], DocRED [37], and GDA [38], which have driven the development of innovative models. One line of prior efforts [5][6][7][10] explored ways to conduct inter- and intra-sentence reasoning [25,39], a major challenge in document-level RE.
Gu et al. [10] employed a maximum entropy (ME) model and a CNN model for inter- and intra-sentence RE, respectively. Bi-affine Relation Attention Network (BRAN) [40] stacks a series of transformers [41] followed by head and tail MLPs and a bi-affine operation that encodes the pairwise token prediction in a 3D tensor. Graph neural networks (GNNs) [5][6][7][8] have also been a popular choice due to their intuitive modeling ability in RE, where named entities and relations can be modeled as nodes and edges in a graph. Sahu et al. [5] developed a GNN-based model to capture both local and non-local dependency between entity mentions.
Similarly, Wang et al. [6] designed a GNN model, named GLRE, that encodes and aggregates global and local entity and relation representation. Christopoulou et al. [7] proposed an edge-oriented graph (EoG) neural model that leverages multi-instance learning to enhance intra- and inter-sentence reasoning. Compared to the prior studies that focused on modeling and reasoning, our work focuses on domain knowledge infusion, which has not been extensively explored in the field of document-level RE. One relevant work is by Sousa et al. [12], which injected domain ontology knowledge into the model, resulting in performance gains. Our work, in contrast, investigates the role of keywords, which is a novel method of knowledge infusion.

Datasets
The CDR dataset [35,36] was adopted in this work to evaluate the proposed model. The CDR dataset models chemical-disease relations, namely chemical-induced disease (CID) relations, and is created at the abstract level with entity-linked mention annotations, which are characterized by long-range and cross-sentence relations. Specifically, a CID relation marked in the dataset could be either a putative mechanistic relation or a biomarker relation. The former means that the chemical is a potential etiology of the disease (e.g., cancer X is caused by exposure to chemical Y); the latter, on the other hand, indicates a correlation between the chemical and the disease (e.g., an increased abundance of chemical X in the brain correlates with disease Y).
The two relation sub-types are treated as a unified CID relation, creating a binary classification problem, i.e., CID/non-CID relation. According to [35], the development of the dataset involved four annotators with a medical training background. A double-annotation strategy was adopted; namely, each abstract was independently labeled by two annotators. Disputes were resolved by a third, senior annotator. All annotations were performed using PubTator [42]. Table 1 shows the statistical information of the CDR dataset, where 1500 abstracts are equally divided into training, development, and test sets. The mentions of diseases and chemicals are about equally distributed across the three sets. The number of positive samples, namely the chemical-disease (CD) pairs that present a CID relation, is 3116, which is about one-fourth of the number of negative samples, i.e., the CD pairs without a CID relation. The imbalanced distribution of classes brings difficulty to both training and evaluation. In addition to the original training set shown in Table 1, the task contains an additional training set [43] of 15,448 weakly labeled PubMed abstracts with 26,657 positive CID relations and 146,057 negative ones. This extra data is used as a secondary source for training.
Table 2 displays an abstract sample with annotations in the CDR dataset. The first two sections are the original article title and abstract. The third section lists the entity mentions, where each row follows a format of "PMID offset length mention_text entity entity_ID", which describes an entity mention with an exact location. The last section lists the CID relations that follow a format of "PMID relation_type head_entity tail_entity".
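The row formats described for Table 2 can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: it assumes that mention and relation rows are tab-separated and that the title and abstract rows use the "PMID|t|..." and "PMID|a|..." prefixes seen in the abstract sample. The names `Document` and `parse_document`, the sample title, offsets, and the entity IDs `CHEM_1` and `DIS_1` are illustrative placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Document:
    """One annotated abstract (hypothetical container for illustration)."""
    pmid: str = ""
    title: str = ""
    abstract: str = ""
    # (offset, length, mention_text, entity_type, entity_id), following the
    # row description in the text; some releases may store an end offset
    # in the third column instead of a length.
    mentions: List[Tuple[int, int, str, str, str]] = field(default_factory=list)
    # (relation_type, head_entity_id, tail_entity_id)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


def parse_document(block: str) -> Document:
    """Parse one document block; tab-separated annotation rows are assumed."""
    doc = Document()
    for line in block.strip().splitlines():
        head = line.split("|", 2)
        # Title/abstract rows look like "PMID|t|text" or "PMID|a|text".
        if len(head) == 3 and head[1] in ("t", "a") and "\t" not in head[0]:
            doc.pmid = head[0]
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        fields = line.split("\t")
        if len(fields) == 4:      # PMID, relation_type, head, tail
            doc.relations.append((fields[1], fields[2], fields[3]))
        elif len(fields) >= 6:    # PMID, offset, length, text, type, ID
            doc.mentions.append(
                (int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5])
            )
    return doc


# Illustrative rows only; the IDs are placeholders, not real identifiers.
SAMPLE = "\n".join([
    "20633755|t|Suxamethonium induced prolonged apnea in a patient "
    "receiving electroconvulsive therapy.",
    "20633755|a|Suxamethonium caused prolonged apnea in patients in whom "
    "pseudocholinesterase enzyme gets deactivated by organophosphorus (OP) poisons.",
    "20633755\t0\t13\tSuxamethonium\tChemical\tCHEM_1",
    "20633755\t32\t5\tapnea\tDisease\tDIS_1",
    "20633755\tCID\tCHEM_1\tDIS_1",
])

doc = parse_document(SAMPLE)
```

Keeping mentions and relations in separate lists mirrors the sectioned layout of the annotated sample, so CD pairs can later be enumerated from the mentions and labeled from the relations.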

Learning Problem
Given the CDR dataset, the learning task can be formulated as follows. Let $D_{\text{train}}$, $D_{\text{dev}}$, and $D_{\text{test}}$ denote the training, development, and test sets, respectively. Each instance in the dataset can be defined as $(x_i, y_i)$, where $x_i = (A_k, C_m, D_n)$ represents a CD pair $(C_m, D_n)$ that appears in an abstract $A_k$ ($i$, $k$, $m$, and $n$ are indices), and $y_i$ is a binary target, where 1 indicates a CID relation of the CD pair, i.e., a positive instance, and 0 otherwise. The learning problem is to develop a model that takes $x_i \in D_{\text{test}}$ as input and makes a prediction $\hat{y}_i$ that approximates the ground truth $y_i$ as closely as possible. It is noted that the problem belongs to document-level RE, where the head and tail entity mentions could span across multiple sentences in the abstract.
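To make the instance definition concrete, the sketch below enumerates the (abstract, chemical, disease) triples for one abstract and assigns the binary target. It assumes that every chemical-disease combination found in the abstract is a candidate pair, which the text does not state explicitly; `build_instances` and the entity IDs are hypothetical.

```python
from itertools import product
from typing import List, Set, Tuple

Instance = Tuple[Tuple[str, str, str], int]  # ((A, C, D), y)


def build_instances(
    abstract: str,
    chemical_ids: List[str],
    disease_ids: List[str],
    cid_pairs: Set[Tuple[str, str]],
) -> List[Instance]:
    """Pair every chemical with every disease and label the pair.

    y = 1 if (chemical, disease) is annotated as a CID relation, else 0.
    """
    instances = []
    chemicals = sorted(set(chemical_ids))
    diseases = sorted(set(disease_ids))
    for chem, dis in product(chemicals, diseases):
        label = 1 if (chem, dis) in cid_pairs else 0
        instances.append(((abstract, chem, dis), label))
    return instances


# Placeholder IDs for illustration.
examples = build_instances(
    abstract="Suxamethonium caused prolonged apnea in patients ...",
    chemical_ids=["CHEM_1", "CHEM_2"],
    disease_ids=["DIS_1"],
    cid_pairs={("CHEM_1", "DIS_1")},
)
```

Under this enumeration, negatives arise from every unannotated combination, which is consistent with negatives outnumbering positives in the dataset.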

System Framework
The system framework is given in Figure 1. First, the BioBERT base model is fine-tuned using the training set. The tuned BioBERT model is used for keyword extraction, generating a collection of seed keywords that are highly relation-suggestive. The seed keyword set is then expanded to form the final domain-specific set of keywords. We modify the BioBERT network by adding a keyword-attentive layer in parallel with the last transformer layer, similar to [44]. The resulting Kw-BioBERT model is then fine-tuned on the training set, with the keywords injected into the model as external domain knowledge. The tuned Kw-BioBERT is evaluated on the test set to obtain the final result.

Figure 1. System framework. BioBERT is fine-tuned on the CDR training set. Then, the keyword extraction algorithm is applied to the tuned BioBERT model to generate a set of seed keywords, which is expanded to form the final keyword set. BioBERT is then changed to Kw-BioBERT and further tuned on the training set with the keyword attention mechanism enabled. Finally, the tuned Kw-BioBERT is evaluated on the CDR test set.

Network Architecture
A network architecture (as shown in Figure 2) similar to [44] is adopted. However, our version differs from the original design in two respects: the input form and the keyword manipulation. The input is a sequence pair $(\text{seq}_A, \text{seq}_B)$, where $\text{seq}_A$ starts with a [cls] token, ends with a [sep] token, and has a tokenized full abstract $A$ in the middle; $\text{seq}_B$ specifies the head and tail entity mentions. For the CDR dataset, $\text{seq}_B$ consists of a chemical entity mention $C$ (the head) and a disease entity mention $D$ (the tail), both appearing in $A$. The model's job is to understand the semantic relation between $C$ and $D$, given $A$ as a context. The neural architecture is modified from BioBERT (which shares the BERT architecture) by adding a keyword-attentive layer side by side with the last transformer encoder layer in BERT.
The keyword-attentive layer differs from a transformer encoder in two aspects. First, in the transformer encoder, the attention masks are used to mask the padding tokens so that they do not participate in attention; in the keyword-attentive layer, however, the attention masks are also manipulated to allow tokens in $\text{seq}_A$ to attend only to the two entity mentions in $\text{seq}_B$ and to allow the two tokens in $\text{seq}_B$ to attend only to the keywords in $\text{seq}_A$. In other words, for each token in $\text{seq}_A$, we only care about its attention (or impact) on the two entity mentions in $\text{seq}_B$; also, for each token in $\text{seq}_B$, we only consider its attention on the keywords in $\text{seq}_A$.
With the manipulated attention masks, the model learns how the entity mentions and the keywords interact and jointly determine the relation. The output of the keyword-attentive layer is a vector of hidden states $[h_{\text{cls}}, h_{A_1}, h_{A_2}, \ldots, h_{B_1}, h_{B_2}, h_{\text{sep}}]$, which has the same size as the input. The hidden state vector can be divided into two sections corresponding to $\text{seq}_A$ and $\text{seq}_B$. Then, a pooling operation is applied to each section individually, producing $h_{kw,A}$ and $h_{kw,B}$, which represent the aggregated and keyword-attentive embeddings for $\text{seq}_A$ and $\text{seq}_B$, respectively. The semantic difference between $h_{kw,A}$ and $h_{kw,B}$ is denoted as $h_{\text{diff}}$, whose definition uses the concatenation operation $[\,;\,]$. Next, four pieces of information are concatenated, namely the $h_{\text{cls}}$ from the last transformer encoder, $h_{kw,A}$, $h_{kw,B}$, and $h_{\text{diff}}$, and the resulting vector is fed into the detection head, which consists of a dense layer and a softmax function.

Keywords Collection
The goal of keyword collection is to discover the word tokens that are relation-suggestive. For instance, in "Suxamethonium induced prolonged apnea in a patient receiving electroconvulsive therapy.", the two entity mentions "Suxamethonium" and "apnea" are linked via the verb "induced", making it a keyword that suggests a CID relation. In another example, "Myasthenia gravis presenting as weakness after magnesium administration.", the dominant keyword is not apparent, and the words "presenting", "after", and "administration" jointly affect the CID relation between "magnesium" and "Myasthenia gravis".
The process of identifying the seed keywords is as follows. For each positive CD pair in each abstract within the CDR training set, we do the following: the instance (A, C, D) is fed into the BERT model that has been tuned on the training set to obtain a prediction. If the prediction does not match the ground truth, we move on to the next CD pair; otherwise, the abstract is scanned token by token. Specifically, each token that is not (1) an entity mention, (2) punctuation, or (3) a stop word is masked in the abstract to obtain $A_{\text{masked}}$; then, $(A_{\text{masked}}, C, D)$ is fed into BERT again, and the change in the output confidence is recorded.
The three tokens that cause the largest confidence drop are kept and added to the candidate keyword set. The rationale is that if masking a token in the abstract leads to a significant confidence drop, the token is highly relation-suggestive for the CD pair, given the abstract as a context. This way, a set of candidate keywords is collected and further manually selected to form a seed keyword set. Examples of these keywords include "induced", "statistically", "maintenance", "consumption", and "idiopathic".
More keyword examples are provided in Section 4.2.
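The masking procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the tuned model is abstracted as a `confidence` callable that returns the positive-class probability for a token list, masking is done by substituting a "[MASK]" placeholder, and a prediction counts as correct for a positive pair when the confidence is at least 0.5. The function name, the threshold, and `toy_confidence` are hypothetical.

```python
from typing import Callable, List, Sequence, Set

MASK = "[MASK]"  # assumed placeholder; the text only says the token is masked


def top_confidence_drops(
    tokens: Sequence[str],
    skip: Set[str],
    confidence: Callable[[Sequence[str]], float],
    k: int = 3,
    threshold: float = 0.5,
) -> List[str]:
    """Return up to k tokens whose masking lowers the confidence the most.

    `skip` holds lower-cased entity mentions, punctuation, and stop words.
    """
    base = confidence(tokens)
    if base < threshold:
        # The positive pair is misclassified, so this pair is skipped.
        return []
    drops = []
    for i, tok in enumerate(tokens):
        if tok.lower() in skip:
            continue
        masked = list(tokens)
        masked[i] = MASK
        drops.append((base - confidence(masked), tok))
    # Largest drop first; ties keep their original token order.
    drops.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for _, tok in drops[:k]]


def toy_confidence(tokens: Sequence[str]) -> float:
    """Stand-in scorer used only to exercise the function."""
    score = 0.5
    if "induced" in tokens:
        score += 0.4
    if "prolonged" in tokens:
        score += 0.05
    return score


candidates = top_confidence_drops(
    tokens=["Suxamethonium", "induced", "prolonged", "apnea", "."],
    skip={"suxamethonium", "apnea", "."},
    confidence=toy_confidence,
)
```

With the toy scorer, masking "induced" removes most of the confidence, so it ranks first. In the described pipeline, the candidates gathered this way are then filtered manually.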
To enhance the keyword diversity, the synonyms of the seed keywords are added into the keyword set. For instance, the word "induce" is semantically close to "produce", "cause", "effect", and "provoke", and could be used interchangeably when describing a CID relation. After adding the synonyms, the final keyword set is created and ready for use.
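The expansion step amounts to a set union over a synonym lookup. In the sketch below, the lookup is a hand-written dictionary standing in for a thesaurus resource; `expand_keywords` and the table contents (taken from the "induce" example above) are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set


def expand_keywords(seeds: Iterable[str], synonyms: Dict[str, List[str]]) -> Set[str]:
    """Return the seed keywords together with their listed synonyms."""
    seeds = list(seeds)
    expanded = set(seeds)
    for word in seeds:
        expanded.update(synonyms.get(word, ()))
    return expanded


# Hypothetical lookup table; a real run would query a lexical database.
SYNONYMS = {"induce": ["produce", "cause", "effect", "provoke"]}

keywords = expand_keywords(["induce", "idiopathic"], SYNONYMS)
```

Seeds without an entry in the table are kept unchanged, so the expanded set is never smaller than the seed set.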

Keyword-Attentive Knowledge Infusion
We take advantage of the attention mask feature implemented in BERT. Figure 3a shows a positive instance with a chemical entity mention (CEM) "Suxamethonium" and a disease entity mention (DEM) "apnea", linked by a keyword "induced". To ensure that each token in $\text{seq}_A$ attends only to the two tokens in $\text{seq}_B$ and that each token in $\text{seq}_B$ attends only to the keyword tokens in $\text{seq}_A$, we employ a binary matrix, denoted by $M_{\text{AttMsk}}$, of size $l \times l$. Let $l_A$ and $l_B$ denote the lengths of $\text{seq}_A$ and $\text{seq}_B$, respectively.
We then have $l = l_A + l_B$. Generally, the $i$th row in $M_{\text{AttMsk}}$ specifies how token $i$ of the input attends to other tokens. In particular, $M_{\text{AttMsk}}(i, j) = 1$ indicates that token $i$ attends to token $j$, and $M_{\text{AttMsk}}(i, j) = 0$ means that token $i$ does not attend to token $j$. To fulfill our needs, all tokens in $\text{seq}_A$ share a common attention mask vector, with all zeros in the first $l_A$ positions and two ones in the two positions corresponding to the entity mentions in $\text{seq}_B$. On the other hand, all tokens in $\text{seq}_B$ share an attention mask vector with zeros in all positions except those where the keywords reside. Figure 3b shows a complete example of $M_{\text{AttMsk}}$, given the input in Figure 3a.
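The mask construction can be written out directly from the two rules above. The following sketch builds the binary matrix as nested lists; it ignores padding, assumes seq B holds exactly the two entity mention tokens, and is not a reproduction of Figure 3b. The function name and the example token layout are assumptions.

```python
from typing import List, Sequence


def build_keyword_attention_mask(
    len_a: int,
    len_b: int,
    entity_positions: Sequence[int],
    keyword_positions: Sequence[int],
) -> List[List[int]]:
    """Build the l x l binary mask, where l = len_a + len_b.

    Rows for seq A tokens attend only to the entity positions in seq B;
    rows for seq B tokens attend only to the keyword positions in seq A.
    """
    l = len_a + len_b
    if any(not (len_a <= j < l) for j in entity_positions):
        raise ValueError("entity positions must lie inside seq B")
    if any(not (0 <= j < len_a) for j in keyword_positions):
        raise ValueError("keyword positions must lie inside seq A")

    mask = [[0] * l for _ in range(l)]
    for i in range(len_a):            # rows of seq A
        for j in entity_positions:
            mask[i][j] = 1
    for i in range(len_a, l):         # rows of seq B
        for j in keyword_positions:
            mask[i][j] = 1
    return mask


# Assumed layout: seq A = [cls] Suxamethonium induced prolonged apnea . [sep]
#                 seq B = Suxamethonium apnea
mask = build_keyword_attention_mask(
    len_a=7,
    len_b=2,
    entity_positions=[7, 8],  # the two mentions in seq B
    keyword_positions=[2],    # "induced" in seq A
)
```

Every seq A row is identical, and so is every seq B row, which matches the description of a shared mask vector per sequence.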

Training Setting
The proposed model and keyword collection procedure were implemented using Python 3.6.10 and TensorFlow 1.13. Experiments were conducted on an Nvidia V100. On the CDR training set, each epoch took about 14 min (for Kw-BioBERT); on the additional training data, the running time per epoch was about 182 min. Two hyperparameters were tuned, namely the number of transformer layers and the number of training epochs, with a learning rate of $3 \times 10^{-5}$. The results are reported in the following sections.

Keywords
As described in Section 3.5, for each positive instance, when masked and fed into the BERT model, the top three tokens that caused the largest confidence drop were recorded and considered as candidate keywords. The process was applied to the CDR training set, and three token sets that store the top-three relation-suggestive tokens were obtained. Figure 4 selectively displays the seed keywords discovered in the process, sorted by frequency in decreasing order. Subfigures (a), (b), and (c) correspond to the tokens resulting in the largest, second-largest, and third-largest confidence drop. In our experiment, a total of 1736 candidate keywords were identified. After a round of manual selection, 235 tokens remained to form the seed keyword set, which was further expanded to a keyword set of 943 tokens by adding synonyms found through WordNet [45].

Performance Metric
Due to the imbalanced class distribution, accuracy is not adequate, because it may drive the learning algorithm to classify all instances into the majority class. For our case, this could yield numerous false negatives, meaning that the positive CID relations are not detected. Thus, F1 is adopted as the main performance metric for model evaluation since F1 is superior to accuracy in the case of imbalanced class distribution. We also report precision (Pre), which reflects the number of false alarms, and recall (Rec), which reflects the number of missed CID relations. Intuitively, the higher the precision, the fewer the false alarms; also, the higher the recall, the fewer the missed CID relations. In addition, the Pre-Rec gap should be monitored: a large gap means that the model optimizes a single metric rather than both, which should be avoided. With the given true positives (TP), false positives (FP), and false negatives (FN), Pre, Rec, and F1 are defined as

$$\text{Pre} = \frac{TP}{TP + FP}, \qquad \text{Rec} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Pre} \cdot \text{Rec}}{\text{Pre} + \text{Rec}}.$$
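For reference, the three metrics can be computed from confusion counts as in the sketch below, which uses the standard definitions of precision, recall, and F1. The function name and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute (Pre, Rec, F1) from true positives, false positives, false negatives."""
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return pre, rec, f1


# A model that labels everything as negative finds no CID relation:
all_negative = precision_recall_f1(tp=0, fp=0, fn=10)

# A balanced case with equal numbers of false alarms and misses:
balanced = precision_recall_f1(tp=8, fp=2, fn=2)
```

The first call shows why F1 is preferred here: a classifier that predicts only the majority class can score high accuracy on imbalanced data while its F1 on the positive class is zero.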

Benchmark
We selected the following models that were evaluated on the CDR test set. The performance results of these models are quoted from the original papers.

• Gu et al. [10] adopted a maximum entropy (ME) model and a CNN model for inter- and intra-sentence RE, respectively.
• Bi-affine Relation Attention Network (BRAN) [40] consists of a stack of modified transformers, a head and tail MLP, and a bi-affine operation to output a 3D tensor that models pairwise token relations.
• Sahu et al. [5] employed a GCNN to model entities and their relations in a document and demonstrated that GCNN can capture both local and non-local dependency, which helps to boost performance.
• Christopoulou et al. [7] proposed an edge-oriented graph (EoG) neural model to learn intra- and inter-sentence relations via multi-instance learning.
• Nan et al. [39] developed a latent structure refinement strategy that allows reasoning across sentences and automated latent graph construction.
• Sousa et al. [12] proposed to utilize external domain-specific ontologies to enhance the performance of biomedical RE. The proposed system, named BiOnt, injects additional knowledge into the model, leading to a performance gain.
• Wang et al. [6] developed a graph-based neural model named GLRE that encodes and aggregates global and local entity and relation representation for document-level RE.
• Zeng et al. [25] designed a neural architecture named SIRE that can represent intra- and inter-sentential relations. In addition, SIRE features a novel logical reasoning module that covers more reasoning chains compared to the prior efforts. SIRE posts the highest F1 on the CDR dataset among all of the investigated studies; thus, SIRE represents the SOTA.

Key Design Choices
Two hyperparameters are tuned.
• For the number of transformer layers, 2, 4, 6, 8, 10, and 12 layers were tested. Each experiment ran for five epochs. As shown in Table 3, the performance of Kw-BioBERT with twelve transformer layers was the best, with an F1 of 75.8%.
• For the training epochs, we reported training and test performance with one through five epochs. Since BioBERT has been pre-trained, the effort of fine-tuning on a medium-sized dataset can be greatly reduced. In our experiments, the test F1 started to stabilize after the first epoch and reached a peak at the third epoch, with an F1 of 76.4%, as shown in Table 4. A gap between training and test F1 is also observed, indicating overfitting, which can usually be addressed by increasing the training data.

Table 5 shows the result of an ablation study, in which four models are evaluated: BERT, Kw-BERT, BioBERT, and Kw-BioBERT. There are two primary observations. First, adding a keyword attention layer to the base models brought a performance gain of about two points, with a 2.1-point gain (in F1) on BERT and a 1.9-point gain on BioBERT. Second, switching from BERT to BioBERT brought a gain of around eight points, as seen by comparing BERT vs. BioBERT and Kw-BERT vs. Kw-BioBERT. This gain is surprising but explainable since BioBERT is pretrained on corpora in the biomedical domain at a large scale; thus, BioBERT can better encode and represent the semantic meaning of PubMed abstracts. This experiment also validates the efficacy of the proposed method of keyword-attentive knowledge infusion, which nicely complements the pretrained language models. Essentially, the utilization of BioBERT and keywords can both be regarded as knowledge infusion, but at two levels. BioBERT receives domain knowledge by pretraining, which is self-supervised; the keyword-attentive layer, on the other hand, injects task-specific knowledge (i.e., relation-suggestive tokens and semantic interaction between bio-entities) during training, which is supervised.

Comparison with the Benchmarks
We present the performance of the benchmarks and the proposed Kw-BioBERT in Table 6 on the CDR test set. We observed that Kw-BioBERT outperformed the SOTA, namely SIRE, by 5.6% in F1. When trained on additional data (denoted by model + data in the last two rows of the table), our method posted an F1 of 80.8%, outperforming BRAN by 14.6%. The latter has been used as a credible baseline in many prior studies. In addition, the Pre-Rec gap of our model is only two points, which is smaller than that of other benchmarks listed in the table, e.g., GLRE (7.1%), EoG (3.1%), and BRAN (15.2%), further demonstrating the superiority of our model, which seeks to optimize both Pre and Rec.

Table 7 reports an overhead comparison between Kw-BioBERT and BioBERT in terms of the training and inference speed, both in examples per second (ex/s). BioBERT posted an average speed of 11 ex/s during training, and Kw-BioBERT was almost two times faster, with a speed of 20.5 ex/s. During inference, the speeds of BioBERT and Kw-BioBERT were 66.5 and 72.6 ex/s, respectively. The increase of speed brought by Kw-BioBERT is mainly due to the attention layer added into BERT and replacing the Nth standard transformer encoder. In other words, the original design of self-attention performs pair-wise attention between every pair of tokens, whereas the proposed keyword-masked self-attention only concerns the attention (1) from tokens in $\text{seq}_A$ to the two entity mentions in $\text{seq}_B$ and (2) from the two mentions in $\text{seq}_B$ to the keywords in $\text{seq}_A$, greatly reducing the attention calculations.
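The reduction in attended token pairs can be illustrated with simple counting. The sketch below compares the number of pairs in full self-attention with the number of pairs allowed by the keyword mask in a single layer. It counts allowed pairs only and does not measure runtime; whether masked pairs are actually skipped depends on the implementation. The sequence length and keyword count are made-up example values, and `attention_pair_counts` is a hypothetical helper.

```python
from typing import Tuple


def attention_pair_counts(len_a: int, n_entities: int, n_keywords: int) -> Tuple[int, int]:
    """Return (full, masked) counts of attended token pairs for one layer.

    full:   every token attends to every token, l * l with l = len_a + n_entities
    masked: seq A tokens attend to the entity mentions in seq B, and the
            entity mentions attend to the keywords in seq A
    """
    l = len_a + n_entities
    full = l * l
    masked = len_a * n_entities + n_entities * n_keywords
    return full, masked


# Example values (assumed): 256 tokens in seq A, 2 mentions, 5 keywords.
full_pairs, masked_pairs = attention_pair_counts(len_a=256, n_entities=2, n_keywords=5)
```

For these example values, the mask leaves 522 of 66,564 pairs, which shows how sparse the keyword-masked attention pattern is relative to full self-attention.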

Conclusions
Document-level RE has been given increased research attention recently. A broad spectrum of deep models has been explored, including CNN, GNN, and transformer-based models. To address the challenge of distant dependency reasoning, there are two lines of efforts. The first category focuses on improving cross-sentence representation, and GNNs become an intuitive modeling choice due to their straightforward way of representing entities and relations as nodes and edges, facilitating long-range reasoning. The second line of studies, on the other hand, explores the utilization of external knowledge. Our work in this study belongs to the second category.
To verify the hypothesis that domain keywords can improve the model performance for the document-level RE task, a keyword-attentive knowledge infusion strategy was proposed. We developed a custom process of domain keyword collection to identify and store the highly relation-suggestive tokens in a given document. By manipulating the attention masks, these keywords were injected into BERT to guide the learning algorithm to focus on the semantic interaction between the bio-entities linked by the keywords.
In addition, we adopted BioBERT, a BERT variant pretrained on over a million PubMed articles, for fine-tuning. Together, these efforts produced a model with superior performance, outperforming the SOTA by 5.6% on the CDR dataset. Thus, we conclude that the hypothesis of this study can be accepted and that the goal was achieved. The new-high F1 score indicates that the proposed Kw-BioBERT can serve as a credible benchmark on the CDR dataset for future research.
This study has the following limitations, which will be addressed in future work. First, the imbalanced sample distribution made it difficult to train an accurate model, which is a common issue for most RE datasets. We plan to adopt positive instance sampling or augmentation techniques to rebalance the samples in different classes. Second, it will be of interest to evaluate the proposed Kw-BioBERT on other document-level RE datasets, with more types of bio-entities and relations. In an ongoing study, a novel biomedical RE dataset is being developed, with five types of entities and six types of relations. The proposed Kw-BioBERT will be evaluated on this new dataset. Lastly, we studied neither the impact of keyword quality on the model performance nor the alternatives of domain knowledge infusion, which are worthy of further investigation.
Author Contributions: Conceptualization and methodology, X.Z., L.Z., J.D. and Z.X.; software, validation, and original draft preparation, X.Z., L.Z. and J.D.; review and editing, X.Z. and Z.X. All authors have read and agreed to the published version of the manuscript.