Prompt Learning with Structured Semantic Knowledge Makes Pre-Trained Language Models Better

Abstract: Pre-trained language models with structured semantic knowledge have demonstrated remarkable performance in a variety of downstream natural language processing tasks. The typical methods of integrating knowledge are designing different pre-training tasks and training from scratch, which requires high-end hardware, massive storage resources


Introduction
In recent years, pre-trained language models (PLMs) such as BERT [1], XLNet [2], and RoBERTa [3] have achieved promising results in many natural language processing (NLP) tasks [4][5][6]. Although no explicit syntactic rules and concepts are introduced, these models can perform well with extensive pre-training on large-scale unlabeled corpora in various self-supervised ways. Nevertheless, recent works illustrate that external semantic knowledge can improve downstream NLP tasks, including named entity recognition [7,8], relation extraction [9,10], and machine translation [11][12][13]. However, traditional approaches to introducing knowledge involve mostly training from scratch, which is time-consuming and computationally expensive, making it infeasible for most users. Recently, prompt learning has achieved promising results for certain few-shot classification tasks [14][15][16][17], and it can also be used to integrate knowledge.
The Xinhua Dictionary, the most authoritative and influential modern Chinese dictionary, contains massive and comprehensive content, such as word forms, pronunciation, precise definitions, and rich examples. In contrast to WordNet [18] and HowNet [19], the "senses" in the Xinhua Dictionary are more abundant, detailed, and fine-grained. It can offer strong and efficient support for language models to understand Chinese word semantics. In the Xinhua Dictionary, the "sense" is the meaning of the word [20], and a Chinese word can have more than one sense. An example from the Xinhua Dictionary is shown in Table 1. The "sense" is composed of a long string of tokens, but the typical methods of prompt learning accept one token as the answer. Thus, the issue of how to properly use the wealth of long-answer information is a challenging problem.

Table 1. An example of a word and its senses and phrases in the Xinhua Dictionary.
Word | Sense | Phrase
order | ID:06029 the way in which people or things are placed or arranged in relation to each other | in alphabetical order; in chronological order; in descending/ascending order
order | ID:06030 the state that exists when people obey laws, rules or authority | keep the class in good order; maintain order in the capital; restore public order
order | ID:06031 a request for food or drinks in a restaurant; the food or drinks that you ask for | May I take your order?; an order for steak and fries; a side order

To address this challenge, we propose the long-answer prompt learning method (KLAPrompt), with three different long-answer strategies, and collect a word sense prediction dataset (WSP) based on the Xinhua Dictionary to introduce fine-grained semantic knowledge. According to the different forms of answers, they can be divided into three strategies: discrete answers, continuous answers, and sentence similarity. In the discrete answer strategy, instead of considering the long answer as a whole, we split the answer space into several answer subspaces according to the token's position in the long answer. For instance, the answer subspaces of "order" in Table 1 are {"the", "a"}, {"way", "state", "request"}, {"in", "that", "for"}, {"which", "exists", "food"}, ..., {"for"}. Then, we train pre-trained language models on the WSP dataset to predict the sense, and each word of the sense will be predicted independently. In the continuous answer strategy, we use virtual answer tokens, which can be optimized through gradient descent, to replace the natural language in discrete answers. Due to the many senses in the Xinhua Dictionary, we assign several virtual tokens to each sense and optimize the token embeddings for each sense together with prompt token embeddings.
In the sentence similarity strategy, we average the embeddings of the masked part in the masked language model's (MLM) output, calculate the cosine similarity between this and the sentence embedding of the original long answer, and then maximize the similarity during the training procedure.
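The position-wise split used by the discrete answer strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization of the English glosses from Table 1 (a Chinese PLM would use its own tokenizer), and the function name `build_answer_subspaces` is hypothetical.

```python
from typing import List, Set


def build_answer_subspaces(senses: List[str]) -> List[Set[str]]:
    """For each token position, collect the tokens that occur there in any sense."""
    # Assumption: whitespace tokenization, with trailing punctuation stripped.
    tokenized = [[tok.strip(",;") for tok in s.lower().split()] for s in senses]
    longest = max(len(tokens) for tokens in tokenized)
    subspaces: List[Set[str]] = [set() for _ in range(longest)]
    for tokens in tokenized:
        for position, token in enumerate(tokens):
            subspaces[position].add(token)
    return subspaces


# The three senses of "order" from Table 1.
ORDER_SENSES = [
    "the way in which people or things are placed or arranged in relation to each other",
    "the state that exists when people obey laws, rules or authority",
    "a request for food or drinks in a restaurant; the food or drinks that you ask for",
]

subspaces = build_answer_subspaces(ORDER_SENSES)
print(subspaces[:4])
```

Each masked position then only has to choose among the tokens in its own subspace, rather than among all 16,495 senses at once.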
Furthermore, we explore the effectiveness of the KLAPrompt method in the medical field. Firstly, we collect a disease and category prediction dataset (DCP) based on MedicalKG (https://github.com/zhihao-chen/QASystemOnMedicalKG, accessed on 28 April 2023), which contains specific disease knowledge such as descriptions, departments, symptoms, causes, prevention, checking items, recommended foods, and recommended drugs. Then, we apply the KLAPrompt method to introduce fine-grained disease knowledge into the pre-trained language models.
We conduct comprehensive experiments on five open-domain NLP datasets and five health-related datasets. Experimental results demonstrate that pre-trained language models achieve superior performance based on the strength of the semantic knowledge in the Xinhua Dictionary and the disease knowledge in MedicalKG. Empirical studies also verify that KLAPrompt with the discrete answer strategy is the best method to integrate structured semantic knowledge into pre-trained language models.
In a nutshell, the main contributions of our work are as follows.
(1) We introduce more abundant and fine-grained semantic knowledge from the Xinhua Dictionary and disease knowledge from MedicalKG into pre-trained language models, enhancing the models' ability to understand Chinese word semantics and medical science.
(2) We propose a novel long-answer prompt learning method (KLAPrompt), which provides a reasonable solution to two main challenges in answer engineering: (a) When there are many classes, how can we seek the proper answer space? (b) How can we decode the multi-token answers?
(3) Extensive experiments on five Chinese NLP datasets and five biomedical datasets demonstrate that the proposed method significantly empowers the widely adopted pre-trained language models. The empirical studies also confirm that KLAPrompt with the discrete answer strategy is the best method to integrate structured semantic knowledge.
(4) We generate a word sense prediction dataset (WSP) based on the Xinhua Dictionary, which is available at https://github.com/Xie-Zuotong/WSP (accessed on 28 April 2023). We also collect a disease and category prediction dataset (DCP) based on MedicalKG, which is available at https://github.com/Xie-Zuotong/DCP (accessed on 28 April 2023).
The rest of the paper is organized as follows. Section 2 briefly reviews the existing methods of integrating semantic knowledge, previous approaches in prompt learning, and several biomedical pre-trained language models. In Section 3, we introduce the KLAPrompt approach and its application in the medical field. Section 4 shows the experimental results on five Chinese NLP datasets. Section 5 presents experimental studies on five biomedical datasets. Finally, the conclusions of this research are drawn in Section 6.

Semantic Knowledge
Semantic knowledge includes the meanings of words, phrases, and sentences, examining how meaning is encoded in a language. It has been extensively used in various natural language processing tasks [21][22][23][24]. ERNIE [25] has improved BERT's masking strategy to integrate entity information into the knowledge graph. In Chinese, an entity or phrase is composed of several Chinese words. If only a single word is masked, the model can easily predict the masked content from the context alone, without paying attention to the composition of phrases and entities or to the syntactic and semantic information in sentences. Therefore, ERNIE masks all tokens that compose a whole phrase or entity at the same time. However, a phrase in ERNIE usually consists of two or three tokens. When the number of consecutive masked tokens exceeds twenty, the model becomes difficult to train and its performance declines. KnowBERT [26] integrates WordNet [18] and a subset of Wikipedia into BERT and uses the knowledge attention and recontextualization mechanism to explicitly model entity spans in the input text. SenseBERT [27] adds a masked-word sense prediction task as an additional task to learn the "sense" knowledge in WordNet. WordNet lexicographers organize all word senses into 45 supersense categories. Hence, SenseBERT predicts not only the masked words but also their supersenses during pre-training. Both KnowBERT and SenseBERT introduce WordNet into BERT, but, compared with the Xinhua Dictionary or Oxford Dictionary, the supersenses in WordNet are relatively limited, and the word meanings are coarse-grained. Furthermore, most of these methods require training from scratch, which is time-consuming and computationally expensive, making them infeasible for most users.

Prompt Learning
Prompt learning is based on the language model used to calculate the probability of text [28]. Unlike adapting pre-trained language models to downstream tasks through objective engineering, prompt learning utilizes additional textual prompts to make downstream tasks resemble those solved during the original language model training. Radford et al. [29] illustrate that language models can learn NLP tasks without direct supervision, and prompt learning has gradually become the most popular research direction in natural language processing. Prompt learning includes prompt engineering and answer engineering. For discrete prompts, Brown et al. [30] manually created prefix prompts to deal with diverse natural language processing tasks. For continuous prompts, P-tuning [14] proposes prompts learned by inserting trainable variables into the embedded input. A recent work [15] manually designed the constrained answer spaces for named entity recognition tasks. However, there are still two challenges in answer engineering: (a) When there are many classes, how can we seek the proper answer space? (b) How can we decode the multi-token answers?
BioBERT is the first model to utilize continuous pre-training on biomedical domain corpora. BlueBERT was pre-trained on PubMed abstracts and MIMIC-III clinical notes, and then evaluated on the Biomedical Language Understanding Evaluation (BLUE) benchmark. ClinicalBERT uses clinical notes, including lab values and medications, instead of plaintext data based on BERT. Moreover, PubMedBERT learns model weights from scratch via a large-scale training corpus.
While great efforts have been made to build English biomedical PLMs, only a few studies discuss building biomedical PLMs in Chinese, such as MC-BERT [36], MedBERT (https://github.com/trueto/medbert, accessed on 15 April 2023), and SMedBERT [37], all derived from a general-domain BERT, with the latter two further developed in a knowledge-enhanced manner. MC-BERT proposes entity masking and phrase masking strategies in a coarse-grained context to learn the medical word representations from a medical corpus, while neglecting the internal relations of medical entities. The MedBERT model was pre-trained on 650 million Chinese clinical natural language texts. SMedBERT proposes the mention-neighbour hybrid attention to learn heterogeneous entity information, which infuses the semantic representations of entity types into the homogeneous neighboring entity structure.

Methodology
In this section, we introduce the KLAPrompt approach and its application in the medical field. There are two steps in our KLAPrompt method: prompt engineering and answer engineering. Thus, we elaborate on our method from these two aspects.

Prompt Engineering
Prompt engineering, also known as template engineering, aims to design a prompting function that results in the most effective performance in a downstream task. There are two main categories of prompts: discrete prompts and continuous prompts. In this section, we construct these prompts for the word sense prediction dataset (WSP) to introduce the semantic knowledge from the Xinhua Dictionary.

Discrete Prompts
Discrete prompts are composed of the alignment of words in natural language. The most common method to create discrete prompts is to manually create crafted templates to handle different tasks. A template is a textual string with two slots: an input slot [X] for input x and an answer slot [Y]. For example, in the case of sentiment analysis, where x = "I love this movie", the template may take a form such as "[X] Overall, it was a [Y] movie". Then, the discrete prompt would become "I love this movie. Overall, it was a [Y] movie".
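The slot filling described above can be sketched in a few lines. This is an illustration only: the helper name `fill_template` is hypothetical, and replacing the answer slot [Y] with a "[MASK]" token follows the usual BERT convention rather than anything stated in the text.

```python
def fill_template(template: str, x: str, mask_token: str = "[MASK]") -> str:
    """Put the input text into slot [X] and a mask token into answer slot [Y]."""
    return template.replace("[X]", x).replace("[Y]", mask_token)


# The sentiment analysis template from the example above.
TEMPLATE = "[X] Overall, it was a [Y] movie"
prompt = fill_template(TEMPLATE, "I love this movie.")
print(prompt)
```

The language model is then asked to fill the masked position, and the predicted answer word is mapped back to a class label.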
The word sense prediction dataset (WSP) contains one word, phrase, sense, and sentence for each example. For discrete prompts, we first copy the word [C] mentioned in the sentence [X], and then add a few natural language words followed by the sense [Y] that the model will predict. The number of [X] slots and the number of [Y] slots can be flexibly changed according to the needs of the task at hand. The detailed composition of [Y] will be described in Section 3.2. The templates are crafted manually. An example of a discrete prompt is given in Figure 1, where [C] is "order", [Y] is "The state that exists when people obey laws, rules or authority", and [X] is "The police try to restore public order".

Figure 1. Illustration of KLAPrompt with discrete answer strategy. There are two main categories of prompts: discrete prompts and continuous prompts. In continuous prompts, we use auxiliary virtual tokens [P_1], [P_2], ..., [P_l] to replace natural language words. In the discrete answer strategy, we split the whole answer space into several answer subspaces according to the token's position in the long answer.

Continuous Prompts
In many cases, these template words are not necessarily composed of natural language tokens; they could be virtual words that are embedded in a continuous space later and optimized through gradient descent.
For continuous prompts in WSP, we use auxiliary virtual tokens [P_1], [P_2], ..., [P_l] in the vocabulary of the pre-trained language model, where l is a predefined hyper-parameter. This method performs prompting directly in the embedding space of the model. An example of a continuous prompt is given in Figure 1. In the complete prompt T(x), T(·) is the template for the WSP dataset, [C] is the word mentioned in the input sentence x, [P_i] is the i-th virtual token, [X] is the input slot for sentence x, and [Y] is the answer slot for sense y. The embedding of each prompt token is randomly initialized and optimized during training.
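As a rough sketch, a continuous prompt can be assembled as a token sequence in which the natural language template words are replaced by l virtual tokens. The slot order used here ([X], then [C], then the virtual tokens, then the masked answer slot) is an assumption made for illustration, as are the function name `build_continuous_prompt` and the token spellings "[P_i]" and "[MASK]".

```python
from typing import List


def build_continuous_prompt(word: str, sentence: str, l: int, n_answer: int) -> List[str]:
    """Assemble [X] [C] [P_1] ... [P_l] [MASK] x n_answer (hypothetical slot order)."""
    virtual_tokens = [f"[P_{i}]" for i in range(1, l + 1)]
    answer_slot = ["[MASK]"] * n_answer  # one masked position per answer token
    return [sentence, word] + virtual_tokens + answer_slot


tokens = build_continuous_prompt(
    word="order",
    sentence="The police try to restore public order",
    l=3,
    n_answer=11,
)
print(tokens)
```

In a real model the "[P_i]" entries would be looked up in a trainable embedding table, so only their vectors, not any natural language wording, are tuned by gradient descent.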

Answer Engineering
Unlike prompt engineering, which discovers suitable prompts, answer engineering tries to seek a proper answer space and a map to the original output that brings about an effectual predictive model. There are two main challenges in answer engineering: (a) when there are too many classes, the selection of an appropriate answer space becomes a difficult combinatorial optimization problem; (b) when using multi-token answers, the issue of how to best decode multiple tokens using PLMs remains unresolved [28]. In this section, we propose three independent and different long-answer prompt learning strategies for the word sense prediction dataset (WSP) to integrate the "sense" knowledge from the Xinhua Dictionary.

Discrete Answers
In prompt learning, for each class y ∈ Y, the mapping function φ(·) maps it to the answer φ(y) ∈ V, where V is the answer space. It is easy to find the appropriate answer space and the mapping function when the classes are limited and all the answers consist of a single token. Unfortunately, the WSP dataset has a massive number of classes (it includes 7390 words and 16,495 senses; each word has one to thirteen senses), and the answers are sometimes quite long. Take the word "order" as an example: Figure 1 shows its template and label word set.

In a conventional supervised learning system for natural language processing, we take an input x ∈ X and predict an output y ∈ Y based on the language model p(y|x). As the template may contain multiple [MASK] tokens, we must consider all masked positions to make predictions (Equation (3)), where n is the number of masked positions in T(x), and φ_j(y) maps the class y to the set of label words V_[MASK]_j for the j-th masked position [MASK]_j. Equation (3) can be used to tune PLMs and classify classes.
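To illustrate how predictions over several masked positions can be combined into one class decision, the sketch below scores each candidate sense by the sum of the log-probabilities of its tokens, one per masked position, which treats the positions as independent. The probabilities are toy values invented for the example, the senses are truncated to two tokens for brevity, and the helper names are hypothetical.

```python
import math
from typing import Dict, List


def sense_log_score(position_probs: List[Dict[str, float]], sense_tokens: List[str]) -> float:
    """Sum of log p([MASK]_j = token_j) over all masked positions j."""
    return sum(math.log(position_probs[j][tok]) for j, tok in enumerate(sense_tokens))


def predict_sense(position_probs: List[Dict[str, float]], candidates: Dict[str, List[str]]) -> str:
    """Return the sense ID whose tokens are jointly most probable."""
    return max(candidates, key=lambda sid: sense_log_score(position_probs, candidates[sid]))


# Toy per-position distributions over the answer subspaces (made-up numbers).
position_probs = [
    {"the": 0.7, "a": 0.3},
    {"way": 0.2, "state": 0.5, "request": 0.3},
]
# Candidate senses of "order", truncated to their first two tokens.
candidates = {
    "06029": ["the", "way"],
    "06030": ["the", "state"],
    "06031": ["a", "request"],
}

best = predict_sense(position_probs, candidates)
print(best)
```

Working in log space avoids numerical underflow when a sense spans many masked positions.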
With the pre-trained language model predicting the masked tokens, the loss function of KLAPrompt is given by
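A form consistent with the definitions above, stated here as an assumption rather than as the authors' exact equations, factorizes the class probability over the n masked positions and takes the average negative log-likelihood over the training set as the loss:

```latex
p(y \mid x) = \prod_{j=1}^{n} p\big([\mathrm{MASK}]_j = \phi_j(y) \mid T(x)\big)

\mathcal{L} = -\frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log \prod_{j=1}^{n} p\big([\mathrm{MASK}]_j = \phi_j(y) \mid T(x)\big)
```

Here y denotes the gold class of x, and the symbols n, φ_j, and T(x) are used as defined in the text.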

Continuous Answers
In contrast to discrete answers, continuous answers use virtual answer tokens optimized directly in the embedding space. WARP [38] utilizes a virtual token for each class label and optimizes the token embedding for each class together with prompt token embeddings.
However, for the word sense prediction dataset (WSP), we do not have 16,495 unused virtual tokens in the vocabulary for 16,495 classes. Thus, we design the answer space according to the sense ID in the Xinhua Dictionary. The continuous answers consist of five virtual tokens, and each token belongs to the answer space Q = {..., [unused99]} in the vocabulary of the pre-trained language model, which is embedded in a continuous space and optimized through gradient descent. In Figure 2, the sense ID is "06030", and the true answer is the corresponding sequence of five virtual tokens. In this way, each class has a different set of virtual tokens. We can also use more virtual tokens and adopt different strategies.
In our continuous answer method, the "sense" answer is w = {w_1, w_2, w_3, w_4, w_5}, where w_j = φ_j(y) ∈ Q ⊂ V, and φ_j(y) maps the class y to the set of label words V_[MASK]_j for the j-th masked position [MASK]_j. As we attempt to obtain the predictions of the masked tokens, the objective is similar to Equation (4).

Figure 2. In the continuous answer strategy, we use virtual answer tokens, which can be optimized through gradient descent, to replace the natural language in discrete answers.
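One way to give every sense its own sequence of five virtual tokens is to derive the sequence from the five-digit sense ID. The sketch below is a hypothetical scheme for illustration only, not necessarily the one the authors use: it assumes unused vocabulary entries are spelled "[unusedK]" and maps each digit d to "[unused{d+1}]".

```python
from typing import List


def sense_id_to_virtual_answer(sense_id: str) -> List[str]:
    """Map a five-digit sense ID to five virtual answer tokens, one per digit."""
    if len(sense_id) != 5 or not sense_id.isdigit():
        raise ValueError("expected a five-digit sense ID such as '06030'")
    # Hypothetical mapping: digit d -> "[unused{d+1}]".
    return [f"[unused{int(d) + 1}]" for d in sense_id]


answer = sense_id_to_virtual_answer("06030")
print(answer)
```

Because sense IDs are unique, two different senses never receive the same token sequence, even though the individual tokens are shared across senses.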

Sentence Similarity
Different from the above two methods, we propose another approach to dealing with the problem whereby the masked language model (MLM) is unable to predict the whole long answer at once.
We average the embeddings of the masked part and the original true answer to obtain their sentence embedding. Thus, in this method, the answer space is a sentence embedding space. Then, we maximize the cosine similarity between these two sentence embeddings to make the predicted tokens as similar to the original answer tokens as possible.
When calculating the similarity, the predicted answer and the true answer share the same MLM head to ensure that the two sentence embeddings are generated by the same model. An example of this method is shown in Figure 3.

Figure 3. Illustration of KLAPrompt with sentence similarity strategy. In the sentence similarity strategy, we average the embeddings of the masked part and the original true answer to obtain their sentence embedding. Then, we maximize the cosine similarity between these two sentence embeddings.
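The averaging and similarity computation can be sketched without any deep learning framework. In this illustration, embeddings are plain lists of floats, and the loss is taken to be one minus the cosine similarity, which is an assumption: the text only states that the similarity is maximized. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(embeddings: List[Vector]) -> Vector:
    """Average token embeddings dimension by dimension into one sentence embedding."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_loss(masked_embeddings: List[Vector], answer_embeddings: List[Vector]) -> float:
    """Assumed loss: minimizing 1 - cos maximizes the cosine similarity."""
    return 1.0 - cosine_similarity(mean_pool(masked_embeddings), mean_pool(answer_embeddings))


pooled = mean_pool([[1.0, 3.0], [3.0, 5.0]])
print(pooled, similarity_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

Since both sequences are pooled to a single vector, the masked part and the true answer do not need to be compared token by token.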

KLAPrompt's Application in the Medical Field
We also explore the effectiveness of the KLAPrompt method in the medical field. Firstly, we collect a disease and category prediction dataset (DCP) based on MedicalKG (https://github.com/zhihao-chen/QASystemOnMedicalKG accessed on 3 April 2023), which contains specific disease knowledge such as descriptions, departments, symptoms, causes, prevention, checking items, recommended foods, and recommended drugs. Then, we design the continuous prompt for the DCP dataset. An example of a continuous prompt and disease knowledge in MedicalKG is shown in Figure 4.  The continuous prompt for the DCP dataset can be formalized as follows: Finally, we apply the answer space partitioning strategy to the masked language model (MLM) to predict the disease and category simultaneously.
In prompt learning, for each class y ∈ Y, the mapping function φ(·) maps it to the answer φ(y) ∈ V, where V is the answer space. Take the disease "Angina pectoris" and category "description" as an example: Figure 5 shows the template and label word set. As the template may contain multiple [MASK] tokens, we must consider all masked positions to make predictions, where m + n is the number of masked positions in T(x), and φ_j(y) maps the answer y to the set of label words V_[MASK]_j for the j-th masked position [MASK]_j. The loss function is given by

Experiments
In this section, we present the details of implementation and conduct experiments on five open-domain NLP datasets and five health-related datasets to evaluate the efficiency and effectiveness of our approach.
Book Review. The Book Review dataset [40] is collected from Douban, a Chinese online review website that provides information about books, movies, and music. It is a one-sentence text classification dataset.
XNLI. In our experiment, only the Chinese part of the Cross-Language Natural Language Inference (XNLI) [41] dataset is retained. In XNLI, the model should read the two sentences and determine whether the relationship between them is "Entailment", "Contradiction", or "Neutral".
Chnsenticorp. Chnsenticorp [40] is a sentiment analysis dataset that contains 12,000 hotel reviews. In total, 6000 reviews are positive, and the other 6000 reviews are negative.
IFLYTEK. The IFLYTEK [42] dataset has more than 17,000 long texts containing application descriptions, including various application topics related to daily life, with a total of 119 categories.
The datasets above contain 8.05 K, 40.0 K, 40.0 K, 12.0 K, and 17.3 K samples, respectively. We follow the evaluation metrics and settings used in [40,42].

Biomedical Domain
We utilize five biomedical datasets over three different tasks to evaluate our method. The CMeIE, CHIP-CDN, CHIP-CTC, KUAKE-QQR, and KUAKE-QTR datasets are proposed in [43].
CMeIE. The task of Chinese Medical Information Extraction (CMeIE) is to identify medical entities from complex medical text data and determine the relationships between such medical entities. This dataset includes 518 pediatric diseases and 109 common diseases, providing rich resources for medical-related natural language processing research. CMeIE contains nearly 75,000 triplet data, 28,000 sentences about disease descriptions, and 53 different schemas.
CHIP-CDN. The goal of the Clinical Diagnosis Normalization (CHIP-CDN) task is to find a unified and comparable standard for different terms. Based on standardized terminology, researchers can effectively conduct statistical analysis and obtain more accurate results. CHIP-CDN is a semantic matching task. It provides 2500 standardized surgical data to improve task performance.

CHIP-CTC. Recruiting clinical trial subjects requires careful comparison and strict screening. Due to the complex and time-consuming recruitment process, many clinical trials cannot proceed as planned, and a large number of participants withdraw in the middle of the experiment, which seriously affects the effectiveness of the experiment. The Clinical Trial Criterion (CHIP-CTC) task is based on clinical trial screening standards to classify subjects in clinical trials.
KUAKE-QQR. The Query-Query Relevance (KUAKE-QQR) task mainly evaluates the degree of matching between two query topics. It aims to determine whether the meaning has shifted between Query-A and Query-B, and to what degree. Calculating the correlation between two query terms is an important task that can optimize the search quality of long-tail queries.
KUAKE-QTR. The Query-Title Relevance (KUAKE-QTR) task mainly evaluates the degree of matching between the query topic and the title topic, which is related to the accuracy of the search results. This task requires determining whether the Query topic and Title topic are consistent.
The datasets above contain 22.4 K, 18.2 K, 40.6 K, 18.2 K, and 32.6 K samples, respectively. We follow the evaluation metrics used in [43].

Implementation Details
KLAPrompt is based on pre-trained language models. In the open domain, we choose BERT [1], RoBERTa [3], and MacBERT [44] as our basic models. In the biomedical domain, we choose BERT [1], MacBERT [44], MC-BERT [36], MedBERT (https://github.com/trueto/medbert, accessed on 15 April 2023), and SMedBERT [37] as our baselines. Each of these models has 12 layers, a hidden size of 768, 12 attention heads, and 110 M parameters. The models are optimized with the Adam optimizer [45], with an initial learning rate of 1 × 10^-5. The training batch size is 64. Each model is trained for 10 epochs and evaluated on the validation set after every epoch. All experiments are carried out using a single NVIDIA GeForce RTX 3090 24 GB card.

Results on Open Domain
The experimental results on the development set of five Chinese natural language processing datasets are presented in Table 2. We show each original model and the model trained with the KLAPrompt method (e.g., "BERT" and "BERT + KLAPrompt"). We find that all pre-trained language models trained with the KLAPrompt method have achieved significant improvements compared to the original PLMs. For the STS-B, Book Review, and XNLI datasets, "RoBERTa + KLAPrompt" increases the final results by 2.14%, 2.04%, and 2.32%, respectively. Moreover, for the IFLYTEK dataset, the method still raises the accuracy by more than 1%. This performance shows that infusing external semantic knowledge via the KLAPrompt approach can empower the widely adopted pre-trained language models.

Table 2. Experimental results of baseline methods and our method on five datasets (Acc.%). "+ KLAPrompt" indicates that we train the PLMs with the KLAPrompt method via semantic knowledge infusion training before fine-tuning.

In our proposed KLAPrompt, five components may affect the performance: the discrete prompt, continuous prompt, discrete answer, continuous answer, and sentence similarity. To explore such effects, we conduct an ablation experiment using the XNLI dataset. The experimental results are presented in Table 3.

We first compare BERT with BERT + WSP † to showcase the advantages of external semantic knowledge in the WSP dataset. BERT + WSP † is trained on the WSP dataset with its original masked language model (MLM), and it does not use the KLAPrompt method. The experimental results demonstrate that introducing semantic information from the Xinhua Dictionary can consistently improve language modeling and downstream tasks.
Then, we compare "BERT + WSP †" with "BERT + Discrete Prompt †" and "BERT + Continuous Prompt †". In order to control the variables for comparison, in this group of experiments, both "BERT + Discrete Prompt †" and "BERT + Continuous Prompt †" use BERT's original masked language model (MLM) without using long-answer strategies. We find that both discrete and continuous prompts can improve the performance of the model.

Table 3. Ablation study on XNLI dataset (Acc.%). "+ WSP" indicates that we train BERT on the WSP dataset without the KLAPrompt approach. † indicates that we train these models on the WSP dataset before fine-tuning.

We further compare "BERT + Continuous Prompt + Discrete Answer †", "BERT + Continuous Prompt + Continuous Answer †", and "BERT + Continuous Prompt + Sentence Similarity †". Both the discrete answer and continuous answer use answer space partitioning strategies. Among these three long-answer strategies, the discrete answer with the answer space partitioning strategy results in the best performance.

In Table 4, we offer a detailed comparison between different discrete prompts and continuous prompts. We find that continuous prompts outperform discrete prompts in every dataset. However, it is not possible to consider all possible discrete prompts because the manually crafted prompts are complicated and infinite. Thus, in most cases, we can directly utilize the continuous prompts in prompt learning.
The hyper-parameter l is the number of virtual tokens in continuous prompts. To explore its impact on the performance of our prompt learning methods, we test them with different values of the hyper-parameter, l ∈ {1, 2, 3, 4, 5}. As shown in Table 4 and Figure 6, different tasks reach their best performance with different values of l. Thus, the hyper-parameter l needs to be tuned according to the downstream NLP task.
We also investigate the consistency of the improvements with different percentages of downstream training data. The experiment results in Figure 7 illustrate that the improvement is more obvious when the amount of data is smaller. In other words, prompt learning with semantic knowledge can benefit data-scarce downstream tasks because, when the training data are limited, the task depends on the pre-trained language model and the additional semantic knowledge.

Discussion
In view of the comparison results on five open-domain datasets, namely STS-B (https://github.com/pluto-junzeng/CNSD accessed on 15 April 2023), Book Review [40], XNLI [41], Chnsenticorp [40], and IFLYTEK [42], it is clear that our KLAPrompt method can achieve superior performance in open-domain downstream tasks based on the strength of the fine-grained semantic knowledge. In order to further investigate the effectiveness of the KLAPrompt method in the medical field, we conduct extensive experiments on five biomedical domain datasets: CMeIE, CHIP-CDN, CHIP-CTC, KUAKE-QQR, and KUAKE-QTR [43].

Results on Biomedical Domain
The experimental results on the development set of five Chinese biomedical datasets are presented in Table 5. We find that all pre-trained language models trained with the KLAPrompt method have achieved significant improvements compared to the original PLMs. MacBERT, MC-BERT, and SMedBERT trained with the KLAPrompt method improve the final results on the CMeIE dataset by 2.75%, 2.73%, and 2.12%, respectively. Moreover, for most datasets, the method increases the results by more than 1%. This performance shows the effectiveness of the KLAPrompt method in the medical field.
In KLAPrompt, two other components may affect the performance: disease prediction and category prediction. To explore such effects, we conduct an ablation experiment using the CMeIE dataset. We first compare "SMedBERT" with "SMedBERT + DCP †" to showcase the advantages of external disease knowledge in the DCP dataset. "SMedBERT + DCP †" is trained on the DCP dataset with its original masked language model (MLM), and it does not use the KLAPrompt method. The experimental results demonstrate that introducing disease information from MedicalKG can consistently improve language modeling and downstream tasks.
Then, we explore the effects of the two components. "-Disease Prediction" indicates that the disease prediction component is removed during training. As shown in Table 6, both components can improve the performance on this dataset. In addition, we observe the worst result when we remove the disease prediction, which shows that disease prediction is more effective than category prediction.

Conclusions
In this work, we propose a long-answer prompt learning method (KLAPrompt) with three different long-answer strategies to introduce fine-grained semantic knowledge from the Xinhua Dictionary. According to the different forms of answers, they can be divided into three strategies: discrete answers, continuous answers, and sentence similarity. In the discrete answer strategy, we split the answer space into several answer subspaces according to the token's position in the long answer and predict each word of the sense independently. In the continuous answer strategy, we use virtual tokens to replace the natural language in discrete answers. These virtual tokens can be embedded in a continuous space and optimized through gradient descent. In the sentence similarity strategy, we average the embeddings of the masked part in the MLM output, calculate the cosine similarity between this and the sentence embedding of the original long answer, and then maximize the similarity during the training procedure. Furthermore, we explore the effectiveness of the KLAPrompt method in the medical field. We apply the KLAPrompt method to introduce fine-grained disease knowledge from MedicalKG into pre-trained language models. We also collect a word sense prediction dataset (WSP) based on the Xinhua Dictionary and a disease and category prediction dataset (DCP) based on MedicalKG.
Experimental results on five open-domain datasets demonstrate that all pre-trained language models trained with the KLAPrompt method have achieved significant improvements compared to the original PLMs. The superior performance proves that the infusion of external semantic knowledge from the Xinhua Dictionary can empower the widely adopted pre-trained language models. We also find that both discrete and continuous prompts can improve the performance of the model, and continuous prompts outperform discrete prompts in all datasets. Thus, in most cases, we can directly utilize continuous prompts in prompt learning. Among the three long-answer strategies, the discrete answer strategy is the best method to integrate structured semantic knowledge. Moreover, we investigate the consistency of the improvements with different percentages of downstream training data. The experimental results illustrate that the improvement is more obvious when the amount of data is smaller. In other words, prompt learning with semantic knowledge can benefit data-scarce downstream tasks.
Additionally, we conduct comprehensive experiments on five health-related datasets to explore the effectiveness of our KLAPrompt method in the medical field. Extensive experiments verify that pre-trained language models with the KLAPrompt method can also achieve superior performance based on the strength of the disease knowledge in MedicalKG. Then, we explore the effects of two components: disease prediction and category prediction. We observe the worst result when we remove the disease prediction, which shows that disease prediction is more effective than category prediction.
In our future work, we will add some virtual tokens on the left of the predicted word [C], rather than only on one side. We will also infuse common-sense information, domain-specific information, and knowledge graphs into the pre-trained language models. If incorrect semantic content is provided, the outcome of pre-training may also be misleading, which may significantly influence the outcome of language processing. Thus, we will delve deeper into this problem in our future work.

Conflicts of Interest:
The authors declare no conflict of interest.