Taxonomy-Aware Prototypical Network for Few-Shot Relation Extraction

Abstract: Relation extraction aims to predict the relation between the head entity and the tail entity in a given text. A large body of work adopts meta-learning to address the few-shot issue faced by relation extraction, where each relation category contains only a few labeled instances for demonstration. Despite the promising results achieved by existing meta-learning methods, they still struggle to distinguish subtle differences between relations with similar expressions. We argue that this is largely because these methods cannot capture unbiased and discriminative features in the very-few-shot scenario. To alleviate these problems, we propose a taxonomy-aware prototypical network, which consists of a category-aware calibration module and a task-aware training strategy module. The former implicitly and explicitly calibrates the prototype representations to become sufficiently unbiased and discriminative. The latter balances the weight between easy and hard instances, which enables our proposal to focus on more informative data during the training stage. Comprehensive experiments are conducted on four typical meta tasks, and our proposal outperforms the competitive baselines with an improvement of 3.30% in terms of average accuracy.


Introduction
Relation extraction (RE) is designed to extract the relation between two entities in a given text [1], and has been widely applied in downstream tasks of Natural Language Processing, e.g., knowledge base population and question answering [2]. Traditional deep neural network methods [3] for RE are typically challenged by the need to gather large amounts of high-quality annotated data, which is expensive and laborious. Therefore, few-shot relation extraction is more feasible for realistic applications [4], and meta-learning methods have been proposed to address this low-resource dilemma [5]. The core of meta-learning (ML) is to optimize methods via diverse meta-tasks, each with several labeled instances, so that the methods can rapidly learn to identify new relations from only a few instances. Figure 1 illustrates a two-way one-shot example of few-shot RE.
These ML approaches can be broadly classified into three categories, namely model-, optimization- and metric-based ML methods [6]. As a popular solution, metric-based ML methods focus on designing a metric function that measures the distance between instances in the query set and the categories (illustrated with a few instances) appearing in the support set. The prototypical network [7], a simple and effective metric-based ML method, approximately represents each category via a prototype, which is obtained by averaging the embeddings of the instances that belong to the class. A great deal of work is devoted to improving the representation of prototypes, e.g., Gao et al. [8] modify the representation of prototypes by highlighting the crucial instances and features, and Wen et al. [5] integrate the transformer model into prototypical networks for greater expressiveness. In addition, some recent works utilize external knowledge to provide more clues for the representation of prototypes, e.g., Qu et al. [9] optimize the posterior distribution of a prototype using a global relation graph as the initial prior of the prototype, Yang et al. [10] employ the text descriptions of relations and entities to enhance the representations of a prototype, and Yang et al. [11] fuse entity concepts to constrain the representations of a prototype. However, there are two main limitations to these methods. First, the prototype representations of the above models are biased and insufficiently discriminative in few-shot scenarios, which restricts the performance of these ML methods. Additionally, these improved methods usually design complex structures and introduce excessive parameters, which increases the computational burden and easily leads to overfitting in the few-shot setting.
Second, current ML methods treat all training instances equally [6] or pay more attention to very hard instances [12,13], which prevents these methods from extracting useful information from the training instances. Intuitively, on the one hand, tasks that are overly simple provide no valuable information; on the other hand, even humans can only extract critical information from moderately hard instances and struggle with very hard instances, let alone neural network models.
With the aim of alleviating the above problems, we propose a taxonomy-aware prototypical network (TAPN), consisting of two modules: a category-aware calibration module and a task-aware training strategy module. Specifically, the category-aware calibration module leverages the relation description to explicitly calibrate the prototype distribution in order to obtain unbiased representations, and applies prototype-aware contrastive learning to implicitly calibrate the prototype representations to be more discriminative. The task-aware training strategy module leverages the task-aware difficulty to balance the weights of easy and hard instances, and dynamically adapts to different meta tasks.
We evaluate our proposal on four classic meta tasks, and the experimental results indicate that TAPN is markedly superior to the baselines. Additionally, an ablation study further validates the effectiveness of the two modules, and an error analysis helps explain TAPN's good performance.
Our major contributions can be summarized as follows: (1) To the best of our knowledge, we are the first to explicitly and implicitly calibrate the prototype representation simultaneously without introducing extra or even harmful parameters. (2) We design a category-aware calibration module that makes the prototype representation unbiased and more discriminative via the relation description and prototype-aware contrastive learning, respectively. (3) We propose a task-aware training strategy module to extract beneficial knowledge by exploring hard tasks and hard instances. (4) The experimental findings confirm the validity of our model in terms of accuracy against the competing baselines.
The remainder of the paper is organized as follows: We review the related work for few-shot RE in Section 2, detail our approach in Section 3, design our experiments in Section 4, analyze the results of our proposal in Section 5 and conclude our work in Section 6.

Relation Extraction
RE is designed to determine the relations between entities in a given sentence. Most traditional RE models extract relations under supervised settings [14], and can be classified into three categories: neural-, kernel- and feature-based methods [1].
Typically, the aforementioned approaches work well when numerous labeled data are available. However, it is time-consuming [9] and impractical to collect such massive annotated data in some professional domains. We therefore focus on extracting relation triples in the few-shot scenario.

Few-Shot Relation Extraction
Meta-learning methods [28] have been extensively applied to few-shot RE. The ML models are trained on various meta-tasks with a few instances as demonstrations, and can then be generalized to new meta tasks. In general, these ML methods are divided into three categories [29]: metric-, optimization- and model-based methods [17].
Model-based methods [30] emphasize designing the architecture of the model to address the few-shot task. To be specific, MANN [31] designs a memory-enhanced neural network to quickly absorb new data and proposes an effective strategy for accessing the external memory, which provides the ability to quickly predict new relations. Optimization-based methods [32,33] try to initialize the parameters well. For instance, Finn et al. [32] optimize parameters with few training data so that they can be adapted to novel tasks with a limited number of gradient descent steps. Metric-based approaches focus on learning a metric function to determine the similarity between support sentences and query sentences. For instance, relation networks [34] learn a deep distance metric with a neural network instead of using the fixed Euclidean distance or dot product. The prototypical network [7] predicts relation labels by computing the similarity between query sentences and the prototype of each class, where a prototype is derived by averaging the representations of all the examples belonging to that class. In addition, a great deal of work aims to improve the prototypical network: Gao et al. [8] present hybrid attention-based prototypical networks to deal with the diversity and noise of text, and Han et al. [12] introduce external relation descriptions and combine global and local features as hybrid prototypes, which learn better representations by utilizing relation label information.
However, these improved prototypical networks almost always introduce extra parameters, e.g., the parameters of the attention mechanism, which require sufficient data for optimization; this is not realistic in the few-shot scenario. In addition, the prototype representations are usually biased and insufficiently discriminative. In this paper, in contrast to these methods, we calibrate the prototype representation without introducing any parameters beyond those of the vanilla prototypical network.

Contrastive Learning
Contrastive learning has achieved success in computer vision (CV) [35] by pulling positive instances together and pushing negative instances away simultaneously. Unlike in CV, where positives are produced by cropping, flipping, distortion and rotation, constructing positives for discrete text sequences is a critical problem, and a number of works are dedicated to solving it. For example, to design proper positives, Wu et al. [36] and Meng et al. [37] design word deletion, reordering and substitution techniques, Yan et al. [38] propose four data-augmentation techniques (adversarial attack, token shuffling, cutoff and dropout), Gao et al. [39] apply random dropout as noise for sentence text, and Jiang et al. [40] introduce different templates to express the same sentence text.
However, the aforementioned works usually construct positives and negatives at the instance level and ignore the connection between instances and categories. Motivated by this, we design a prototype-aware contrastive learning objective at the category level, which drives the representations of categories to become more discriminative.

Approaches
In this section, we present the details of our proposed technique. As shown in Figure 2, the structure of our proposal includes two modules: the category-aware calibration module and the task-aware training strategy module. In detail, the feature encoder first transforms the input sentences and relation descriptions into the corresponding embeddings. Next, the category-aware calibration module obtains unbiased and discriminative prototype representations from the embeddings of the sentences and relation descriptions. We then predict the label of each query sentence based on the similarity between this query sentence and all category prototypes. Finally, the task-aware training strategy module balances the weight between simple and hard data and thereby ensures propagation of the correct information. In the remainder of this section, we first describe the formulation of the problem in Section 3.1. We then detail the category-aware calibration module in Section 3.2 and the task-aware training strategy module in Section 3.3.

Task Definition
Relation Extraction. Given an $L_x$-word text $x$ with a head entity $e_h$ and a tail entity $e_t$, i.e., $x = \{w_1, \cdots, e_h, \cdots, e_t, \cdots, w_{L_x}\}$, the RE task can be formulated as training a model to predict the relation label $r$ between $e_h$ and $e_t$, where $r$ belongs to a pre-defined relation label set $R$. It is worth noting that an entity span may consist of multiple words.
Few-shot Relation Extraction. Few-shot relation extraction aims to identify emerging novel relation labels without sufficient labeled data. Therefore, the pre-defined relation label set $R$ is divided into the base categories $R_b$ and the novel categories $R_n$ for the training and test stages, respectively, where $R_b \cup R_n = R$ and $R_b \cap R_n = \emptyset$. This setting simulates the test environment. Next, a large number of meta tasks are constructed for few-shot RE. Specifically, a meta task $T$ consists of a support set $S$ and a query set $Q$: $T = (S, Q)$. Following the typical $N$-way $K$-shot setting of meta-learning, the support set $S = \{x^j_r ;\ r = 1, \cdots, N,\ j = 1, \cdots, K\}$ contains $N$ categories, each with $K$ labeled instances. The query set $Q$ includes the same $N$ relation categories as $S$. The few-shot RE methods are trained on meta tasks sampled from the base categories $R_b$ to learn general knowledge, and are tested on other meta tasks sampled from the novel categories $R_n$.
External Knowledge. The relation description $d_r = \{w_1, \cdots, w_{L_d}\}$ for each relation $r$ is also given, where $L_d$ denotes the word length of $d_r$.
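The N-way K-shot construction described above can be illustrated with a minimal sketch. The function name `sample_episode`, the `n_query` parameter and the toy relation labels are hypothetical and are not taken from the paper; the sketch only shows how a meta task T = (S, Q) with disjoint support and query instances might be drawn.

```python
import random

def sample_episode(data, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot meta task T = (S, Q).

    data: dict mapping a relation label to a list of instances.
    Returns (support, query), each a list of (instance, label) pairs.
    """
    # Pick N relation categories for this meta task.
    relations = rng.sample(sorted(data), n_way)
    support, query = [], []
    for r in relations:
        # Draw K support and n_query query instances without overlap.
        picked = rng.sample(data[r], k_shot + n_query)
        support += [(x, r) for x in picked[:k_shot]]
        query += [(x, r) for x in picked[k_shot:]]
    return support, query

# Toy data: hypothetical relation labels with placeholder sentences.
toy = {f"rel_{i}": [f"sent_{i}_{j}" for j in range(10)] for i in range(8)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=2,
                                rng=random.Random(0))
```

In training, `data` would hold only the base categories, and at test time only the novel categories, so the two label sets never overlap.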

Category-Aware Calibration Module
In this section, we first calibrate the representation of each category, and then predict the label of the query sentence by a metric function, which calculates the distance between the query sentence and these categories.

Feature Encoder
We employ $E$ to denote the text feature encoder and use BERT-base as $E$, as shown in Figure 3. We then obtain the contextual semantic representation $h_x$ of an instance $x$ via Equation (1). Additionally, we obtain the relation description representation $h_{d_r}$ for each relation $r$ via Equation (2), where $h_{d_r} \in \mathbb{R}^{2d}$, $E(d_r)[\mathrm{CLS}] \in \mathbb{R}^{d}$ denotes the embedding of the start token of the relation description text, and $E(d_r)[w_i] \in \mathbb{R}^{d}$ denotes the embedding of word $w_i$ in the relation description text.

Category Distribution Calibration
Following the vanilla prototypical network [7], we average all the instance embeddings in the support set for each relation to obtain the vanilla prototype:
$$c^x_r = \frac{1}{K} \sum_{j=1}^{K} h_{x^j_r}, \qquad (3)$$
where we treat the prototype $c^x_r$ as the representation of category $r$. However, $c^x_r$ is vulnerable to outliers in the few-shot scenario, where there are only very few instances for demonstration, leading to semantic distribution bias and discrimination bias. Fortunately, the relation description summarizes the semantic characteristics of a relation and elaborates its real meaning. Therefore, we leverage the relation description to calibrate the distribution of the corresponding category representation, yielding the calibrated prototype $c_r$ (Equation (4)). Compared to $c^x_r$, $c_r$ is an unbiased prototype that is much closer to the real distribution of relation $r$. In other words, we calibrate the category distribution without introducing additional parameters. We can then predict the category of a query instance with the metric function in Equation (5), where $p(y = r \mid q)$ is the probability that query $q$ belongs to relation $r$, $h_q$, obtained from Equation (1), is the representation of the query sentence, and $d(\cdot, \cdot)$ is the distance function based on the dot product. Subsequently, we apply the cross-entropy loss (Equation (6)) to optimize the prototype representation, where $I_r$ is an indicator function: $I_r = 1$ when query $q$ belongs to relation $r$, and $I_r = 0$ otherwise.
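The pipeline of this subsection (averaging support embeddings, calibrating with the relation description, scoring a query by dot product, and applying cross-entropy) can be sketched in NumPy as follows. Two points are assumptions made only for illustration and are not confirmed by the text: the calibration is taken to be a simple addition of the description embedding to the vanilla prototype, and the class probability is taken to be a softmax over dot-product scores. The random arrays stand in for encoder outputs.

```python
import numpy as np

def vanilla_prototypes(support):
    """support: array of shape (N, K, D). Returns the (N, D) class means."""
    return support.mean(axis=1)

def calibrate(prototypes, desc):
    """ASSUMED calibration rule: add the description embedding (N, D)."""
    return prototypes + desc

def predict_proba(query, prototypes):
    """ASSUMED metric: softmax over dot-product scores.

    query: (Q, D), prototypes: (N, D). Returns (Q, N) probabilities.
    """
    scores = query @ prototypes.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(proba, labels):
    """Mean negative log-probability of the true relation."""
    return float(-np.log(proba[np.arange(len(labels)), labels]).mean())

rng = np.random.default_rng(0)
N, K, D = 5, 1, 8                       # 5-way 1-shot, toy embedding size
support = rng.normal(size=(N, K, D))    # placeholder instance embeddings
desc = rng.normal(size=(N, D))          # placeholder description embeddings
query = rng.normal(size=(3, D))
labels = np.array([0, 2, 4])            # true relation index of each query

protos = calibrate(vanilla_prototypes(support), desc)
proba = predict_proba(query, protos)
loss = cross_entropy(proba, labels)
```

With K = 1 the vanilla prototype is just the single support embedding, which is the case where an outlier instance shifts the prototype the most and where a description-based correction has the most room to help.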

Category Discrimination Calibration
To address the semantic discrimination bias, we apply contrastive learning to discriminate the representations of instances for each relation. In detail, instances should be close to the prototype of their own category and far away from the other prototypes (Equation (7)), where $\tau$ is a temperature hyper-parameter and $h_{x^j_r}$ denotes the representation of the $j$-th instance of relation $r$. We thus obtain discriminative prototypes based on Equation (7).
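A prototype-level contrastive objective of this kind can be sketched as below. This is an InfoNCE-style formulation with cosine similarity, both of which are assumptions chosen for illustration; the exact form used by the authors may differ. The default temperature of 0.4 mirrors the value the paper reports for the 10-way 1-shot task.

```python
import numpy as np

def prototype_contrastive_loss(instances, labels, prototypes, tau=0.4):
    """Pull each instance toward its own prototype, away from the others.

    instances: (M, D), labels: (M,) integer class ids, prototypes: (N, D).
    ASSUMPTION: cosine similarity scaled by the temperature tau.
    """
    inst = instances / np.linalg.norm(instances, axis=1, keepdims=True)
    prot = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = inst @ prot.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())

# Toy check: instances that coincide with their prototypes.
prototypes = np.eye(3)
instances = np.eye(3)
aligned = prototype_contrastive_loss(instances, np.array([0, 1, 2]), prototypes)
shuffled = prototype_contrastive_loss(instances, np.array([1, 2, 0]), prototypes)
```

The loss is small when every instance sits next to its own prototype (`aligned`) and large when instances sit next to a different class prototype (`shuffled`), which is the behavior that pushes category representations apart.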

Task-Aware Training Strategy Module
For a sentence with true relation label $r$, we predict that it belongs to $r$ with a confidence of $p(y = r \mid q)$ by Equation (5). We define an easily classified sentence, for which $p(y = r \mid q) \to 1$, as a very simple instance. Conversely, a sentence that is extremely difficult to classify, i.e., one with very low confidence, is a very hard instance. The task-aware training strategy module is designed to optimize our proposal with moderately hard data.

Hard Instances
Intuitively, models benefit from focusing more on hard instances instead of treating all instances equally [13]. Therefore, we apply the focal loss function [13] to hard instances to modify the cross-entropy loss, yielding $L_{focal}$ (Equation (8)), where $\varphi > 0$ is a hyper-parameter [13] that reduces the relative loss contributed by very simple instances. Furthermore, Equation (8) reduces to the cross-entropy loss when $\varphi = 0$.
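The effect of the focal weighting can be seen in a small sketch. It uses the standard focal loss form, in which the log-loss of each query is scaled by one minus the true-class probability raised to the focusing exponent; writing it this way is an assumption, since the paper's own equation is not reproduced here. The probabilities are made-up values for an easy, a moderate and a hard query.

```python
import numpy as np

def focal_loss(p_true, phi):
    """Standard focal loss on the true-class probabilities.

    p_true: predicted probability of the true relation for each query.
    phi: focusing exponent; phi = 0 recovers plain cross-entropy.
    """
    p_true = np.asarray(p_true, dtype=float)
    return float((-((1.0 - p_true) ** phi) * np.log(p_true)).mean())

p = np.array([0.95, 0.60, 0.10])   # easy, moderate, hard (illustrative)
ce = focal_loss(p, phi=0.0)        # identical to cross-entropy
fl = focal_loss(p, phi=2.0)        # easy instances contribute far less
```

With a positive exponent the easy query (0.95) is scaled by 0.05 squared while the hard query (0.10) is scaled by 0.9 squared, so nearly all of the remaining loss comes from hard instances.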

Hard Meta Tasks
However, the harder the sentence, the higher the weight that $L_{focal}$ assigns to it, which may cause TAPN to fail to learn useful knowledge because $L_{focal}$ focuses excessively on very hard sentences. Therefore, we design an inverse focal loss function at the meta-task level, which pays less attention to very hard tasks consisting of sentences that are very hard to classify. We observe that the greater the inter-class similarity in a meta task, the harder this meta task becomes. We therefore use the inter-class similarity matrix $M \in \mathbb{R}^{N \times N}$ to measure the difficulty (hardness) of a meta task, in whose definition $\|\cdot\|$ denotes the Euclidean norm. Next, we use a scalar $m$ to quantify the hardness of a specific meta task in the current mini-batch, where $B$ is the batch size in the training stage, $m_b$ is the difficulty of the $b$-th meta task in the current batch, and $\|\cdot\|_F$ is the Frobenius norm. The task-aware loss $L_{uff}$ is then defined on this basis. $L_{uff}$ pays less attention to a hard meta task but focuses on hard instances within the meta task. Namely, $L_{uff}$ balances the weight between easy and hard data; therefore, it can learn useful knowledge from moderately hard data. Lastly, the final objective loss used to train TAPN is defined on top of these components.
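One way to turn inter-class similarity into a task-level hardness score is sketched below. Several choices here are assumptions made for illustration rather than the paper's definitions: the similarity matrix is taken to be pairwise cosine similarity between prototypes, self-similarity on the diagonal is ignored, hardness is normalized by the sum over the mini-batch, and the task weight is simply one minus the relative hardness.

```python
import numpy as np

def task_difficulty(prototypes):
    """Frobenius norm of the off-diagonal inter-class cosine similarities."""
    unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = unit @ unit.T            # inter-class similarity matrix, (N, N)
    np.fill_diagonal(sim, 0.0)     # ASSUMPTION: ignore self-similarity
    return float(np.linalg.norm(sim, ord="fro"))

def relative_hardness(difficulties, eps=1e-12):
    """ASSUMED normalization: share of each task in the batch total."""
    d = np.asarray(difficulties, dtype=float)
    return d / (d.sum() + eps)

# Three toy meta tasks with increasingly similar class prototypes.
easy = np.eye(3)
medium = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.5, 1.0]])
hard = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.0, 0.1]])

d_easy, d_medium, d_hard = (task_difficulty(t) for t in (easy, medium, hard))
m = relative_hardness([d_easy, d_medium, d_hard])
weights = 1.0 - m                  # ASSUMED: harder tasks get smaller weight
```

Multiplying each task's instance-level focal loss by such a weight would reduce the influence of tasks whose classes are nearly indistinguishable, while hard instances inside the remaining tasks are still emphasized.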

Experiments
In this section, we first discuss several research questions in Section 4.1. We then introduce the dataset and the baselines in Sections 4.2 and 4.3, respectively. Finally, we provide some implementation details in Section 4.4.

Research Questions
We design the following research questions to guide our experiments and examine the effectiveness of our proposal.

Datasets
We conduct our experiments on FewRel [4], which contains 64, 16 and 20 relations for training, validation and testing, respectively. Since the 20 test relations are not publicly released, we re-split the published relations into 50, 14 and 16 for training, validation and testing, respectively, following existing methods [10,11]. The statistics of FewRel are listed in Table 1, and the test relation descriptions are listed in Table 2.
Table 1. Statistics of FewRel. "#Rel" and "#Instance" denote the number of relations and instances, respectively. "Length" means the average token length of instances.

Id | Relation Name | Relation Description
- | follows | immediately prior item in a series of which the subject is a part.
P177 | crosses | obstacle (body of water, road, ...) which this bridge crosses over or this tunnel goes under.
P206 | located in or next to body of water | located in or next to "body of water", "sea, lake or river".
- | constellation | the area of the celestial sphere of which the subject is a part (from a scientific standpoint, not an astrological one).
P641 | sport | sport in which the subject participates or belongs to.
P921 | main subject | primary topic of a work.

Model Summary
We introduce two groups of competitive baselines for the few-shot RE task. We first describe the basic ML methods:
- Snail [41] applies temporal convolutions to aggregate information from past experience and designs a soft attention mechanism to pinpoint specific pieces of information.
- GNN [42] defines a graph neural network architecture to propagate label information from labeled data to unlabeled data.
- Siamese [43] uses two twin networks with shared weights to calculate the similarity of two inputs and then determines whether they belong to the same category.
- Proto [7] predicts the relation labels by calculating the similarity between the query sentences and the prototype of each category, which is obtained by averaging the representations of all instances belonging to that category.
- BERT-PAIR [44] concatenates the query instance with each supporting instance of a particular label to form a series of sequence pairs, and then calculates the similarity of each pair to predict the relation of the query instance.
We then list some improved prototypical networks that introduce external knowledge through carefully designed complex modules:
- KEFDA [45] designs a knowledge-enhanced prototypical network to conduct instance matching and a relation-meta learning network for implicit relation matching.
- ConceptFERE [11] develops a self-attention-based fusion module to incorporate sentence embeddings and entity concept embeddings, which is valuable for the relation classifier. We use the ConceptFERE (simple) version here to limit the computational overhead.
- HCRP [12] introduces external relation descriptions and combines global and local features as hybrid prototypes, which learn better representations by exploiting relation label information.
Finally, we present the model proposed in this paper:
- TAPN leverages the relation description to calibrate the prototype representation without introducing extra parameters and designs an effective training strategy to optimize the model.

Implementation Details
The model configurations are kept the same across all models discussed, including our proposal and the selected baselines. In detail, following [4,46], we assess performance on four classic meta-tasks: 5-way 1-shot, 5-way 5-shot, 10-way 1-shot and 10-way 5-shot. We apply BERT-base as the feature encoder and use Adam to optimize all the models. In addition, we follow the parameter settings of FewRel [4] and tune the other hyper-parameters by performing a grid search on the validation set. The parameter settings are presented in Table 3. It is worth noting that $\tau$ is set to 0.4 on the 10-way 1-shot meta task, as determined by this grid search.

Overall Evaluation
For answering RQ1, we assess the RE performance of TAPN along with eight competing baselines on four meta-tasks. The overall results in terms of accuracy are listed in Table 4.
Generally, for meta tasks with the same shot number, the performance of all models deteriorates as the number of relation categories (ways) increases. In addition, for meta tasks with the same way number, all the models achieve better performance as the shot number increases. This indicates that the difficulty of the relation extraction task increases as the number of shots decreases and the number of ways increases, which can be attributed to under-fitting on test tasks that suffer from a lack of data.
Table 4. Overall performance of our proposal and baselines in terms of average accuracy (%) on four typical meta-tasks. The results of the best baseline and the best performer in each column are underlined and boldfaced, respectively. Statistical significance of pairwise differences between the best baseline and our proposed TAPN is determined via a t-test (p < 0.05). † marks the results quoted from the original published papers.

Model | 5-Way 1-Shot | 5-Way 5-Shot | 10-Way 1-Shot | 10-Way 5-Shot | Avg
Subsequently, we focus on the baselines. For the first group of methods, BERT-PAIR achieves the best results owing to its carefully designed model structure. For the second group of baselines, which use external knowledge, most models achieve better performance than the first-group models. In addition, HCRP is the best baseline on all four meta tasks. This demonstrates that external knowledge provides rich information to alleviate the few-shot dilemma.
Next, we focus on the performance of our proposal on the four meta tasks. Generally, TAPN is superior to all baselines on all meta-tasks and gains a 3.30% improvement in average accuracy, which confirms the validity of TAPN. In detail, TAPN exhibits 1.38%, 4.16%, 2.98% and 4.68% improvements in accuracy over HCRP on the 5-way 5-shot, 5-way 1-shot, 10-way 5-shot and 10-way 1-shot meta-tasks, respectively, and the gain of our proposal grows as the way number increases and the shot number decreases. This demonstrates that TAPN can capture unbiased and discriminative features in the harsh few-shot scenario. In addition, we evaluate the precision of our proposed method and the state-of-the-art baseline HCRP in Table 5. We observe that our proposed method still outperforms HCRP, by 2% in terms of average precision.
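As a quick consistency check, the reported average improvement can be recomputed from the four per-task gains over HCRP quoted above. The snippet only repeats arithmetic on numbers stated in the text; the variable names are illustrative.

```python
# Per-task accuracy gains of TAPN over HCRP, as quoted in the text (in %).
gains = {
    "5-way 5-shot": 1.38,
    "5-way 1-shot": 4.16,
    "10-way 5-shot": 2.98,
    "10-way 1-shot": 4.68,
}
average_gain = sum(gains.values()) / len(gains)
print(f"{average_gain:.2f}")  # prints 3.30
```

The largest gains fall on the 1-shot tasks, in line with the observation that the improvement grows as the number of shots decreases.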

Sentence Length
As for RQ2, we study the influence of the sentence length $L_s$ on the behavior of all models. In detail, considering the distribution of the testing data, we divide the sentences into four groups, i.e., $L_s \in (0, 15)$, $[15, 30)$, $[30, 45)$ and $[45, +\infty)$. The results are plotted in Figure 4.
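The grouping by sentence length can be sketched as follows. The helper names and the toy inputs are hypothetical; only the four interval boundaries come from the text. Each sentence is assigned to the half-open interval that contains its token count, and accuracy is then computed per group.

```python
from bisect import bisect_right

EDGES = [15, 30, 45]
NAMES = ["(0, 15)", "[15, 30)", "[30, 45)", "[45, +inf)"]

def length_bucket(n_tokens):
    """Return the name of the length interval that contains n_tokens."""
    return NAMES[bisect_right(EDGES, n_tokens)]

def accuracy_by_length(lengths, correct):
    """Per-bucket accuracy; a bucket with no sentences maps to None."""
    totals = {name: [0, 0] for name in NAMES}   # [hits, count]
    for n, ok in zip(lengths, correct):
        entry = totals[length_bucket(n)]
        entry[0] += int(ok)
        entry[1] += 1
    return {name: (h / t if t else None) for name, (h, t) in totals.items()}

# Toy example: token counts and whether each prediction was correct.
acc = accuracy_by_length([10, 12, 20, 50], [True, True, False, True])
```

Using `bisect_right` places a sentence of exactly 15 tokens in the second group and one of exactly 45 tokens in the last, matching the closed lower bounds of the intervals.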
Generally, the performance of almost all models drops as the sentence length increases, which can be clearly observed in Figure 4c. This may be caused by the models failing to capture key information as the sentence length increases. In addition, long sentences are more likely to introduce noise.
Next, we compare the results of our proposed method against the baselines. We find that our proposal obtains the best results at every sentence length on all four meta tasks. We take the worst-performing 10-way 1-shot meta task as an example to analyze the results. Our proposal is less sensitive to the length of the input sentence than the baselines. For instance, whereas the best baseline HCRP degrades by 24.98% from 91.97% at $L_s \in (0, 15)$ to 66.99% at $L_s \in [45, +\infty)$, our proposed TAPN only decreases by 15.93% from 97.36% at $L_s \in (0, 15)$ to 82.01% at $L_s \in [45, +\infty)$. In addition, the improvement of TAPN consistently increases with the length of the input sentence, e.g., TAPN outperforms the best baseline HCRP by 5.96%, 13.25%, 14.64% and 15.02% at $L_s \in (0, 15)$, $[15, 30)$, $[30, 45)$ and $[45, +\infty)$, respectively. This demonstrates that our proposal can capture discriminative features to alleviate the noise caused by long and tedious sentences. Similar results can be observed for the 10-way 5-shot, 5-way 5-shot and 5-way 1-shot meta tasks.
Figure 4. Effect of sentence length on the performance of our proposed method and the baselines on four typical meta tasks: 5-way 1-shot, 5-way 5-shot, 10-way 1-shot and 10-way 5-shot.

Ablation Study
For RQ3, we perform an ablation study to understand the contribution of the various components of our proposal. In the ablation study, we replace or remove specific components to measure their influence on TAPN; each variant is marked with the notation "wo". Specifically, "wo/rel" and "wo/cons" denote removal of the category distribution calibration in Section 3.2.2 and the category discrimination calibration in Section 3.2.3, respectively. "wo/task" and "wo/instance" refer to removal of the hard meta-task component in Section 3.3.2 and the hard instance component in Section 3.3.1, respectively. It is worth noting that we only conduct the ablation study on the 5-way 1-shot and 5-way 5-shot meta tasks, given the high computation cost of the 10-way 1-shot and 10-way 5-shot meta tasks. The results are presented in Table 6.
As displayed in Table 6, the removal of any component leads to model degradation, demonstrating the efficacy of each component. Additionally, "wo/rel" leads to the biggest drop among the four variants, as marked in Table 6. The category distribution calibration therefore plays the most important role, which verifies that the vanilla prototype representation is vulnerable in the few-shot scenario and that this affects subsequent classification accuracy. Furthermore, the category distribution calibration makes the prototype representation unbiased and discriminative without introducing extra parameters.

Error Analysis
To answer RQ4, we first analyze the accuracy of each test relation, and then determine the error sources via error analysis.
First, we present the accuracy of the best baseline HCRP and our proposal on each test relation in Figure 5. Following Brody et al. [47], we use the parameters of the 10-way 5-shot model to evaluate the performance on test data by relation. Specifically, for each test relation, we randomly select 5 examples (that is, K = 5) and 50 examples of that relation and place them into the support set and query set, respectively. As displayed in Figure 5, the performance of our model is more stable than that of HCRP. The good overall performance of HCRP is driven by some easily distinguished relations, whereas it fails on some difficult relations, e.g., the accuracy of HCRP is under 40% on relations P206, P26 and P641. In contrast, our proposed method performs well across all relation categories, and its accuracy on every relation is over 40%.
Next, we conduct an error analysis on the relation "follows" to determine the error sources; the findings are summarized in Table 7. Generally, TAPN outperforms HCRP by 4% in accuracy on the relation "follows". On the one hand, TAPN eliminates some error sources, e.g., the relation "constellation". This may be attributed to the calibration based on the relation description, which shifts the prototype away from irrelevant relation categories. On the other hand, TAPN decreases the error probability on the relation "part of", indicating that TAPN can capture the discriminative features of each relation.

Conclusions and Future Work
In this paper, we propose a taxonomy-aware prototypical network to address few-shot relation extraction. Specifically, we design a category-aware calibration module that utilizes the relation description and contrastive learning to calibrate the prototype representation to become sufficiently unbiased and discriminative. Furthermore, we develop a task-aware training strategy module, which dynamically balances the weight of easy and hard tasks. In addition, we conduct extensive experiments on FewRel for four typical meta tasks.
The results demonstrate that our proposal exceeds the state-of-the-art baseline in average accuracy.
However, our proposal may be limited in addressing the cross-domain relation extraction task, where the testing and training data originate from different domains. Therefore, regarding future work, on the one hand, we plan to examine the generalization of TAPN in the cross-domain few-shot scenario [48]. On the other hand, we would like to introduce prompt learning for the true few-shot [49] scenario, where both training and validation data are scarce. For example, we can design a template to close the gap between relation extraction and the pre-trained language model, which can exploit the common knowledge learned by pre-trained language models.

Data Availability Statement:
The data presented in this study are available on request from the first author.

Conflicts of Interest:
The authors declare no conflict of interest.