Enhancing Targeted Minority Class Prediction in Sentence-Level Relation Extraction

Sentence-level relation extraction (RE) has a highly imbalanced data distribution: about 80% of the data are labeled as negative, i.e., no relation, and minority classes (MCs) exist among the positive labels; furthermore, some MC instances are incorrectly labeled. Due to these challenges, i.e., label noise and low source availability, most models fail to learn MCs and obtain zero or very low F1 scores on them. Previous studies, however, have focused on micro F1 scores, and MCs have not been addressed adequately. To tackle the high misclassification errors for MCs, we introduce (1) a minority class attention module (MCAM) and (2) effective augmentation methods specialized for RE. MCAM calculates confidence scores on MC instances to select reliable ones for augmentation and aggregates MC information during model training. Our experiments show that our methods achieve a state-of-the-art F1 score on TACRED while dramatically enhancing minority class F1 scores.


Introduction
Relation extraction (RE) is the task of identifying the semantic relation between two or more entities. For example, given the sentence "Sam[Entity1] was born in 1596[Entity2]", the target relation-type (class) between the entities would be person:date of birth.
In TACRED [1], a widely used supervised RE dataset, we found that some classes suffer from (1) label noise, i.e., errors in labels [2], and (2) low source availability, as shown in Table 1; we denote those classes as minority classes (MCs). Due to these problems, several neural network models failed to learn MCs and obtained zero or very low F1 scores on them. For example, our experimental results showed that the average F1 test scores on MCs of C-GCN [3], KnowBERT [4], and LUKE [5] were 0%, 0%, and 14.3%, respectively; the experimental results of [6] also confirmed the poor performance of 52 neural network models on MCs (details are provided in Appendix E).
Although there have been many studies that dealt with label noise or low source availability, few studies have been done to directly address MCs in RE.
As for label noise, first, manually annotated RE datasets, such as Semeval-2010-Task-8 [7], ACE 2005 (https://catalog.ldc.upenn.edu/LDC2006T06 (accessed on 25 June 2022)), and the FewRel dataset [8], have been regarded as relatively clean, and studies on these datasets have rarely considered the noise problem in their approach. However, a few researchers have recently pointed out the label noise problem in TACRED. Table 2 shows samples of the training dataset affected by label noise. Alt et al. [6] confirmed that the TACRED dev and test datasets were also corrupted; hence, they corrected the noisy instances and analyzed the error cases. Moreover, Stoica et al. [9] re-categorized relations in TACRED and re-annotated labels. Although those studies highlighted the label noise problem, they focused on the dataset itself and did not deal with learning under label noise.

Table 1. Top seven classes in the TACRED training dataset ordered by the level of label noise in descending order (a) and by the number of correct instances in ascending order (b). per and org are abbreviations of person and organization, Noise denotes the level of label noise for each class, calculated as #wrong labels / #instances, and Correct denotes the number of correct labels for each class. Noisy labels, i.e., wrong labels, are determined by the refined annotation [9]. The four classes marked in bold suffer from both label noise and low source availability, i.e., they are MCs. MC instances total 227 out of 68,124 training instances (0.33%), while the positive class with the most instances, person:title, has 2443 (3.6%).

In contrast, distant supervision for RE (DS-RE) inherently suffers from the label noise problem, and numerous studies have been conducted to solve it. Most of the existing studies adopted multi-instance learning and focused on alleviating bag-level noise using sentence-level attention [10][11][12][13] or used extra information about entities [14,15]. However, no unified validation dataset for DS-RE has been proposed. Most researchers have used held-out evaluation and depended on human evaluation, which involves manually checking a subset of test instances. To tackle this problem, Gao et al. [16] published manually annotated test sets for NYT10 [17] and Wiki20, the latter constructed using Wiki80 [8], a widely used DS-RE dataset. The study confirmed that previous models on NYT10 failed at MC prediction.
Next, as for low source availability, the imbalanced distribution is a widely acknowledged problem in the RE task [18][19][20]. Negative instances, i.e., no relation, far exceed the others. Moreover, even among the positive instances, the amount of clean MC data is minimal and insufficient for training a model. For example, the class with the most instances, person:title in TACRED, accounts for only 3.6% of the entire training dataset, and MCs are much smaller, as shown in Table 1. Some studies have tackled label sparsity in RE by adopting data augmentation [21][22][23]. However, Xu et al. [21] simply reversed the dependency path of the head and tail entities to prevent overfitting. Eyal et al. [23] validated the efficacy of their approaches only on a subset of the dataset under certain scenarios. Papanikolaou et al. [22] focused on the data generation itself and required exhaustively fine-tuning separate models for each class. As for data augmentation in general, several studies have proposed masked language modeling (MLM) based data generation [24,25] for text classification. However, these methods do not transfer to RE because they cannot guarantee class invariance between entities, and RE labels are often corrupted.
In this paper, we tackle the MC problem in RE and introduce (1) a minority class attention module (MCAM) with class-specific reference sentences (Refs), and (2) augmentation methods particularized to RE. We applied our methods to TACRED.
A Ref is a description that narrates the definition of the keywords in the MC relation-type. Take the relation-type organization:dissolved, for example: its Ref is constructed using the definitions of organization and dissolve. We adopted only one Ref per targeted MC, which differs from previous studies that unselectively used external knowledge for all classes. The vector of a Ref can be seen as an MC label representation. In MCAM, it is used to identify clean instances of the corresponding MC and to construct the vector that represents MC information. In detail, MCAM calculates a reliability score by comparing the input sentence of an MC instance with its corresponding Ref, where the Refs serve as criteria for distinguishing clean instances of each MC. Based on this score, reliable samples are selected for augmentation, and additionally, the vector of MC information is constructed. Our experiments show that the proposed methods achieved a state-of-the-art (SOTA) F1 score on TACRED, as well as dramatically enhanced MC F1 scores.
In brief, the main contributions of this study are as follows:
• We propose MCAM, which identifies noisy instances and improves MC prediction by constructing vectors that represent MC information.
• We propose simple yet effective data generation methods particularized to RE that coordinate with MCAM and minimize the risk of relation-type change.
• Experimental results demonstrate the efficacy of the proposed approaches, which enhance both the overall model performance and MC prediction and are robust to spurious association.

Related Work
Distant supervision (DS [26]) inherently has a label noise problem, and numerous approaches have been proposed to tackle it. DS involves automatic data labeling based on the assumption that if two entities are related in a knowledge base (KB), the relation may hold in all sentences where these entities are found. Although DS is an effective method for generating abundant training instances using openly available KBs (e.g., Yago, Freebase, DBpedia, Wikidata), the training instances inevitably contain significant label noise. To alleviate this problem, Riedel et al. [17] and Hoffmann et al. [27] relaxed the assumption and used the multi-instance learning (MIL) [28] framework, which was originally proposed to solve tasks with ambiguous samples. For example, Riedel et al. [17] used the expressed-at-least-once assumption: among the sentences mentioning the same entity pair, at least one sentence exists in which the predefined relation between the entities holds. Moreover, under MIL, sentences mentioning the same entities were merged into a bag for each triple (relation, entity1, entity2).
Based on MIL, several researchers in DS-RE have focused on reducing bag-level noise, mainly by using an attention mechanism [10][11][12][13]. For example, Lin et al. [10] used sentence-level attention, assigned a different weight to each sentence in the same bag, and aggregated the informative representations of the sentences into the bag representation. Yuan et al. [12] used sentence-level attention, captured the correlation among the relations, and integrated the relevant sentence bags into a super-bag to minimize bag-level noise. In addition to the attention mechanism, some studies used extra knowledge from KBs to enrich the entity and label representations to clarify the relation between entities [14,15]. For example, Ji et al. [14] used entity descriptions for the entity embedding, and Hu et al. [15] used entity descriptions for label embedding and a bag representation robust to noisy instances. However, in real-world settings, entities are infinite and the descriptions in KBs are limited; hence, such methods are rarely applicable. Moreover, a model depending on entity information is prone to so-called shallow heuristics (i.e., leveraging spurious association); consequently, it is likely to fail to generalize on challenging samples [29,30]. In contrast, our approaches use Refs as criteria for determining clean MC instances, which are separated from noisy ones, and adopt only one Ref for each MC relation-type, independent of the potentially infinite set of entities. Moreover, this study differs from previous studies in that we selectively used external knowledge for the targeted classes only.
Regarding alleviating the imbalanced distribution and solving low source availability, very few studies have applied data augmentation to RE. The reason is probably the difficulty of maintaining relation-type invariance. Papanikolaou et al. [22] fine-tuned GPT-2 on each relation-type and generated an augmentation dataset, which is not applicable to RE tasks with many relation-types. Xu et al. [21] augmented the dataset by changing the order of the dependency path of the head and tail entities. However, that study mainly focused on preventing overfitting, not on handling the imbalanced distribution. As for generating synthetic data, several studies have proposed MLM-based approaches [24,25]. Nevertheless, they did not consider label noise and did not guarantee relation-type invariance. Unlike previous studies, we introduce a method for generating synthetic data particularized to RE that is not exhaustive and is independent of label corruption: it relies on a bidirectional transformer-based architecture and leaves the target entities unchanged, i.e., preserving the relation-type.

Task Formulation
Given a sentence $S_i = \{t_1, t_2, \ldots, t_j\}$, where $t_j$ is the $j$-th token in the sentence $S_i$, the goal of RE is to predict the relation-type in a predefined label set $Y$ between [Entity1] ($e_1$) and [Entity2] ($e_2$); our goal is to improve MC recognition. Let $M = \{c_i\}_{i=1}^{n}$ denote the MC set, where $c_i \in Y$ is one of the MCs.

Input Sentence Representation
As for $S_i$, special tokens (<s>, </s>) were added at the beginning and end of the sentence; two selected tokens (@, #) were used as entity indicators and added at the beginning and end of the entities [31,32]. The encoder of the pretrained model produces contextualized representation vectors:

$$H^{S_i} = \mathrm{Encoder}(S_i),$$

where $H^{S_i}_{t_j} \in \mathbb{R}^d$ is the representation vector of token $t_j$ in the sentence $S_i$ and $d$ is the embedding dimension of the encoder. The representation vector of sentence $S_i$ for the task is obtained by aggregating the representation vectors of the first token of each entity indicator:

$$V^{S_i}_{main} = W_q \left[ H^{S_i}_{@}; H^{S_i}_{\#} \right],$$

where $V^{S_i}_{main}$ denotes the representation vector of $S_i$, $[;]$ indicates concatenation, and $W_q \in \mathbb{R}^{d \times 2d}$. We utilize the attention mechanism [33]; $V^{S_i}_{main}$ is used as a query vector for calculating the reliability score as shown in Equations (4) and (8).
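As a concrete illustration, the marker scheme above can be sketched as follows; this is a minimal, token-level sketch with a hypothetical helper name, not the actual tokenizer-level implementation of [31,32]:

```python
def mark_entities(tokens, e1_span, e2_span):
    """Insert '@' around entity 1 and '#' around entity 2 (half-open spans),
    then wrap the sentence with <s> ... </s>, as in the input representation."""
    out = []
    for i, tok in enumerate(tokens):
        if i == e1_span[0]:
            out.append("@")      # opening indicator of entity 1
        if i == e2_span[0]:
            out.append("#")      # opening indicator of entity 2
        out.append(tok)
        if i == e1_span[1] - 1:
            out.append("@")      # closing indicator of entity 1
        if i == e2_span[1] - 1:
            out.append("#")      # closing indicator of entity 2
    return ["<s>"] + out + ["</s>"]
```

The hidden states at the two opening indicators (@, #) are the ones concatenated to form the sentence representation.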

Reference Sentence Representation
We used relation-type descriptions as Refs $D = \{D_{c_1}, \ldots, D_{c_n} \mid c_i \in M\}$ for each MC relation-type $c_i$ to set the criteria for determining clean MC instances. Each $c_i$ has exactly one Ref $D_{c_i}$, which is composed of relation-type $c_i$'s keywords and their definitions. The word definitions were obtained from Wiktionary (https://www.wiktionary.org (accessed on 25 June 2022)) and Wordnet (https://wordnet.princeton.edu (accessed on 25 June 2022)), which are both open-source and publicly available.
We selected the best matching definition; however, in case a definition was too short or inadequately described the relation-type, we concatenated more than one definition with a comma (,). The entire Refs we used are provided in Appendix D.
The representation vector of $D_{c_i}$ is the contextualized embedding vector of the special token (<s>) in $D_{c_i}$:

$$H^{D_{c_i}} = \mathrm{Encoder}(D_{c_i}),$$

where $t_1 = $ <s> and, accordingly, $H^{D_{c_i}}_{<s>}$ is the representation vector of $D_{c_i}$, i.e., the label representation of $c_i$.

Methods
In this section, we describe the proposed approach in detail. Figure 1 shows the overall architecture of the model. Our approach involves three steps: (1) training the model with MCAM and attention guidance (Section 4.1), (2) filtering noisy labels and selecting reliable MC instances for augmentation according to the reliability score (Section 4.2), and (3) additional training with the augmented MC dataset.

MCAM and Classification
As shown in Figure 1, MCAM performs a series of processes related to MCs, mainly using the attention mechanism: (1) calculating the attention scores over the Refs, and (2) constructing a vector of MC information. Here we describe how MCAM works.

Attention Mechanism
We adopted an attention mechanism to identify noisy data and, moreover, to provide the model with the vector of MC information, utilizing the concept of query, keys, and values: the query ($q$) corresponds to the representation vector of sentence $S_i$, and the keys ($K$) and values ($V$) correspond to projections of the representation vectors of the Refs $D$:

$$K_{c_i} = W_k H^{D_{c_i}}_{<s>}, \qquad V_{c_i} = W_v H^{D_{c_i}}_{<s>},$$

where $W_k \in \mathbb{R}^{d \times d}$, $W_v \in \mathbb{R}^{d \times d}$, and $K_{c_i}$ and $V_{c_i}$ are the key and value vectors of $D_{c_i}$, respectively.
The representation vector of aggregated MC information, $V_{MC}$, can be seen as the vector of MC information, which is formulated as

$$V_{MC} = \sum_{c_i \in M} \alpha_{c_i} V_{c_i},$$

where $\alpha_{c_i}$ is the attention score of the input sentence over $D_{c_i}$:

$$\alpha_{c_i} = q^{\top} K_{c_i}.$$

As for $\alpha_{c_i}$, Softmax is not applied because it reduces the attention weights to probabilities and limits the expressibility of the vectors to which the attention weights are applied [34]. Since $\alpha_{c_i}$ is obtained by comparing the representation vector of an input sentence with that of a reference sentence, i.e., the label representation, we used $|\alpha_{c_i}|$ as a reliability score on instances of $c_i$ to determine the noisy data in the process of selective augmentation (Section 4.2).
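A minimal NumPy sketch of this scoring, assuming a plain dot-product score between the query and each key (the exact scoring function and any scaling are assumptions):

```python
import numpy as np

def mc_attention(q, H_refs, W_k, W_v):
    """Unnormalized attention of a sentence vector q over MC Ref vectors.
    No softmax is applied, so |alpha| keeps its magnitude and can serve
    as a reliability score. Shapes: q (d,), H_refs (n, d), W_k/W_v (d, d)."""
    K = H_refs @ W_k.T            # key vectors, one row per Ref, (n, d)
    V = H_refs @ W_v.T            # value vectors, (n, d)
    alpha = K @ q                 # raw (unnormalized) attention scores, (n,)
    V_MC = alpha @ V              # aggregated MC information vector, (d,)
    return alpha, V_MC, np.abs(alpha)
```

Keeping the raw scores (no softmax) is what lets |alpha| act as an absolute reliability measure rather than a relative probability.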

Classification
The model output vector $O$ is obtained by adding the MC information to the query $q$ as follows:

$$O = q + g \cdot V_{MC},$$

where $g \in (-1, 1)$ denotes the gate unit that regulates the flow of MC information:

$$g = \tanh(W_g V_{MC}),$$

where $W_g \in \mathbb{R}^{1 \times d}$. Given $S_i$ and $D$, to compute the probability of each relation-type, the projection of the output vector is fed into a softmax layer as shown below:

$$P(r \mid S_i, D; \theta) = \mathrm{Softmax}(W_o O)_r,$$

where $P(r \mid \cdot\,; \theta)$ is the prediction probability of relation-type $r \in Y$ of a model parameterized by $\theta$, $W_o \in \mathbb{R}^{L \times d}$, and $L$ is the total number of relation-types. Accordingly, given $N$ samples, the cross-entropy loss function $\mathcal{L}_{clf}$ can be formulated as

$$\mathcal{L}_{clf} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid S_i, D; \theta),$$

where $y_i$ is the annotated label of $S_i$.
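The gated combination and classification step can be sketched as follows; the tanh gate is an assumption consistent with $g \in (-1, 1)$, not necessarily the exact form used:

```python
import numpy as np

def classify(q, V_MC, W_g, W_o):
    """Gate the MC information into the query and score relation types.
    q, V_MC: (d,); W_g: (1, d); W_o: (L, d). Returns (L,) probabilities."""
    g = np.tanh(W_g @ V_MC)        # scalar gate in (-1, 1), assumed tanh
    O = q + g * V_MC               # model output vector
    logits = W_o @ O               # one score per relation type
    e = np.exp(logits - logits.max())
    return e / e.sum()             # softmax over relation types
```

The gate lets the model suppress the MC signal on sentences that do not resemble any Ref, so majority-class prediction is not disturbed.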

Attention Guidance
Attention guidance trains the model to connect each Ref with its corresponding MC. Without explicit guidance, it is hard for the model to match the plain text of a Ref to the corresponding MC. To solve this problem, we trained the classifier to predict each MC using the corresponding Ref alone (i.e., without an input sentence) through the following loss function $\mathcal{L}_{ref}$, which enables us to directly incorporate the label information of MC $c_i$ into $V_{c_i}$:

$$\mathcal{L}_{ref} = -\sum_{c_i \in M} \log P(c_i \mid D_{c_i}; \theta).$$

As shown in Equation (14), it differs from Equation (11) in that Equation (14) does not use $S_i$ and the entire Refs $D$, but instead uses only one Ref, $D_{c_i}$. An illustrative example is provided in Appendix C.

Self Attention Guidance
In addition to attention guidance, we utilized self attention guidance to obtain more accurate attention scores which are used to determine the noisy data.
It is inspired by the study of [35], which minimizes the prediction score of the ground-truth class after a pixel-level segmentation mask is applied to the specific area that obtains an attention score higher than a predefined threshold. This approach encourages the model to learn that the masked area is important for predicting the corresponding class and to extract more complete attention maps. We modified this method and adapted it to our model for instances belonging to M.
The processes are as follows: (1) given $y = k$ ($k \in M$), flip the sign of the attention weights on $V$ in Equation (6) and calculate the output vector:

$$\tilde{O} = q - g \cdot V_{MC},$$

and (2) minimize the corresponding prediction score, denoted as $\mathcal{L}_{flip}$:

$$\mathcal{L}_{flip} = \sum_{i:\, y_i \in M} P(y_i \mid \tilde{O}_i; \theta).$$

Therefore, our objective function is $\mathcal{L} = \mathcal{L}_{clf} + \mathcal{L}_{ref} + \mathcal{L}_{flip}$.

Selective Data Augmentation
As illustrated in Figure 2, we selected the reliable instances of MCs according to the following procedure: (1) arranging the MC instances in descending order of the reliability score on the corresponding Ref, (2) selecting the top m% of instances, i.e., the reliable instances, (3) generating synthetic data and re-calculating the reliability scores on them, and (4) taking a subset of the synthetic data into the training dataset based on those scores. In step (4), the size of the augmentation is a hyper-parameter, and illustrative experiments are provided in Section 6.2. In step (2), we determined m% by estimating the level of valid annotation of relation-type $c_k$, denoted $\rho_{c_k}$; then $1 - \rho_{c_k}$ represents the level of label noise. $\rho_{c_k}$ is derived by counting the instances aligning with the corresponding Ref $D_{c_k}$:

$$\rho_{c_k} = \frac{1}{|N(c_k)|} \sum_{i \in N(c_k)} \mathbb{1}\!\left[\arg\max_{c_j \in M} |\alpha_{c_j}(S_i)| = c_k\right],$$

where $N(c_k)$ is the index set of $c_k$ instances, $|\alpha_{c_j}(S_i)|$ is the absolute value of the attention score of sentence $S_i$ over $D_{c_j}$, and $\mathbb{1}[\cdot]$ is the indicator function that equals 1 when, given $y_i = c_k$, the highest-scoring Ref is $D_{c_k}$, and 0 otherwise. We averaged $\rho_{c_k}$ over the MCs (i.e., $\frac{1}{|M|}\sum_{c_k \in M} \rho_{c_k}$) to determine the proportion of reliable instances per MC.
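The selection step can be sketched as below; this is a simplified sketch in which `scores_per_instance` holds each instance's |alpha| over all MC Refs, and the function names are illustrative:

```python
def select_reliable(instances, m):
    """Keep the top m% of one MC's instances by reliability score.
    `instances` is a list of (sentence_id, reliability) pairs."""
    ranked = sorted(instances, key=lambda x: x[1], reverse=True)
    k = max(1, round(len(ranked) * m / 100))
    return ranked[:k]

def valid_annotation_level(scores_per_instance, target_idx):
    """rho_{c_k}: the fraction of c_k instances whose highest |alpha|
    over the MC Refs falls on c_k's own Ref."""
    hits = sum(1 for s in scores_per_instance
               if max(range(len(s)), key=lambda j: abs(s[j])) == target_idx)
    return hits / len(scores_per_instance)
```

Averaging `valid_annotation_level` over the MCs gives the proportion m% of instances treated as reliable per class.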

Generating Synthetic Data
Regarding step (3) in Section 4.2, we designed a method for generating synthetic data particularized to RE that preserves the relation-type between entities, i.e., label-invariant augmentation. We utilized MLM and conducted the following steps: (1) fine-tuning a pretrained model on the training dataset with the MLM task, (2) after fine-tuning, incrementally masking one token at a time with the special token [MASK], from the beginning to the end of the target sentence, except for entity tokens, (3) predicting the masked token with the fine-tuned model, (4) replacing it using the top-k random sampling strategy [36], and (5) repeating steps (2) to (4) to generate K synthetic sentences per reliable instance (we set K to 300).
This approach introduces data diversity, minimizes the risk of relation-type change, and is independent of label noise, because the model learns the token distribution around the target entities during fine-tuning, which is irrelevant to the relation-type, and bidirectional-attention models, such as BERT, can exploit the preserved target entities to predict the masked token. The pseudo-code for generating synthetic data is provided in Algorithm 1.
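The generation loop can be sketched as follows, with `predict_logits` standing in for the fine-tuned MLM (a hypothetical callable, not an actual library API):

```python
import numpy as np

def topk_sample(logits, k, rng):
    """Top-k random sampling: restrict to the k highest-scoring tokens,
    then sample from their renormalized softmax."""
    idx = np.argsort(logits)[-k:]
    p = np.exp(logits[idx] - logits[idx].max())
    p /= p.sum()
    return int(rng.choice(idx, p=p))

def generate_variant(tokens, entity_positions, predict_logits, vocab, k, rng):
    """One pass of the augmentation loop: mask each non-entity token in turn,
    predict it with the (hypothetical) MLM `predict_logits(tokens, i)`, and
    replace it by top-k sampling. Entity tokens are never touched, which is
    what preserves the relation-type."""
    tokens = list(tokens)
    for i in range(len(tokens)):
        if i in entity_positions:
            continue                 # skip entity tokens: label invariance
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        logits = predict_logits(masked, i)
        tokens[i] = vocab[topk_sample(logits, k, rng)]
    return tokens
```

Running this K times per reliable instance yields the augmentation candidates, which are then re-scored by MCAM before entering the training set.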

Algorithm 1 Pseudo Code for Generating Augmentation Candidates
Data: the dataset $T_{clean}$ consisting of the selected, reliable MC instances
Parameter: learned masked language model parameters $\theta$
Initialize: an augmentation set $T_{aug} \leftarrow \{\}$
for each instance $S_i \in T_{clean}$ do
    count $\leftarrow$ 0
    while count < K do
        for each token $t_j \in S_i$ do
            if $t_j$ is an entity token then
                continue
            end if
            mask $t_j$, predict it with the MLM ($\theta$), and replace it by top-k sampling
        end for
        $T_{aug} \leftarrow T_{aug} \cup \{S_i'\}$
        count $\leftarrow$ count + 1
    end while
end for

Additional Training with MC Augmentation
To improve the model performance in predicting MCs, we trained the model for additional epochs with the augmented dataset and adopted two additional training strategies [37,38]: (1) freezing the backbone model parameters to preserve the information learned from the main training process, and (2) selectively training on the instances for which the model's prediction probability is lower than a predefined threshold, to prevent overfitting (details are provided in Appendix A). Additionally, label smoothing regularization [39] (LSR) was applied throughout the additional training process to mitigate the effect of label noise and for calibration [40,41]; its parameter $\epsilon$ was set to the average level of label noise calculated from Equation (17). Thus, the objective function for the additional training is $\mathcal{L} = \mathrm{LSR}(\mathcal{L}_{clf}; \epsilon) + \mathrm{LSR}(\mathcal{L}_{ref}; \epsilon) + \mathcal{L}_{flip}$, where $\mathrm{LSR}(\cdot\,; \epsilon)$ is the LSR operation parameterized by $\epsilon$.
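The smoothing operation can be sketched as below; this is a standard LSR form, and whether the smoothing mass is spread over L−1 or all L classes is an assumption:

```python
import numpy as np

def label_smoothing_ce(probs, y, eps, L):
    """Cross entropy against a smoothed target: the gold class gets 1 - eps
    and the remaining eps is spread uniformly over the other L - 1 classes.
    Here eps would be the estimated average label-noise level."""
    target = np.full(L, eps / (L - 1))
    target[y] = 1.0 - eps
    return float(-(target * np.log(probs)).sum())
```

With eps = 0 this reduces to the ordinary cross entropy, and a larger eps discounts the (possibly noisy) annotated label more strongly.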

Experiments
In the following sections, we evaluate the proposed methods. Our code is publicly available at https://github.com/henry-paik/EnhancingREMC (accessed on 25 June 2022).

Dataset and Baselines
We trained our models on the training dataset of TACRED [1], for which statistics are provided in Table 3. Experiments were performed on the test dataset of TACRED and on two extended TACRED datasets [6,29]. Alt et al. [6] corrected wrong labels and published a revised version of the TACRED dev and test datasets; this dataset is denoted as revised TACRED (Rev-TACRED). The dataset of Rosenman et al. [29] consists of challenging, adversarial samples designed to verify the robustness of models against so-called shallow heuristics, e.g., depending heavily on the existence of specific words or entity types in the sentence without understanding the actual relation between entities; it is denoted as challenging RE (CRE). We compared our model with the following models: (1) C-GCN [3], (2) LUKE [5], (3) SpanBERT [42], (4) KnowBERT [4], (5) RoBERTa-large [43], and (6) RE-marker [32].

Metrics
In addition to the micro F1 score (F1), we used the macro F1 score (Ma. F1), i.e., the average of the per-class F1 scores. Unlike F1, Ma. F1 is insensitive to the majority classes. For Rev-TACRED, we additionally adopted an MC F1 score and a weighted MC F1 score (W. MC F1). MC F1 is calculated on the four MCs while the other relation-types are ignored, measuring the model performance on MCs alone. W. MC F1 is an instance-wise weighted micro F1 score on the MC instances that measures model performance on difficult samples among the MCs, where a weight from 0 to 1 is assigned to each instance according to its difficulty, calculated from the seed models of [6]. Details are provided in Table A4.
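MC F1 as described can be sketched as a micro F1 restricted to the MC label set; this is a sketch of our reading of the metric, not the exact evaluation script:

```python
def mc_f1(y_true, y_pred, mc):
    """Micro F1 restricted to the MC label set: true positives, false
    positives, and false negatives are counted only where an MC label is
    involved, so the score reflects MC recognition alone."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p and t in mc)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p in mc and t != p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t in mc and t != p)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```

W. MC F1 would additionally weight each counted instance by its difficulty before aggregating.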
We also adopted positive accuracy (Acc+) and negative accuracy (Acc−) on CRE, which [29] developed for measuring robustness against leveraging spurious association. Consider two sentences S1 and S2: if a model relies on spurious association, then even though it correctly classifies S1 as person:date of birth, it is very likely to predict that the relation still holds in S2, which is incorrect. Acc− is calculated on the adversarial instances (such as S2), where the relation no longer holds. Thus, a high Acc− value suggests that a model is robust to such heuristics, i.e., it understands the actual relation between entities.

Implementation Details
In this experiment, we built our model, RE-MC, by equipping RoBERTa-large with MCAM, and trained it with nine data augmentation settings, varying the scale factor N and the minimum proportion S of token replacements relative to the entire tokens. We set N = {2, 4, 8}, by which the original size of the MC set (227) was multiplied; i.e., the total augmentation size would be 454, 908, or 1816, respectively, distributed evenly across the MCs. S was set to S = {0.1, 0.2, 0.3}, a constraint that the MLM generation with the pretrained model must satisfy. An empirical analysis of N and S is provided in Section 6.2.
We trained RE-MC with three different random seeds and selected the one that yielded the median F1 on the Rev-TACRED dev set. In the following sections, we report the results of the model trained with that seed. As for generating the synthetic dataset, we fine-tuned RoBERTa-base on the TACRED training dataset for 100 epochs. Other settings are provided in Appendix B.
As described in Table 1, the targeted MCs for our methods to improve are as follows: per:country_of_death (c_1), org:member_of (c_2), org:dissolved (c_3), and org:shareholders (c_4).

Table 4 presents the test results on TACRED and Rev-TACRED. The results show SOTA performance on the overall metrics, not only for MCs, which is meaningful in that our methods are biased neither toward MCs nor toward the majority classes. Compared with RE-marker, on which our model is based, we can see that MCAM and selective augmentation improved the overall model performance (F1 of 75.4% and 84.8% on TACRED and Rev-TACRED, respectively), which indicates that our approaches can be applied to other base models to reinforce MC prediction, i.e., they are model-agnostic, in that we simply added MCAM and selective augmentation to RE-marker to build our model. Subsequently, regarding W. MC F1, RE-MC outperforms the other models by a large margin of at least ∆26.9%, demonstrating the efficacy of our approaches in dealing with MCs. RE-MC (N = 8, S = 0.1), especially, is the most effective setting for dealing with MCs (49.1% and 71.4% on MC F1 and W. MC F1), even though it yields a relatively limited increase in overall F1 compared to the other settings. Furthermore, as shown in Table 5, the proposed approach is robust to heuristic methods, i.e., it rarely leverages spurious association, indicating that our augmentation strategy perturbs tokens while preserving relation-type invariance.

Table 5. The test scores on CRE. A model with a higher Acc− score and a smaller gap (Diff.) between Acc+ and Acc− is considered more robust to heuristic methods, i.e., spurious association. Results with † are from [29].

Significance Test
For the MC scores, we conducted a significance test because the number of MC instances in the Rev-TACRED test set was small: 18 (c_1: 10, c_2: 4, c_3: 1, c_4: 3). To increase the number of MC instances, we additionally took the refined annotation from [9] after manually inspecting the annotations. Finally, the significance test was conducted on a total of 33 MC instances (c_1: 14, c_2: 4, c_3: 4, c_4: 11). We performed bootstrapping 100,000 times, each with a sample size of 33, and calculated MC F1.
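The bootstrap procedure can be sketched as follows; this is a paired bootstrap over per-instance outcomes with a generic `metric` callable, whereas the actual test recomputed MC F1 on each resample:

```python
import numpy as np

def bootstrap_diff(outcomes_a, outcomes_b, metric, n_boot, rng):
    """Paired bootstrap: resample instance indices with replacement,
    recompute the metric for both models on the same resample, and collect
    the score differences. A 90% CI excluding 0 indicates significance."""
    n = len(outcomes_a)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        diffs.append(metric([outcomes_a[i] for i in idx])
                     - metric([outcomes_b[i] for i in idx]))
    lo, hi = np.percentile(diffs, [5, 95])          # 90% confidence interval
    return lo, hi
```

Pairing the resamples (the same indices for both models) removes the variance due to which instances were drawn, isolating the model difference.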
The results of the significance test between RE-MC (N = 2, S = 0.1) (bootstrapping mean of 42.3) and the two main competitive models, i.e., LUKE and RE-marker (bootstrapping means are both 21.1), show that the difference is significant at the 90% confidence level, as shown in Table 6 and Figure 3. Table 6 shows the lower and upper bounds of the 90% confidence interval, and Figure 3 shows the bootstrapped distributions of the differences between the MC F1 scores of our model and those of RE-marker and LUKE, respectively.

Table 7 shows the efficacy of our methods, such as selective augmentation, additional training, and LSR; removing each component causes significant performance deterioration in MC prediction. As for selective augmentation, it leads to significant improvements in MC prediction (MC F1 9.1 → 47.1), which indicates that it is the critical component for MC prediction. The removal of additional training degrades the MC prediction performance (MC F1 9.1 → 0). We can also see that LSR contributes to improving MC prediction (MC F1 27.6 → 47.1).

Augmentation Size and Token Replacements
To analyze the effects of the augmentation size and token replacements, we set up nine different MC augmentation datasets by varying the scale factor N = {2, 4, 8} and the minimum proportion of token replacements S = {0.1, 0.2, 0.3}, where the actual average proportions were 0.21, 0.28, and 0.35, respectively. Figure 4 shows the average scores of 30 models for each setting, i.e., the top ten models from each of three random seeds based on Rev-TACRED dev F1. Following the experimental results in Figure 4, we reported the scores of the optimal parameter combinations in Table 4.

As shown in plot (1, 1), all augmentation settings are effective, and the values are consistently higher than those of the other base models shown in Table 4 (the minimum F1 in plot (1, 1) is greater than 84%). For MCs, in plots (3, 1) and (3, 2), we can clearly see that MC prediction performance increases dramatically as N becomes larger, especially when S = 0.3. For example, given S = 0.3, the maximum differences arise between the cases of N = 2 and N = 8 in plot (3, 1), ∆13%, and plot (3, 2), ∆10.2%. This indicates that a low MC F1 is attributable to low source availability and that our augmentation approach functions properly.
Regarding F1 and Ma. F1 in plots (1, 1) and (2, 1), the trends are contrary to each other: the former decreases and the latter increases as N becomes larger. However, owing to the greater improvements in MCs shown in plot (3, 1), the drops in F1 are offset by the rapid increase in Ma. F1, which is evident when comparing the slopes in plots (1, 1) and (2, 1).

Conclusions
This study demonstrated that MC prediction in TACRED under label noise and low source availability could be improved by using MCAM with Refs and selective augmentation. The experimental results showed that the proposed methods significantly improved both the overall performance and MC prediction. Moreover, these methods are robust to heuristic methods. While our approaches proved effective in dealing with MCs for RE, we should further extend the MCAM architecture to other tasks where MC problems prevail but a textual Ref is not available. Our future work includes finding an appropriate proxy for Refs and strategies to embed MC information for other tasks.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Additional Training
Xie et al. [38] introduced Training Signal Annealing (TSA), gradually increasing the confidence threshold $\eta_t$ at every step $t$. We modified the schedule and adapted it to our additional training, as shown in Figure A1.

Figure A1. Exponential schedule. We introduce maximum ($\eta_{max}$) and minimum ($\eta_{min}$) thresholds, run a maximum of 6 epochs, and set the epoch E as 4, at which the schedule reaches $\eta_{max}$. Empirically, the exponential schedule is suitable for a model that struggles to learn MC patterns. We set

$$\eta_t = \eta_{min} + (\eta_{max} - \eta_{min}) \exp\!\left(\left(\frac{t}{T} - 1\right) \times 5\right),$$

where $T$ is the product of the total steps per epoch and $E$.
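Under the assumption of a standard TSA-style exponential curve (the constants here are illustrative, not necessarily the paper's exact schedule), the threshold can be sketched as:

```python
import math

def exp_threshold(t, T, eta_min, eta_max):
    """Exponential confidence-threshold schedule for additional training:
    rises slowly from near eta_min at step 0 and reaches eta_max at step T,
    so that easy (high-confidence) instances are excluded only late in
    training. Steps beyond T are clamped to eta_max."""
    frac = min(t / T, 1.0)
    return eta_min + (eta_max - eta_min) * math.exp(5 * (frac - 1))
```

Instances whose gold-class probability already exceeds `exp_threshold(t, T, ...)` would be dropped from the loss at step t, which is the selective-training strategy described above.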

Appendix B. Experimental Settings
Training and experiments were conducted on an Ubuntu 20.04 server with an Intel(R) Core(TM) i9-10980XE CPU and a GeForce RTX 3090 GPU. For TACRED, we used RoBERTa-large [43] as the backbone model, with a learning rate of 5 × 10−6 and a batch size of 4 for both the initial and the additional training. The checkpoint of the backbone model was obtained from https://huggingface.co/roberta-large (accessed on 25 June 2022); its number of parameters is 355M, and that of our model is 357M.
The hyper-parameter settings for RE-MC (N = 2, S = 0.1) are shown in Table A1, where the label smoothing parameter was determined using Equation (17), although statistical approaches to ratio estimation [44] or noise estimation [45] can also be used. The sizes of the augmentation datasets are provided in Table A2, and the MC distribution is provided in Table A3.
We searched hyperparameters as follows:
• learning rate: 1 × 10−5, 5 × 10−6, 1 × 10−6;
• batch size: 2, 4, 6.

Appendix C. Attention Guidance

Figure A2 shows that the model assigns a high reliability score to the intended reference vector of each MC. An instance with a blue-colored cell on the corresponding value vector of the reference sentence is likely to have a label error, i.e., an incorrect annotation.

Figure A2. Heatmaps of the absolute attention scores of each MC instance in the TACRED training dataset over the value vectors of the reference sentences. The Y-axis represents instances, the X-axis represents reference sentences, and the value is the reliability score. The expected heatmap for an ideal dataset and MCAM shows high values (almost purple) in the i-th column of the i-th figure, where the scores of the c_i instances are plotted.

Appendix D. Reference Sentence
We provide the reference sentences for TACRED as follows:
• org:member_of: 'the relation is "organization and member of". organization: a group of people or other legal entities with an explicit purpose and written rules. member: one who officially belongs to a group, a part of a whole, one of the persons who compose a social group (especially individuals who have joined and participate in a group organization). of: having a partitive effect, introduces the whole for which is indicated only the specified part or segment, from among, indicates a given part.'
• org:dissolved: 'the relation is "organization and dissolved". organization: a group of people or other legal entities with an explicit purpose and written rules. dissolve: stop functioning or cohering as a unit, to terminate a union of multiple members actively, as by disbanding, to destroy, make disappear.'
• per:country_of_death: 'the relation is "person and country of death". person: an individual, usually a human being. country: the territory of a nation, especially an independent nation state or formerly independent nation, a political entity asserting ultimate authority over a geographical area, a sovereign state, a politically organized body of people under a single government. death: the cessation of life and all associated processes, the end of an organism's existence as an entity independent from its environment and its return to an inert, nonliving state, the event of dying or departure from life'.
• org:shareholders: 'the relation is "organization and shareholders". organization: a group of people or other legal entities with an explicit purpose and written rules. shareholder: one who owns shares of stock in a corporation, someone who holds shares of stock in a corporation.'

Appendix E. Model Performance on MCs
Alt et al. [6] tested 52 RE models on the Rev-TACRED test set; we used their experimental results to calculate, for each class, the average number of models that made a correct prediction (#correct predictions / #RE models). For example, regarding org:dissolved, on average 0.5 RE models correctly classify the instances belonging to org:dissolved. As shown in Table A4, models generally failed to predict MCs.
As for the metric W. MC F1, the instance-wise difficulty weight is calculated from the same experimental source on which Table A4 is based. For example, the instance below belongs to per:country_of_death, but none of the 52 models predicted it correctly; hence, the weight of this instance is one (i.e., 52/52): • where we omitted the name and unimportant tokens.