Improving Distant Supervised Relation Extraction with Noise Detection Strategy

Distant supervised relation extraction (DSRE) is widely used to extract novel relational facts from plain text, so as to enrich knowledge graphs. However, distant supervision inevitably suffers from the noisy labeling problem, which severely damages the performance of relation extraction. Currently, most DSRE methods focus mainly on reducing the weights of noisy sentences, ignoring the bag-level noise where all sentences in a bag are wrongly labeled. In this paper, we present a novel noise detection-based relation extraction approach (NDRE) that automatically detects noisy labels with entity information and dynamically corrects them, which can alleviate both instance-level and bag-level noise problems. By this means, we can extend the dataset with Web tables without introducing more noise. In this approach, to embed the semantics of sentences from the corpus and web tables, we first propose a powerful sentence encoder that inserts a multi-head self-attention mechanism between the convolution and piecewise max-pooling layers of a piecewise convolutional neural network. Second, we adopt a noise detection strategy, which dynamically detects and corrects the original noisy label according to the similarity between the sentence representation and entity-aware embeddings. Then, we aggregate the information from the corpus and web tables to make the final relation prediction. Experimental results on a public benchmark dataset demonstrate that our proposed approach achieves significant improvements over state-of-the-art baselines and can effectively reduce the noisy labeling problem.


Introduction
Knowledge graphs (KGs) play a crucial role in natural language processing (NLP). KGs such as Freebase [1] and DBpedia [2] have shown strong knowledge organization capability and are used as data resources in many NLP tasks, including semantic search, intelligent question answering and text generation, among others. These KGs are mostly composed of relational facts in the form of triplets such as <Warren Buffett, born_in, Omaha>. However, as knowledge constantly grows and changes, existing KGs are far from complete. To fill this gap, relation extraction (RE), which aims to identify the relation r between a given pair of entities <e1, e2> from unstructured text, is an essential task in NLP.
As manually labeled data is insufficient for traditional supervised RE systems, distant supervision (DS) [3] is proposed to automatically construct large-scale labeled training data by aligning entities in text corpus and corresponding KGs. The assumption of DS is that if there is a relational fact <e1, r, e2> in the knowledge graph, all sentences mentioning <e1, e2> will express the relation r. Otherwise, if there is no relation between <e1, e2> in the knowledge graph, the sentence mentioning them will be labeled as "not a relation" (NA).
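The DS labeling heuristic above can be sketched in a few lines. This is a minimal illustration with toy in-memory structures; the function name and data layout are hypothetical, and real pipelines align a full corpus with a KG such as Freebase:

```python
# A minimal sketch of distant-supervision labeling (illustrative only).
def distant_label(sentences, kg):
    """Label each (text, head, tail) triple with the KG relation, or 'NA'."""
    labeled = []
    for text, head, tail in sentences:
        # Strong DS assumption: any sentence mentioning (head, tail)
        # is taken to express their KG relation.
        relation = kg.get((head, tail), "NA")
        labeled.append((text, head, tail, relation))
    return labeled

kg = {("Warren Buffett", "Omaha"): "born_in"}
sentences = [
    ("Warren Buffett was born in Omaha.", "Warren Buffett", "Omaha"),
    ("Warren Buffett gave a speech in Omaha.", "Warren Buffett", "Omaha"),
    ("Bill Gates visited Seattle.", "Bill Gates", "Seattle"),
]
labeled = distant_label(sentences, kg)
```

Note how the second sentence also receives the label "born_in" even though it does not express that relation: this is exactly the wrong labeling problem the strong assumption causes.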
DS has been widely applied to relation extraction and has achieved good results. At present, many studies have applied deep learning to DS and improved distant supervised relation extraction (DSRE) by automatically learning text features [4][5][6]. In addition, some scholars have found that incorporating additional knowledge can further improve DSRE, and this direction has gained great attention [7][8][9][10]. Among them, Deng et al. [10] proposed a hierarchical framework to fuse information from DS and web tables that share entity relationship facts, which greatly improves the efficiency of DSRE.
However, the above methods follow the strong assumption of DS, and since this assumption does not always hold, it results in a wrong labeling problem. Moreover, although adding additional knowledge can improve relation extraction, using two-hop DS data can introduce even more new noise because the strong assumption remains unchanged.
To mitigate DS noise, a multi-instance learning (MIL) framework [11][12][13] was introduced into DSRE to relax the strong assumption to the at-least-one principle. In the MIL framework, all instances mentioning <e1, e2> constitute an instance bag that shares a common label r. The at-least-one principle states that at least one instance in the bag can imply the relation r, and the bag, rather than individual instances, is treated as the sample for training RE models. However, these feature-based methods rely heavily on accurate features derived by NLP tools, which may suffer from an error propagation problem.
To alleviate the noise problem, many RE studies based on the MIL framework employ neural networks with a selective attention mechanism to assign weights to different instances within the bag [14][15][16][17][18], and all achieve good results. However, these selective attention methods still assign a certain weight to noisy instances (false positive instances); in particular, when a bag composed of a single instance is wrongly labeled, as illustrated in Table 1, selective attention cannot work on such a bag-level noise problem. According to the statistics [19], nearly 28% of one-instance bags in the NYT dataset are incorrectly labeled, which seriously hurts the performance of RE.

Table 1. Examples of instances wrongly labeled by distant supervision. There are two bags of instances, B1 and B2, in the table, which are labeled with "play" and "famous_in", respectively. The bold words in instances are entities. The first instance in B1 is correctly labeled, and the second is wrongly labeled due to the strong assumption of distant supervision. In this situation, the selective attention mechanism might assign weights of 0.9 and 0.1 to the two instances, respectively. Meanwhile, the only instance in B2 is mislabeled due to the incomplete knowledge graph, and its weight will be set to 1 by selective attention. Therefore, there is instance-level noise in B1 and bag-level noise in B2, respectively.

In this paper, we propose a novel noise detection-based relation extraction model (NDRE), which automatically distinguishes true positive and false positive cases during training by evaluating the correlation between sentences and labels, so as to alleviate the noisy labeling problem at both the instance and bag levels and to avoid adding new noisy labels while integrating two-hop DS data, thus further improving the performance of DSRE.

Specifically, in the proposed framework: (1) To learn a comprehensive sentence representation for each sentence in the corpus and web tables (high-quality relational tables extracted from http://websail-fe.cs.northwestern.edu/TabEL/#content-code (access date: 19 April 2013) [10]), we first combine the multi-head self-attention mechanism and the piecewise convolutional neural network (PCNN) in the sentence encoder; (2) We then use a noise detection strategy to address the issue of noisy labeling. To evaluate the correlation between sentences and labels, we calculate the similarity between entity-aware embeddings and each sentence representation. According to the similarity score, we can judge whether the sentence expresses the labeled relation; that is, whether the sentence is a true positive or a false positive. The label of a detected false positive sentence is dynamically corrected to NA during training; (3) To fuse information from the corpus and web tables, we utilize a bag aggregation method to balance their impact on the predicted relation. The experimental results on a real-world DS dataset show that our model performs better than state-of-the-art baselines.
The remainder of this article is organized as follows: In Section 2, we introduce related work on relation extraction. The detailed methodology of the NDRE is described in Section 3. Section 4 presents the experimental results and their analysis in detail. Finally, Section 5 concludes and discusses future work.

Related Work
Relation extraction is an important research task in the NLP area. Many works regard RE as a supervised classification task and have achieved good results. One of the main defects of these conventional supervised relation extraction methods is the lack of manually annotated training data, which is expensive and time-consuming. To deal with this issue, distant supervision was proposed by Mintz et al. [3] to generate a large amount of training data automatically by aligning text to corresponding KGs.
At present, many DSRE works are combined with neural network models and achieve good results. Socher et al. [4] and Zeng et al. [5] employed a recursive neural network (RNN) and a convolutional neural network (CNN), respectively, to obtain text representations. Zeng et al. [6] proposed a piecewise convolutional neural network for RE; they considered the relative position information of each word and the entities and adopted piecewise pooling around the entity positions to retain more fine-grained information in the sentence. In addition, incorporating additional knowledge associated with text can further improve DSRE and has become one of the new research directions. Ji et al. [7] proposed a sentence-level attention model that made full use of extracted entity descriptions to provide more background knowledge. Vashishth et al. [8] employed graph convolution networks to capture syntactic information from text and utilized available side information such as entity types and relation aliases to improve RE. Beltagy et al. [9] combined distant supervision with directly supervised data and used it to improve the weights of relevant sentences. Deng et al. [10] proposed a hierarchical framework to fuse information from DS and web tables that share relational facts about entities to further improve RE.
The above methods have achieved good results; however, because of the strong assumption of DS, the generated data inevitably contains mislabeled sentences. Moreover, adding two-hop DS data may introduce new noise. Therefore, how to fully combine the existing entity data with additional knowledge while getting rid of the noise problem is an urgent issue for RE.
The multi-instance learning (MIL) framework is an important method of noise reduction in DSRE. Riedel et al. [11] introduced a multi-instance single-label learning framework to RE and assumed that the positive instance must exist and the instance with the highest confidence can express the label. Hoffmann et al. [12] and Surdeanu et al. [13] adopted multi-instance multi-label learning to model multiple relations between entities.
In recent years, to alleviate the noise problem, many RE studies based on the MIL framework have employed neural networks with a selective attention mechanism to assign weights to different instances within the bag. Lin et al. [14] proposed a CNN-based model with a sentence-level attention mechanism to assign more weight to effective instances and reduce the weights of noisy instances. Liu et al. [15] introduced a soft-label method to exploit valid information from correctly labeled entity pairs to help noisy instances. Jat et al. [16] proposed a word attention model based on the Bidirectional Gated Recurrent Unit (BiGRU) and an entity-centric attention model to identify key words in sentences. Xiao et al. [17] proposed a hybrid attention-based Transformer block to obtain word-level features and constitute the bag representation. He et al. [18] presented a reinforcement-learning-based framework to choose positive instances and make full use of unlabeled instances.
Although the above methods have achieved good results, they still use noisy sentences in training and do not consider the problem of bag-level noisy labeling, and so do not fundamentally eliminate the noise problem.
To alleviate both instance-level and bag-level mislabeling problems, we propose the NDRE, which detects noisy instances with entity information and dynamically corrects their wrong labels to NA. This reduces the influence of noisy instances and bags, allows the corpus and two-hop DS data to be fused effectively without introducing new noisy labels, and thus greatly improves the performance of RE.

Proposed Method
In this section, we describe our proposed method, the NDRE, in detail. The overall framework of the NDRE consists of three main components as shown in Figure 1.

The NDRE consists of three components: (1) A sentence encoder, which includes two parts: the input representation and the encoding layer. The input representation takes the combination of word embedding and position embedding as the vector representation of each sentence in the bag. The encoding layer consists of three modules: convolution, multi-head self-attention and piecewise max-pooling, which extract the semantic features implicit in the sentence representation; (2) A noise-detection strategy that automatically detects noisy labels during the training process. We introduce entity embeddings to evaluate whether the sentences in the bag can express the target relation. Sentences that cannot express the target relation are regarded as noise, and we remove the noise by modifying their labels; (3) A bag aggregation method, which first obtains the bag-level representations of the corpus and the web tables through selective attention, and then combines these two bags in a balanced manner to obtain the final sentence-bag representation.
In the following, we first introduce the task definition and notation, and then provide a detailed formalization of the NDRE.

Task Definition and Notation
Following the multi-instance learning paradigm, we are given a bag of sentences (instances) S_h,t = S^1_h,t ∪ S^2_h,t = {s1, s2, . . .} with a pair of target entities (h, t), where S^1_h,t is composed of all sentences mentioning (h, t) in the corpus and S^2_h,t contains sentences mentioning the anchors of (h, t) discovered in the web. The anchors are defined as two entities in web tables co-occurring with (h, t), denoted as {(h1, t1), (h2, t2), . . .}. The goal of relation extraction is to predict the relation r from a predefined relation set R = {r1, r2, . . . , rl}, where l is the number of distinct relation categories. If no relation exists, the label of the sentence is assigned NA.
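The bag construction described above can be sketched as follows; the function name and dictionary layout are hypothetical illustrations of the multi-instance setting, not code from the paper:

```python
from collections import defaultdict

# Illustrative sketch: group DS-labeled sentences into per-entity-pair bags.
def build_bags(labeled_sentences):
    """Group (text, head, tail, relation) tuples into bags keyed by (head, tail)."""
    bags = defaultdict(lambda: {"sentences": [], "label": "NA"})
    for text, head, tail, relation in labeled_sentences:
        bag = bags[(head, tail)]
        bag["sentences"].append(text)
        bag["label"] = relation  # all sentences in a bag share one DS label
    return dict(bags)

data = [
    ("Buffett was born in Omaha.", "Warren Buffett", "Omaha", "born_in"),
    ("Buffett spoke in Omaha.", "Warren Buffett", "Omaha", "born_in"),
    ("Gates visited Seattle.", "Bill Gates", "Seattle", "NA"),
]
bags = build_bags(data)
```

Under the at-least-one principle, each such bag (not each sentence) becomes one training sample with the shared label.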

Sentence Encoder

Input Representation
Given a sentence s ∈ S_h,t consisting of m words {w1, w2, . . . , wm}, each word wi is first mapped into a word embedding wi ∈ R^d_w, where d_w is the dimension of word embedding. Following previous works [6,14], our initial word embeddings are pre-trained by the skip-gram method [20]. To encode the position feature of the target entities and words, position embedding as proposed by Zeng [5] is applied in our work.
Position embedding describes the relative distances of the current word wi to the two target entities, which are further mapped into two low-dimensional vectors, p^e1_i and p^e2_i, respectively, where p^e1_i, p^e2_i ∈ R^d_p. As illustrated in Figure 2, in the sentence "Donald Trump became the first billionaire president of the United States in January 2017.", the relative distances of the word (president) to the head entity (Donald Trump) and tail entity (the United States) are 5 and −2, respectively. We concatenate the word embedding and the two position embeddings as the final word representation. In this way, the input sentence representation of s can be denoted as a vector sequence S = {x1, x2, . . . , xm}, where xi ∈ R^(d_w + 2×d_p).
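The relative-distance feature can be sketched as below. Anchoring each entity at a single token index and the clipping distance are assumptions for illustration (the paper does not specify its clipping scheme):

```python
# Illustrative sketch of position features: signed token distances to the
# head and tail entities, clipped to a maximum range before embedding lookup.
def relative_positions(tokens, head_idx, tail_idx, max_dist=30):
    """Return two lists of signed distances, one per target entity."""
    clip = lambda d: max(-max_dist, min(max_dist, d))
    pos_head = [clip(i - head_idx) for i in range(len(tokens))]
    pos_tail = [clip(i - tail_idx) for i in range(len(tokens))]
    return pos_head, pos_tail

toks = "Donald Trump became the first billionaire president of the United States".split()
# Anchor the head entity at "Trump" (index 1) and the tail at "the" (index 8);
# these anchor choices are assumptions that reproduce the paper's example.
pos_head, pos_tail = relative_positions(toks, 1, 8)
```

With these anchors, "president" (index 6) gets distances 5 and −2, matching the Figure 2 example; each distance is then looked up in a learned position-embedding table of dimension d_p.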

Encoding Layer
The piecewise convolutional neural network (PCNN) [6] has been utilized by a number of previous RE works and its powerful capability of capturing local features and positional information of sentences has been proven. However, PCNN usually performs badly when processing long sentences due to its limitation of not being able to obtain the global dependencies of sentences. Therefore, a multi-head self-attention mechanism is employed in our framework to obtain long-distance dependency information from multiple perspectives.
As shown in Figure 3, given an input representation S, we first apply a convolution kernel with a sliding window over S, where the window size is set to k. In order to keep the original size of the input, we add (k−1)/2 padding tokens on both sides of the sentence boundaries. The hidden output of the convolution layer can be expressed as:

H_c = CNN(S), H_c ∈ R^(m×d_c)    (1)

where CNN(·) denotes the convolution operation and d_c is the dimension of the convolution layer.

Next, to exploit the global dependency of a sentence, we adopt the multi-head self-attention mechanism of the Transformer [21], which has achieved promising results in most NLP tasks. The basic idea of self-attention is to model the interaction between each word and the whole sentence: the importance (weight) of each word is calculated from the degree of this interaction and used to adjust the representation of the sentence. In this way, the global information of the whole sentence is contained in the new representation.

Formally, the self-attention mechanism is defined as follows:

Attention(a, a, a) = softmax(aa^T / √d_a) a    (2)

where the softmax(·) function calculates the weights applied to the words, T is the transpose operation, aa^T is the dot product, d_a is the dimension of a, and the purpose of scaling by √d_a is to avoid the dot product result becoming too large. Further, we adopt a multi-head operation on the self-attention mechanism to extract features at different levels of a sentence, yielding the representation H_a. Finally, piecewise max-pooling is applied:

s = [relu(Pool(H_a^1)); relu(Pool(H_a^2)); relu(Pool(H_a^3))]    (3)

where H_a is segmented into H_a^1, H_a^2 and H_a^3 according to the positions of (h, t), Pool(·) is a max-pooling operation and relu(·) is a non-linear activation function. As a result, s ∈ R^(3×d_c), concatenated from the three pooling results, is the final sentence vector representation.
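The piecewise max-pooling step can be sketched with NumPy. The segment-boundary convention (segments end inclusively at each entity position) and the assumption that neither entity sits at the very end of the sentence are illustrative choices:

```python
import numpy as np

# Illustrative sketch of piecewise max-pooling over a (m, d_c) hidden matrix.
def piecewise_max_pool(hidden, head_pos, tail_pos):
    """Max-pool each of the three segments split at the entity positions,
    then concatenate into a single 3*d_c vector."""
    lo, hi = sorted((head_pos, tail_pos))
    segments = [hidden[:lo + 1], hidden[lo + 1:hi + 1], hidden[hi + 1:]]
    # Assumes all three segments are non-empty (entities not at the last token).
    pooled = [seg.max(axis=0) for seg in segments]
    return np.concatenate(pooled)

hidden = np.arange(12, dtype=float).reshape(4, 3)  # m=4 tokens, d_c=3
vec = piecewise_max_pool(hidden, head_pos=0, tail_pos=2)
```

Pooling per segment, rather than over the whole sentence, preserves coarse positional structure relative to the two entities.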

Noise Detection Strategy
As mentioned before, DSRE suffers from the wrong labeling problem whether the data comes from the corpus or web tables. Especially when there is only one sentence in the bag, an incorrect label will greatly damage the performance of the model. Unlike previous works that rely solely on an attention module to alleviate noise, we propose a noise-detection strategy to detect and revise noisy labels during training. Specifically, since RE aims to predict the relation between a pair of entities, we incorporate the embeddings of the entities into our model. We use the similarity between entity embeddings and the sentence representation to estimate whether the current sentence can imply a relation between the target entity pair. If the current positive relation label is not expressed by the sentence, the label is modified to NA and the sentence is treated as invalid in the training of the current relation. In this way, the noisy labeling problem of distant supervision is largely eliminated.
Motivated by the translation-based knowledge graph embedding approaches [22][23][24], we utilize the difference vector between the target entity pair as an additional feature. Specifically, we use r ht = t − h to reflect some information of relation between entities, where h, t ∈ R d w are pre-trained word embeddings.
Here, we adopt the Euclidean distance to measure the similarity of r_ht and the sentence representation s. A two-layer feed-forward network is then applied to the distance, which can be written as:

sml = σ(W2 relu(W1 d(r_ht, s) + b1) + b2)    (4)

where W1, W2, b1 and b2 are learnable parameters, σ(·) is the sigmoid function, and d(·,·) is the Euclidean distance function. Then, we correct the original relation label r using the similarity score sml as follows:

r' = r if sml < φ, and r' = NA otherwise    (5)

where φ is a pre-defined threshold. If the similarity score sml is less than the threshold φ, we consider that the sentence can express the current relation r; otherwise, r is modified to NA.
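The label-correction rule can be sketched as below. The `score_fn` stand-in and its default sigmoid are hypothetical substitutes for the paper's learned two-layer feed-forward network; the threshold 0.16 is the value reported in the hyper-parameter settings:

```python
import numpy as np

# Illustrative sketch of noise detection: flip a DS label to 'NA' when the
# sentence representation is far from the entity difference vector t - h.
def correct_label(sentence_vec, head_emb, tail_emb, label, phi=0.16,
                  score_fn=None):
    """Return the (possibly corrected) label for one sentence."""
    r_ht = tail_emb - head_emb            # translation-style relation feature
    dist = np.linalg.norm(r_ht - sentence_vec)
    # Hypothetical stand-in for sigma(W2 relu(W1 d + b1) + b2):
    sml = score_fn(dist) if score_fn else 1.0 / (1.0 + np.exp(-(dist - 3.0)))
    return label if sml < phi else "NA"   # small sml => sentence fits the label
```

Because the correction happens inside the training loop, a bag whose only sentence is mislabeled (bag-level noise) is neutralized rather than merely down-weighted.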

Bag Aggregation
After we use the noise-detection strategy to eliminate the noise, we obtain a new set of bag-level labels {r1, r2, . . .} instead of just one label r for S_h,t. Note that not all sentences in S_h,t necessarily have the positive label r, and different positive sentences express r to different degrees.
Based on the above-mentioned reasons, we adopt the frequently used selective attention over S = {s1, s2, . . .} to obtain the bag-level representation B. Selective attention with a query vector of relation r_i makes the noisy sentences invalid so that they no longer contribute to the representation of the bag; in addition, higher weights are assigned to the sentences expressing r_i more clearly. The representation of the bag is computed as follows:

α_j = exp(q_ri^T s_j) / Σ_{k=1..n} exp(q_ri^T s_k)    (6)
B = Σ_{j=1..n} α_j s_j    (7)

where q_ri ∈ R^(3×d_c) is a learnable query vector of relation r_i and n is the number of sentences in the current bag. Since the numbers of sentences contained in S^1_h,t and S^2_h,t are not of the same order of magnitude, S^1_h,t and S^2_h,t each obtain their own representation through Formulas (6) and (7), denoted as B1 and B2. To balance their impact on relation prediction, they are assigned different proportions in the final sentence-bag representation y, which is formulated as

y = λ B1 + (1 − λ) B2    (8)

where λ ∈ (0, 1) is calculated by

λ = σ(W_λ [B1; B2; q_r] + b_λ)    (9)

where W_λ ∈ R^(1×(9×d_c)) is the transformation matrix, b_λ ∈ R^1 is the bias, and q_r ∈ R^(3×d_c) is a learnable query vector of the bag label r.
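The two-stage aggregation above can be sketched with NumPy. The function names are illustrative, and `lam` here is passed in directly rather than computed by the learned gate:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sketch of selective attention over one bag of sentence vectors.
def bag_representation(sent_vecs, query):
    """Weight each sentence by its dot-product match with the relation query,
    then sum into one bag vector."""
    scores = np.array([query @ s for s in sent_vecs])
    alpha = softmax(scores)
    return (alpha[:, None] * np.stack(sent_vecs)).sum(axis=0)

# Balanced fusion of the corpus bag B1 and the web-table bag B2.
def fuse_bags(b1, b2, lam):
    """y = lam * B1 + (1 - lam) * B2, with lam in (0, 1)."""
    return lam * b1 + (1 - lam) * b2
```

A sentence aligned with the query vector dominates the bag representation, while poorly matching (likely noisy) sentences receive near-zero weight.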

Classification and Objective Function
DSRE can be regarded as a classification task. To obtain the predicted relation category, y is fed into a multi-layer perceptron (MLP), whose final output corresponds to all relation categories. Then, we employ an l-way softmax function over the final output to calculate the conditional probability P of each relation category:

P(r_i | B, θ) = softmax(MLP(y))_i    (10)

Supposing that there are N bags in the training set {B1, B2, . . . , BN} with their corresponding target relation labels {L1, L2, . . . , LN}, we define the objective function using cross-entropy:

J(θ) = − Σ_{i=1..N} log P(L_i | B_i, θ)    (11)

where θ is the parameter set of the model. We use the mini-batch stochastic gradient descent (SGD) optimizer to minimize the objective function.
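The per-bag cross-entropy term can be sketched as follows; the logits stand in for the MLP output over all relation categories:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative sketch of the per-bag objective: J = -log P(L | B).
def bag_loss(logits, label_idx):
    """Cross-entropy loss for one bag given its target relation index."""
    return -np.log(softmax(logits)[label_idx])
```

Summing `bag_loss` over the N training bags gives the full objective, which SGD then minimizes.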

Dataset
We evaluated our model on the widely used NYT dataset developed by Riedel et al. [11]. The NYT dataset was generated automatically by aligning the entities in Freebase with the New York Times corpus. This dataset contains 53 relations, including a negative class NA, which means the relation of an entity pair is unavailable. In addition, we used the WikiTable corpus dataset aligned with the NYT; the dataset and details on how it was built were provided by Deng et al. [10]. In this paper, the NYT and WikiTable corpora are denoted as S^1_h,t and S^2_h,t, respectively.

Comparison with Baselines
Following previous studies [3,11], we conducted our experiments with held-out evaluation instead of costly human evaluation. Held-out evaluation compares predicted relational facts in text with existing facts in Freebase. To verify the effectiveness of our model, we adopted precision-recall (PR) curves as the evaluation metric, which shows the trade-off between precision and recall. To further quantify the PR curves, we also report precision at recall (P@Recall) and the area under the curve (AUC). P@Recall indicates the precision values at specific recall rates, and AUC evaluates the overall performance of a PR curve. We selected the following baselines to compare with our model, the NDRE:
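The PR-curve points used for these metrics can be sketched as below; this is a minimal illustration (libraries such as scikit-learn provide equivalent, more robust routines):

```python
import numpy as np

# Illustrative sketch: precision/recall at each threshold for held-out eval,
# scanning predictions from the highest-scored fact downward.
def pr_points(scores, labels):
    """Return (precision, recall) arrays, one point per ranked prediction.
    labels are 1 for facts confirmed in the KG, 0 otherwise."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                      # true positives so far
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return precision, recall
```

P@Recall reads off `precision` at chosen `recall` values, and AUC integrates precision over recall (e.g., by the trapezoidal rule).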

Word and Position Embeddings
Our model uses word embeddings pre-trained by the Word2vec tool on the NYT corpus as initial values. Position embeddings were initialized with the Xavier [25] initialization. The input of all models was the concatenation of word embeddings and position embeddings.

Hyper-Parameter Settings
To demonstrate the effectiveness of the NDRE, we set most hyper-parameters in the NDRE following previous work [10,14] and adjusted other hyper-parameters using cross-validation on the training data.
After adjustment, we chose the following settings for our model: the dimension of word embedding d_w was set to 50, the dimension of position embedding d_p to 5, the dimension of the convolution layer d_c to 256, the window size of the convolution kernel k to 3, the number of attention heads A to 8, the dropout rate to 0.5, and the similarity threshold φ to 0.16, chosen empirically. Following the approach in [10], we first pre-trained our sentence encoder and noise detection strategy with only the data S^1_h,t, then fine-tuned the whole model with the data S_h,t = S^1_h,t ∪ S^2_h,t. Pre-training ran for about 50 epochs and fine-tuning for about 180; the batch size was fixed to 64 for pre-training and 2500 for fine-tuning; the learning rate was set to 0.005 for pre-training and 0.001 for fine-tuning.

Overall Evaluation Results
From Figure 4, the following observations can be made: (1) Among all the baselines, our NDRE achieves the best performance over the entire recall range; (2) The NDRE performs much better than PCNN+ATT, BGWA, and PCNN+ATT+SL, indicating that our noise-detection strategy is superior to the ordinary selective attention mechanism and to soft labeling based on correctly labeled instances in alleviating the noisy labeling problem; (3) The NDRE substantially outperforms RESIDE and Direct+Dis, which utilize extra information, and even achieves higher performance than the recently proposed PCNN+PU. This demonstrates the effectiveness of our work.

Ablation Results
To further verify the effectiveness of the main components in the NDRE, we conducted an extra ablation study under three testing modes:


• Single: Only bags composed of a single instance were selected for relation extraction.
• Multiple: Only bags composed of more than one instance were selected for relation extraction.
• Whole: The whole dataset was used for relation extraction.

Figure 5 shows the PR curves of the compared models under the three testing modes (single, multiple, whole), and the corresponding P@Recall and AUC values are listed in Table 2.
In particular, NDRE(-Label) denotes removing the noise-detection strategy, and NDRE(-MHA) denotes removing the multi-head self-attention mechanism. REDS2 is the first work to fuse NYT and web tables data, which is the dataset we use.
From these results, we can observe that: (1) The performance of the NDRE declines significantly when its components are removed. In particular, removing the noise detection strategy causes decreases of 9.9%, 7.6%, and 9% in AUC under the three testing modes, respectively. The overall ordering is NDRE > NDRE(-MHA) > NDRE(-Label) > REDS2, which shows that our sentence encoder better captures semantic features and that the noise-detection strategy greatly improves DSRE. (2) Under the "Single" testing mode, NDRE and NDRE(-MHA) improved AUC by 12% and 4.8%, respectively, compared with REDS2. This demonstrates that the noise-detection strategy can effectively detect and filter out noisy bags consisting of single instances. (3) Under the "Multiple" testing mode, NDRE and NDRE(-MHA) outperformed REDS2, improving AUC by 14.7% and 9.2%, respectively. This indicates that the noise-detection strategy can effectively detect wrongly labeled instances within a bag and then make full use of the information contained in the true positive instances. (4) The AUC of the NDRE under the "Whole" testing mode was 0.5094, establishing a new state-of-the-art performance.

Case Study
To show the capabilities of the noise-detection strategy, we selected four representative instances from the training process for a case study. As shown in Table 3, each instance has a corresponding sml calculated as in Section 3.3. If the sml exceeds the threshold φ, the label of the instance is dynamically modified to NA during training; r is the original relation label and r' is the new label produced by the noise-detection strategy. The correct relation of the current instance is marked in blue, and the wrong relation is marked in red. From the table, we can see that: (1) Except for the first instance, none of the instances were true positives. The second and third instances were detected, via their sml, as expressing no relation. The second instance was noisy and its label was correctly modified to NA by the noise-detection strategy. (2) The original label of the fourth instance was people/person/place_of_birth, but its correct label was people/person/place_lived. The noise was not detected because the head entity "Peggy" has the relation people/person/place_of_birth with "Miami". Thus, noise is hard to detect when there are multiple relations in an instance.

Table 3. Case study: a real example in our dataset for the noise-detection strategy. "Noise" represents whether the instance was detected as a noisy one, and "Correct" represents whether the final label of the instance was correct.

Conclusions
To solve the noisy labeling problem, this paper proposes a novel distant supervised method, named the NDRE, which utilizes a noise-detection strategy to eliminate false positive labels and improve the quality of training. The NDRE fuses noise-filtered information from the corpus and web tables, and employs a PCNN enhanced with a multi-head self-attention mechanism to extract the semantic features of sentences. As a result, our model achieves better performance than competitive baselines on the benchmark dataset in terms of commonly used evaluation metrics.
In the future, we plan to explore reinforcement learning for DSRE to identify noisy instances and learn the correlation between false positive and true positive instances. In addition, there is a lot of information hidden in the instances labeled as NA. We tend to make full use of this potential information to further improve relation extraction.