A Machine Learning Filter for the Slot Filling Task

Slot Filling, a subtask of Relation Extraction, represents a key aspect for building structured knowledge bases usable for semantic-based information retrieval. In this work, we present a machine learning filter whose aim is to enhance the precision of relation extractors while minimizing the impact on the recall. Our approach consists in the filtering of relation extractors’ output using a binary classifier. This classifier is based on a wide array of features including syntactic, semantic and statistical features such as the most frequent part-of-speech patterns or the syntactic dependencies between entities. We experimented the classifier on the 18 participating systems in the TAC KBP 2013 English Slot Filling track. The TAC KBP English Slot Filling track is an evaluation campaign that targets the extraction of 41 pre-identified relations (e.g., title, date of birth, countries of residence, etc.) related to specific named entities (persons and organizations). Our results show that the classifier is able to improve the global precision of the best 2013 system by 20.5% and improve the F1-score for 20 relations out of 33 considered.


Introduction
In the age of structured knowledge bases such as Google Knowledge Graph [1], DBpedia [2] and the Linked Open Data cloud [3], relation extraction is becoming a very important challenge for enhanced semantic search.Relation extraction and its sub-task, slot filling, have been very active in recent years and have been subject to several evaluation campaigns that assess the ability of automatically extracting previously known relations from corpora.Despite some progress, the results of these competitions remain limited.Relation extraction generally consists of extracting relations from unstructured information.This is done by identifying the meaningful links between named entities.Slot filling goes one step further by providing a template defining the relations for which a named entity needs to be linked.In this paper, we focus on the Text Analysis Conference (TAC) Knowledge Base Population (KBP) English Slot Filling (ESF) track, which is an evaluation campaign that targets the extraction of 41 pre-identified Wikipedia info-box relations (e.g., title, date of birth, countries of residence, etc.) related to specific named entities (persons and organizations) [4].In this task, a named entity (the query entity) and a relation (the slot) are submitted to the system, which must find every other entity (the filler) that is linked to this entity with this particular relation, and must return a textual segment that justifies this result [4,5].The goal of this task is to populate a reference knowledge base, in this case a Wikipedia infobox 2008 snapshot [6], for which fillers to a pre-defined list of relations for 100 selected named entities are missing.In 2014, the top performing system, submitted by Stanford, achieved a recall of 0.277, a precision of 0.546 and a F1 score of 0.368 [5,7].These results are somewhat similar to those submitted the previous year by the best performing system, Relation Factory, which achieved a recall of 0.332, precision of 0.425 and F1 score of 0.373 [4,8].

Example of a Slot Filling System's Structure
For most systems, the slot filling task is done in two main steps: candidate generation and candidate validation [8].The candidate generation stage can be further decomposed into the following steps: (i) query expansion, (ii) document retrieval and (iii) candidate matching.Query expansion is performed by finding alternate reference forms that designate the named entity contained in the query.For example, the entity Barack Obama could be referenced as President Obama, Barack Hussein Obama or simply Obama.These alternative forms constitute the named entity expansion.The system then retrieves documents containing a reference to the query entity.The retained documents are annotated with a name entity recognizer that associates a named-entity tag (or the tag OTHER if the word is not a named entity) to each word.In the candidate matching stage, candidates, e.g., sentences containing a reference to the query entity or to some item of its expansion, along with a named-entity tag of a potential slot filler type for a given relation, are retrieved [8].For example, the sentence Barack Obama was born in Honolulu, Hawaii in 1961.would be tagged by the named entity tagger as follows: "Barack/PERSON Obama/PERSON was/OTHER born/OTHER in/OTHER Honolulu/GPE-CITY, Hawaii/GPE-STATE in/OTHER 1961/DATE."This sentence would be a candidate for the relation city of birth and the query entity Barack Obama since it contains a named-entity tag of type GPE-CITY.For each relation, the candidate sentences are passed to the validation stage.This system-dependent step is usually based on distant supervision classifiers and/or hand-crafted patterns, among other approaches [9].

Research Objective
A system's recall can be increased significantly by configuring a more lenient candidate generation step or candidate validation step.However, its precision is then expected to drop.This is a well-known issue in information retrieval where there is a trade-off between precision and recall.Our research describes a filtering module that could be applied to the output of relation extractors in order to increase their precision.It is driven by the following research questions: RQ1: What are the most important features for the filtering step?RQ2: Is there a generic method to increase the precision of slot filling systems without significantly degrading their recall (overall performance across several relations)?RQ3: Are some relations more sensitive to our filter?

Outline and Organization
The paper is organized as follows.In Section 2, we present a brief overview of related work in the field of relation extraction and slot filling.In Section 3, we explain our approach, describing the dataset and the set of features used to train our classifiers.We compare the evaluation of systems' for the TAC ESF track before and after filtering, measuring the impact of the feature sets in Section 4. In Section 5, we discuss our motivation to focus on systems' precision along with other aspects of our work.We also give a brief conclusion and discuss future work.

Related Work
In Section 2.1, we give a brief overview of the most common and most successful approaches for relation extraction.In Section 2.2, we focus on relation extractors' enhancement and present different methods to increase a relation extractor's performance.

Relation Extraction and Slot Filling
Relation extraction being a prevailing concern, multiple research groups have worked on the task lately.Relation extraction and slot filling are complex tasks that require rich modular automatic systems.Typical methods such as supervised relation extraction encounter certain problems.Primarily, there are very few labeled data available for training such systems since it is expensive to produce.In addition, there is generally a text or domain bias considering relations are labeled on a particular corpus [10].Alternatively, unsupervised relation extraction approaches use very large amounts of data and can extract a wide scope of relations.However, those relations are not necessarily easy to map to a knowledge base [11].Therefore, modern systems (e.g., [7,8,12]) use different techniques such as pattern bootstrapping and distant supervision, which do not require labeled datasets but extract pre-defined relations.
Bootstrap learning uses a small number of seed instances, which are entities known to fill a given relation [13,14].The seed pairs are used to extract patterns from a large corpus, which are then used to extract more seed entity pairs to generate additional seed patterns in an iterative manner.Depending on the pattern types and other factors, the resulting patterns often suffer from low precision because of semantic drift [9].Some approaches rely on dependency parsing patterns.For example, Ref. [15] uses bootstrapping to generate syntactic dependencies patterns.The authors train classifiers by parsing the dependency tree and selecting all paths containing less than three vertices between the query and the candidate filler.Their system achieved a recall of 0.223, a precision of 0.199 and a F1 of 0.21 for the 2014 TAC ESF task for which the median F1 was 0.198.
Most systems throughout different evaluation campaigns use distant supervision to acquire data to train their classifiers.This approach, briefly described in the previous section, seems to work better than bootstrapping, since most systems that use distant supervision scored higher in the ESF task in 2013 and 2014 [5].For example, the authors in [8] use distant supervision support vector machine classifiers for candidate validation.Distant supervision argument pairs are obtained by mapping Freebase relations to TAC relations and by matching hand-crafted seed patterns against the TAC 2009 text collections [16].Another work [8] uses a maximum of 10 k argument pairs per relation for each of the two sets of seed pairs that are then matched against the TAC 2009 text corpora.A maximum of 500 sentences per pair are used as training data.Though the quality of generated examples is inferior to the quality of examples generated using bootstrapping techniques, the number of examples is much greater.In order to increase the performances of classifiers trained on noisy false positive sentences, the authors use aggregate training.Instead of treating each sentence as a single training example, they group all sentences per entity pair, extract the features, sum the feature counts of all the sentences and normalize the feature vector for that pair so that the highest feature has weight 1.0.While increasing training speed, this method reduces the influence of features that are highly correlated with a single distant supervision pair.Their system, Relation Factory, scored the highest in the 2013 task.

System Enhancement
Rather than developing whole systems, it is also common for research groups to look to enhance existing relation extractors since there is a great margin of potential improvement.Ref. [17] developed a Maximum Entropy based supervised re-ranking model to re-rank candidate answers for the same slot.The model is trained on labeled query-answer pairs to predict the confidence of each answer's correctness.The authors used features based on confidence assigned by OpenEphyra, a web module of an open domain question-answer system [18], co-occurrence in the document collection, named-entity tags and slot type.When tested on the 2009 TAC ESF task, this approach achieved a 3.1% higher precision and a 2.6% higher recall compared to the best performing system.Ref. [19] worked on a semantic rule filter.It is an unsupervised, scalable learning method for automatically building relation-specific lexical semantic graphs representing the semantics of the considered relation.The authors compare approaches using WordNet and BabelNet.The approach using BabelNet yields a considerable precision gain with only a slight loss of recall.This approach worked best on relations of type acquisition and type birth/death and struggled for relations of type place of residence, which have a larger lexical diversity.The approach was tested on only seven relations.Ref. [20] worked on an active learning module that combines supervised learning with distant supervision.By providing only a small sample of annotated data, they have increased the F1 of Stanford's 2013 slot filling system on the 2013 ESF task by 3.9.
All of these enhancement methods are incorporated within the pipeline of an individual system and aim, in most cases, at a particular system or work best for particular relations.In contrast, this paper describes a generic approach that can be easily integrated to any relation extractor and which can be trained on any relation, regardless of the domain.An alternative to single system enhancement is ensemble learning.In an incentive to encourage the development of ensemble learning techniques, TAC also proposes the Slot Filler Validation (SFV) task, which consists in refining the output of ESF systems [21].Ref. [22] has developed a system participating in the TAC 2013 SFV task that uses an aggregation approach.For a given query, it aggregates "raw" confidence scores produced by individual slot fillers to produce an aggregated confidence score.This value is obtained based on the number of occurrences of a response and the confidence score for each occurrence.A variant of their method assigns a weight to each occurrence based on the "general confidence" of the system that submitted the response.Their approach achieves great performances for single-value slots (Precision: 0.797, Recall: 0.794, F1: 0.795) but does not perform as well for list-value slots.Similarly, Ref. [23] has developed a system using stacking that trains a classifier to optimally combine the results of multiple systems.Their system combines the output of various systems in the TAC ESF task using a "meta-classifier" and using the systems' confidence score.The authors also derive features from document provenance and offset overlap, i.e., the proportion of characters that are common between justifications of every response submitted for the slot.For a given query/relation pair (slot), if N systems provide answers and n of those systems provide the same docid as provenance, then the provenance score associated with this docid is equal to n/N.Similarly, they use the Jaccard coefficient to measure the offset reliability of a response relatively to other responses for the same slot.Similarly to Relation Factory [8], they use an expansion method based on Wikipedia anchor text applied on the filler, in order to detect redundant responses, thus increasing precision.This method allows them to achieve an F1 of 50.1% on the 2014 SF submissions, which is greater than the best performing system (39.5%).
In general, ensemble methods achieve better performances compared to individual systems.However, they require accessing multiple relation extractors.This paper describes a more light-weight approach and focuses on increasing precision while limiting loss of recall.

Proposed Work
We start by introducing our system architecture in Section 3.1.We then provide background on the TAC KBP 2013 ESF and explain the preprocessing on our dataset Section 3.2.In Section 3.3, we present the features used for filtering.Finally, we present our classification method in Section 3.4.

System Architecture
Our approach consists of using a machine learning filter trained on the combined outputs of the TAC ESF participating systems.For each relation, a binary classifier filter is appended to the end of a relation extractor, from where it eliminates responses, which it classifies as wrong.Figure 1 shows the general data flow: justification sentences from relation extractors are preprocessed and then features are extracted from these sentences for classification.This approach aims at improving the precision of any relation extractor.An increase in recall is not possible because there are not any additional responses generated by our filter.The training data is aggregated and preprocessed in the following manner.The outputs from the relation extractors, which contain a justification sentence for each response, are passed to the named-entity recognizer (NER), which gives the NE tags for each word in the justification sentences (OTHER if the word is not a named entity).The outputs from the relation extractors are also passed to the part-of-speech (POS) tagger and syntactic analysis parser, which return the POS tags as well as the syntactic dependencies tree (details in Section 3.2.3).Most frequent patterns are extracted from the raw justification sentence as well as the POS/syntactic analysis output.Sentence-level and syntactic-dependency-tree-level frequent patterns and the preprocessed responses are used for feature extraction (details in Section 3.3).The resulting feature vectors are then passed to the binary classifier for training.

Background
As mentioned previously, the dataset is composed of responses from relation extractors to queries related to 41 pre-defined relations.Table 1 shows a complete list of the relations for the TAC KBP 2013 ESF task.There are 25 relations expected to be filled for requests of entity type person and 16 relations for requests of entity type organization [24].Responses are assessed as either correct, if they satisfy the assessment validity criterion, and wrong otherwise.The table also shows the distribution of the complete training dataset, comprising the preprocessed responses of each relation extractor's best run after the dataset is cleaned (details in Section 3.2.2).The amount of training data varies widely across relations, primarily because some relations expect multiple values, whereas others only expect a single value as filler, and secondly because relation extractors have more ease in generating and retaining candidates for certain relations compared to others.Another important fact to consider is the class imbalance towards the wrong class in most relations.The relation extractors were provided a list of 100 queries that consisted of a named entity and a list of relations to fill [4].Figure 2 shows an example of a query.The XML query contains the query ID, the named entity, the ID of a document in which an instance of the entity can be found, and the beginning and end character offset of this instance in the document.The query also contains the entity type (organization or person) indicating which relations have to be extracted.If the entity is already in the knowledge base, it also contains a node ID and a list of slots that must be ignored (because they are already available in the knowledge base).In response to the input queries, systems' submissions were required to provide the following information in each individual answer: the query ID, the ID of the document that contains the answer, the name, string or value that is related to the relation specified in the query (we will call this the slot filler), the offset of the query entity in the document, the offset of the slot filler, the beginning/end character offset of the justification text, and, finally, a confidence score [4].For a given query, slots are expected to be filled for all relations associated with the entity type (except for relations to be ignored).Some relations accept multiple answers, e.g., alternate_names, children, title, etc. Name slots are required to be filled with the name of a person, organization or geo-political entity; value slots are required to be filled by either a numerical value or a date; string slots are a "catch all" meaning that their fillers cannot be classified as names or values [24].All answers come from the Linguistic Data Consortium (LDC) Corpus entitled "the TAC 2013 KBP Source Corpus" (catalog ID: LDC2013E45) composed of 2.1 million documents: 1 million newswire documents from Gigaword, 1 million web documents and 100,000 documents from web discussion forums [4].Systems were also required to submit between one and five system runs for which the system configuration could vary.Each run is annotated by the LDC assessors who classify each response as correct if the slot filler and its justification are accurate and wrong otherwise.For our experiment, we used the data from the evaluation queries for the TAC KBP 2013 Slot Filler Validation (SFV) track, which are composed of the annotated ESF submissions.The complete dataset utilized for our experiment is the pool of annotated data made of each participating systems best run, i.e., the run that achieved the highest F1 score.

Dataset Cleaning and Partition
As mentioned, the TAC KBP English Slot Filling task only requires systems, for every output, to provide the query entity, the slot filler and justification, the ID of the document where they are found and the beginning/end character offset of the filler, the query and the justification in this document.The document ID and offsets constitute the data that is used to train and test our classifiers and are in some cases noisy, since some relation extractors provide erroneous offsets.For example, although some source documents are in HTML, some systems do not consider HTML tags in their computation of offsets.Therefore, the justification cannot be properly recovered.To ensure the cleanliness of the data, such outputs are removed from the training set.Responses containing a justification with a character length greater than 600 are removed from the dataset as well to ensure a smooth usage of the Stanford Parser.We removed HTML tags from responses in the query entity, the slot filler, or the justification.We also removed periods, question marks and exclamation marks from responses that contained such punctuations at the beginning of their justification.For a given system, we hold out its own output from the training set and use it as test data in order to be able to compare the system's initial performance to its performance after filtering using the official evaluation script.This means no actual data coming from the system is used in the training set.The evaluation script calculates the global recall, precision and F1 by considering only the slots for which the system submitted a response or nil, which indicates the system is confident no response can be found.We have also experimented a segmentation where 75% of slots were used for training and 25% of slots were held for the test dataset.This enables us to include a portion of the tested system's output in the training data; however, we can only test the filter's performance on 25% of the system's output.We have not noticed a significant increase nor decrease in performance for the tested slots compared to the previous segmentation.We therefore make the assumption that there is no significant variability between outputs of different systems affecting the filter's performance and proceed using the former train/test segmentation to be able to compare the systems' performances after filtering to their original performance.

Linguistic Processing of Justifications
The various features require the justification to be annotated with part-of-speech tags as well as named-entity tags.The syntactic dependencies tree is also necessary for feature extraction.Systems are allowed to submit up to two sets of character offsets within the specified document to designate query entity, slot filler and justification.Therefore, since most features depend on the query entity and the slot filler's position in the justification, we must ensure a mapping between these positions and the justification sentence.This process is necessary, especially in cases where there are multiple references to either the query entity or the slot filler in the justification or if either entity is designated by a pronoun.We used the Stanford Parser to obtain the part-of-speech annotated justification as well as the syntactic dependencies [25].The named-entity annotations are obtained by applying sequor, a perceptron named-entity tagger [26].Figure 3 shows an example of a justification sentence for which the named-entity tags, the part-of-speech tags and the syntactic dependency tree have been extracted.

Down-Sampling and Selective Filtering
In order to avoid overfitting, we do not filter relations for which the number of instances is inferior to 15 for either class (correct/wrong).Due to the lack of available examples, the following relations were omitted from the filtering process for all systems: org:date_dissolved, org:member_of, org:political_religious_affiliation, per:county_of_birth, per:religion.Data imbalance have been shown to cause classifiers to favor the majority class [27].For most relations, there are more wrong instances than correct instances.Therefore, for each relation where the number of wrong instances is greater than two times the number of correct instances, we down-sample the subset of wrong instances to five subsets randomly sampled containing the same number of instances as the correct class subset.The usage of these five subsets will be further detailed in Section 3.4.The training dataset was rebalanced for the following relations for most systems: org:founded_by, org:members, org:parents, org:subsidairies, per:charges, per:cities_of_residence, per:countries_of_residence, per:employee_or_member_of, per:origin, per:other_family, per:parents, per:siblings, per:statesorprovinces_of_residence.Other relations were either rebalanced or eliminated for a few systems.

Features
The main goal of our approach is to experiment with various sets of features and ultimately propose a relation-independent selection of features.In this section, we present a set of generic features used in our experiment.
Our approach is based on a wide selection of generic features ranging from statistical, named-entity, lexical to syntactic features.These features are listed in Table 2.

Statistical Features
Our filter utilizes generic features such as sentence length (C1), the number of words of both the query entity and the slot filler (two features) (C2), and their order of appearance in the justification sentence, either query or filler first (C3).Another set of features is based on the segmentation of the sentence according to the position of the query entity and the slot filler, resulting for most sentences in three segments (before, between and after the query entity and slot filler).For each sentence segment, we capture the number of tokens (including words and punctuation) (C4).Since the training data is directly derived from relation extractors that participated in the TAC KBP English Slot Filling task, and given that the task requires every submitted response to be accompanied by a confidence score, we exploit this score as a feature (C5).

Named Entity Features
This subset of features captures the location of named entities in the justification relative to the query and the filler for the following types: person (N1), geo-political entity (N2) and organization (N3) [28].Similarly to feature C4, this set of features is based on the segmentation of the sentence according to the position of the query entity and the slot filler.For each sentence segment, we capture the number of named entities of each of the specified types, resulting in three features for each entity type.For example, as shown in Figure 3, given the relation city of birth, in the following justification, "Nicole Kidman and Barack Obama were both born in Honolulu, Hawaii.",where the query entity is Barack Obama and the slot filler is Honolulu, the sentence would be segmented in the following way: (1) Nicole Kidman and (2) were both born in (3), Hawaii.The first segment contains one entity of type person (Nicole Kidman), the second segment does not contain any named entity and the final segment contains one named entity of type geo-political entity (Hawaii).

Lexical (POS) Features
Our filter also relies on a subset of features that captures the ratio of part-of-speech tags in each sentence segment and in the whole sentence.The part-of-speech tags are generalized into four categories: nouns, verbs, adjectives and other (L1) [28].If we take Figure 3, for example, the distribution of POS tags for the sentence segment before the query entity ("Nicole/noun Kidman/noun and/other") is the following: nouns: 2/3, verbs: 0/3, adjectives: 0/3, other: 1/3.We also extracted the most frequent subsets of part-of-speech tags between both entities at the sentence level in the training dataset (L2).This is done using the Apriori algorithm after separating the instances of each relation by class (correct and wrong) [29].We used a support of 0.15, meaning that the subset must be present in at least 15% of data related to the wanted class.The Apriori algorithm allows for retaining the desired subsets with reduced computational resources, by first identifying individual items, in this case a single POS tag, for which the support is greater than the minimum support and extending them to larger item sets by retaining only the subsets that have a support greater than the minimum support, 15% in this case.In order to ensure the specificity of the previously withheld subset, a subset is only retained if its support on the opposite class is equal to or less than 0.5 of its support on the wanted class (we will call this value specificity), which gives us a minimum specificity of 2. These parameters were determined experimentally.In order to maintain a reasonable training execution time, we have limited the number of retained subsets to 100.As long as the number of retained subsets is greater than the limit, we rerun the Apriori algorithm with a 2% support increment and a 10% specificity increment.The number of retained subsets ranges from 1 to 100 in the 41 relations.We derive boolean features that indicate the presence or absence of each subset between the query and the filler at the sentence level.A similar method is used to extract the most frequent word subsets contained in the whole sentence (bag of words) (L3).Stop words, named-entities and punctuation are excluded from potential subsets.Figure 3 shows an example of a justification sentence for which the query is Barack Obama and the filler is Honolulu.In this case, the word set is the following : {born}.The other words between the query and the filler are either named entities or stop-words and are therefore not retained.Since word subsets are more specific to responses compared to POS subsets, we used Apriori with more lenient parameters to retain the most frequent subsets in the training data: support 10%, specificity 1.5, limit of retained subsets 200, support increment 1% and specificity increment 0. Similarly, we extract POS tag bigram subsets and word bigram subsets between the query and the filler and retain the most frequent subsets using the Apriori algorithm once again (L4 and L5).We also add two additional tokens representing respectively the query and the filler to the bigram subset.In order to retain a sufficient amount of subsets, we use the following settings: support 5%, specificity 1.5, limit of retained subsets 200, support increment 1% and specificity increment 0. The word bigram subset is the following for the example shown in Figure 3: {QUERY_were, were_both, both_born, born_in, in_FILLER}.The POS bigram subset is constituted of the corresponding POS tags: {QUERY_VBD, VBD_DT, DT_VBN, VBN_IN, IN_FILLER}.

Syntactic features
These features are based on the syntactic dependency tree obtained by parsing the justification sentence.The distance between both entities, i.e., the query and the filler (S1), is the number of links that separate both entities in the syntactic dependency tree.The entity level difference (S2) is the difference between the level (depth) of each entity in the tree.We use a feature that indicates if one entity is the ancestor of the other (S3).We also focus on frequent subsets between the entities in the syntactic dependency tree.The subsets retained are comprised of syntactic dependencies of words present in the subtree between both entities (path in the tree between both entities) (S4).In cases where either entity is composed of more than one word, we consider only the word with the lowest level to represent the query or the filler.For example, in Figure 3, the query has two words, Barack and Obama.Since the level of Barack is 3 and the level of Obama is 2 (root level = 0), we consider only Obama as the query for the subset extraction.This is the case for every syntactic feature.We, once again, use the Apriori algorithm, with the following parameters: support 15%, specificity 2, limit of retained subsets 100, support increment 1%, specificity increment 10%, to obtain the most frequent subsets.The syntactic dependencies subset for the example in Figure 3 is: {conj, nsubjpass, prep, pobj, nn}.Another set of features derived from most frequent subsets captures the syntactic dependencies, direction and part-of-speech tag of each node in the subtree common to both entities (S5).The direction is obtained by following the path starting from the query and ending at the filler.The Apriori algorithm is used with the same parameters as the previous set of features.Each token is comprised of the syntactic dependency, the direction (↑: the next token in the path is closer to the root, ↓: the next token in the path is further from the root, -: the token is at root level) and the POS tag.The subset for the example in Figure 3 is: {(conj,↑,NNP), (nsubjpass,↑,NNP), (root,-,VBN), (prep,↓,IN), (pobj,↓,NNP), (nn,↓,NNP)}.In the same way, we extract the syntactic dependencies bigrams as well as the syntactic dependencies, direction and part-of-speech tuple bigrams within the path between the query and the filler in the syntactic dependency tree (S6 & S7).The syntactic dependency bigram subset for the example in Figure 3 is: conj_nsubjpass, nsubjpass_prep, prep_pobj, pobj_nn.The syntactic dependencies, direction and part-of-speech tuple bigrams are the following: {(conj,↑, NNP)_(nsubjpass,↑,NNP), (nsubjpass,↑,NNP)_(root,-,VBN), (root,-,VBN)_(prep,↓,IN), (prep,↓,IN)_(pobj,↓,NNP), (pobj,↓,NNP)_(nn,↓,NNP)}.We use the Apriori algorithm to retain the most frequent subsets with the same configuration as L4 and L5 feature subsets.

Classifiers
For each relation, we trained a series of classifiers from which we selected the best classifier to apply on test data, i.e., used for filtering the relation extractors' output.Models are selected from one of the following base classifiers: RandomForest [30,31], Sequential Minimal Optimization (SMO) [32][33][34][35], NBTree [36,37], DecisionTable [38,39], J48 [40] and K* [41,42].The evaluation is done on the training set using a 10 fold cross-validation.We retain the classifier that has the greatest F1 measure on the correct class.We selected this metric amongst others for the sole purpose that we want to limit the number of false negatives produced by the classifier.As mentioned in the previous section, the aim of the filter is to increase the relation extractors' global precision while limiting loss on recall.A significant amount of false negatives produced by the filter would result in a great loss of recall, which would be detrimental to the relation extractors' F1.For relations that have been rebalanced, we train these six classifiers for the five randomly sampled subsets.We retain the classifier/subset pair that has the greatest F1 measure on the correct class.We performed the experiments using Weka's implementation of the classifiers with the default configurations and hyper-parameters [43].

Experiments
We performed experiments for each participating system in the TAC KBP Slot Filling task.We present results for all systems, and then we focus on Relation Factory, the best performing system [8].

Experiment Overview
From the 18 available systems, we excluded four systems because the offsets provided did not allow us to extract the queries, fillers and justifications.In fact, the following systems' outputs are corrupted by a character offset error and were omitted in order to evaluate the filter's performances correctly: SINDI, IIRG, CohenCMU and Compreno.The experiment is executed for each of the remaining 14 participating systems by holding out its own output from the training data.The following description indicates the procedure for one system.Once the preprocessed training data is separated by relation, the features are extracted from each response.In some cases, some features may not be extracted if they depend on the position of the query entity and the slot filler in the justification sentence (e.g., the distance between entities in the syntactic tree, the ratio of part-of-speech tags of each type between the entities at the sentence level, etc. ).In general, when this happens, it is due to the fact that one of the entities (query, slot filler) cannot be found in the justification, or because both entities cannot be found in the same sentence.The filter only processes responses in which a reference to both entities is present in the same sentence.When both entities are not found in the same sentence, most features cannot be extracted.Therefore, the response is removed from the training set.The procedure for the test data is quite similar: responses for which entities are not found in the same sentence are simply not passed through the filter and are kept by default.Cases for which the responses are rejected correspond to approximately 15% of the instances and are unevenly distributed amongst relations.For relations such as org:website, systems do not typically return complete sentences as justifications, but only return short textual segments generally containing only the filler.Specific relations such as per:alternate_names and org:alternates_names, for which justifications are not required to obtain a correct assessment are the main cause of examples being rejected from the training set.
The filtered response is evaluated using the TAC KBP 2013 English Slot Filling evaluation script version 1.4.1 (http://www.nist.gov/tac/2013/KBP/SentimentSF/tools/SFScore.java) with the strict configuration, for which the response is assessed as correct only if the answer is provided with a correct justification.As mentioned previously, from the 18 systems, we excluded four systems because their output was not processable.Our filter enables an increase in precision for every relation extractor that participated in the TAC KBP 2013 English Slot Filling task compared to its original performance.We have attempted multiple feature configurations to train our classifiers.We have achieved the greatest precision increase using a combination of statistical and lexical/POS features using bigrams as a base for the extraction of most frequent subsets (subsets composed respectively of POS and word bigrams).Table 3 shows the evaluation for participating systems before and after filtering using this feature configuration.The filter results in an average increase in precision of 11.2% and an average decrease in recall of 4.4% for the 14 systems.It allows an increase in precision for every system and an F1 increase for three systems including Relation Factory, the best performing system.In the case of Relation Factory, the precision increases from 42.5% to 63% with a 20.5% increase.The F1 is increased from 37.3% to 38.4%.

Evaluation by Relation
Table 4 shows the filter's accuracy and F1 for both correct and wrong classes evaluated by cross-validation using 10 folds on the training data.The training data contains the output of every participating system from which Relation Factory's output was held out.Table 5 details the performances of Relation Factory, the best system of the 2013 campaign, for each relation before and after filtering using its best feature set configuration (detailed in Section 4.3) individually for every relation.From the 33 relations out of 41 for which there was a trained classifier, the filter increases the precision as well as the F1 for 20 relations.There is an increase in precision but a decrease in F1 for nine relations.There was no change in precision or F1 for three relations.Finally, the precision along with the F1 are decreased for one relation.We separated relations based on slot quantity, slot content, number of training instances, initial recall and initial precision.Table 6 shows the difference in recall, precision and F1 after filtering Relation Factory's output for the different relation groups.The filter has a greater impact on list-value slots for which the precision is increased by 18.3% on average relative to 13.5% for single-value slots.List-value slots have a greater contribution on global performances since they represent 73.5% of the system's recall, whereas single slots represent 26.5% of the system's recall.The filter increases the precision of name and string slots by respectively 17.3% and 19.3%, whereas it increases precision of value slots by only 9.8%.Table 6 also indicates that relations with a greater number of training instances achieve a greater increase in precision on average.Initial recall has a slight influence on precision increase.However, relations with a high initial recall achieve greater F1 increase.Relations with high initial precision (≥65) achieve less precision increase than relations with a lower initial precision.Relations for which initial precision is less than 40% achieve the greatest F1 increase (2.4%).Based on the results obtained by filtering Relation Factory's output, for the seven feature subsets that allow an increase in F1 for Relation Factory, we have tested our filter for the other systems as well.Table 8 shows the average performances for the 14 systems that were tested.The greatest average precision increase was obtained using the Statistical + Lexical/POS (bigrams) feature subset, which is the same feature subset providing the greatest precision and F1 increase for Relation Factory.However, we assume there is no statistical significance between the precision difference obtained using the different subsets.

Filtering All System Runs
As mentioned previously, relation extractors were required to submit between one to five runs.This provided the opportunity for the teams to test their different system configurations.The systems that submitted multiple runs generally have precision-oriented and recall-oriented configurations.Table 9 shows the evaluation before and after filtering for all system runs.Precision is increased for 43 out of 44 runs and F1 is increased for 11 out of 44 runs (25%).The filter has increased the F1 for at least one run for 5 out of 14 systems.When applied to Relation Factory's highest Recall run, the precision is increased by 20.9% and the F1 is increased by 3%.This post-filtering F1 of 39.4% is also a 2.1% increase relative to Relation Factory's highest F1 run.Table 10 shows the average results for respectively every system's F1-tuned, precision-tuned and recall-tuned run.Our filter consistently increases the original system precision by an average of at least 10% for any system configuration.However, the filter decreases the systems' F1 by an average of 2.4% for precision-tuned runs, 1.8% for F1-tuned runs and 1.3% for recall-tuned runs.We have also compared our results with those obtained by filtering Relation Factory's output using a confidence score threshold heuristic only (heuristic filter).These results are shown in Table 11.By filtering Relation Factory's output using only the confidence score, there is no confidence score threshold that achieves a higher precision or F1 than those obtained using our filter.This confirms that our filter achieves a much greater precision than tuning an individual system by increasing the threshold on the confidence score which also severely hinders the recall.To evaluate the statistical significance of our results (precision gain), we once again used the chi-square test [44].Column p-value (p/baseline) indicates the p-value of the precision increase using the heuristic filter compared to the baseline.We measured this value using the same process as described in Section 4.3.The precision increase is statistically significant for a confidence score threshold of 0.4 to 1.0.Column p-value (p/filter) indicates the p-value of the precision difference between our filter and the heuristic filter for different confidence score values.The precision difference is statistically significant for a confidence score threshold of 0.1 to 0.7.However, using a heuristic filter based on a confidence score threshold above 0.7 hinders the recall much more (>10 percentage points) than our filter.This recall difference is statistically significant as well.
As mentioned in previous sections, our approach aims mainly at increasing the relation extractor's precision.Our filter is appended to the end of the relation extractor's pipeline, thus allowing the filter to be tested and operated on any system.This approach, however, imposes an upper bound on recall, which is the recall of the system before filtering.As shown in Table 9, post-filtering performances are optimal for recall-oriented system configurations.In order to increase the relation extractor's recall, the filter would have to be directly inserted inside the pipeline after the candidate generation stage.Our modular approach gives us the flexibility to operate upstream or downstream of the candidate validation stage, whereas typical ensemble learning methods can only be applied downstream.This approach would require further specificity in order to adapt to each individual system.The system's original candidate validation module could be improved by taking into account our filter's confidence score for each response.An upstream integration could be a promising avenue to explore given that the filter performs best on high-recall outputs.In line with the same idea, since the relation extractor's recall can only decrease after filtering downstream, it is imperative to limit the loss of recall.The loss of recall is caused by the filter's rejecting correct instances provided by the relation extractor.This case is recognized as a False Negative.Therefore, the selection of a convenient classifier for filtering is based upon the metric that is correlated with the quantity of false negatives.A high recall on the correct class indicates that a low rate of false negatives are generated at training time by the filter (Recall = TP/(TP + FN)).However, we also want to limit the number of False Positives, which in this case is a wrong response retained by the filter.A high precision on the correct class indicates a low rate of false positives (Precision = TP/(TP + FP)).Therefore, we use the F1, which is a harmonic mean of both recall and precision, to determine the base classifier.Classifiers that exhibit a high F1 on the correct class at training time and are trained on a sufficient amount of training data usually perform well on the test data.To measure the impact of our algorithm selection, we tested our filter consistently using alternately each algorithm for all relations.Results of performances on Relation Factory's output are shown in Table 12.From the six machine learning algorithms, Random Forest provides the best performances with a 62.2% precision and a 38% F1 compared to a 63% precision and a 38.4% F1 when using our proposed algorithm selection method (Statistical and Lexical/POS (bigrams) features).For applications using limited resources, a filter using Random Forest as the default classifier could be considered without expecting a significant decrease in performances.

Discussion
One of the main objectives of this research was to determine the most important features for the filtering step (RQ1).Our results show that the greatest performances were achieved using a combination of statistical and POS/lexical features.Our results also show that the most frequent subsets are more informative when considering bigrams over unigrams.Bigram subsets implicitly capture the token sequence, whereas unigram subsets do not.Another objective of our research was to evaluate whether a method based on a generic set of features could lead to an increase in precision without a major setback in recall (RQ2).Since our approach imposes an upper bound on recall, which is the recall of the system before filtering, post-filtering performances are optimal for recall-oriented system configurations.Finally, we wanted to determine whether some relations were more sensitive to our filter (RQ3).The filter has a greater impact on list-value slots, which generally have a greater number of training instances.Initial recall has a slight influence on precision increase.However, relations with a high initial recall achieve greater F1 increase.The features used for training are mostly generic and could be utilized for any pre-defined relation.However, classification is more successful when there is a sufficient amount of training data that is balanced between the correct and wrong class.
An interesting path to explore in the future is feature selection.Currently, most features are generated by capturing most frequent patterns at different levels using the Apriori algorithm.The selection criteria allow us to retain patterns for which the number of occurrences in the subset of the specified class is above a certain threshold and for which the number of occurrences in the subset of the complementary class is below a certain threshold.However, for certain relations such as per:title for instance, there is a wide array of patterns that exist to represent the relation.Therefore, in order to ensure an optimal coverage, an option would be to be more lenient on Apriori selection parameters and instead use feature selection algorithms such as information gain [45] to empirically discover the most significant features for each relation.
The relation would have an even greater impact with more training data.The simplest way to increase the amount of training would be to append to the current training dataset the responses for the systems having participated in the TAC KBP English Slot Filling track in the subsequent years when they are available.However, the data would still be unevenly distributed across relations and the observable imbalance towards the wrong class would persist and we would lose data by down-sampling the dataset.We could use existing methods for up-sampling the minority class instead.One way to do so could be by using the Synthetic Minority Over-sampling Technique (SMOTE) method [46].This method consists in creating "synthetic" examples rather than over-sampling by duplicating existing examples.This is done by creating a feature vector between a randomly selected minority class example and its nearest neighbor and adding this vector multiplied by a random number between 0 and 1 to the original selected minority class example.Finally, making use of distant supervision to increase the amount of training data would be an avenue worth considering.

Conclusion
This paper presented a method to increase the precision of relation extractors.This generic method consists of appending a machine learning-based filter to the relation extractor's pipeline in order to increase its performance.The filter is based on a wide scope of features, including statistical, lexical/POS, named entity and syntactic features.We showed that precision could be increased for all the participating systems in the TAC KBP 2013 ESF track.Overall, the filter enabled a 20.5% precision increase and a 1.1% F1 increase for the best performing system from the 2013 slot filling track when applied to the F1-tuned run.The filter also led to a 14.9% precision increase and a 3% F1 increase for the best performing system when applied to the recall-tuned run.In fact, our filter performs best on high recall runs.Finally, the filter obtained a 2.1% F1 increase compared to the original F1-tuned run, the best global performance in the TAC KBP 2013 ESF track.

Figure 3 .
Figure 3. Example of NE tags, POS tags and syntactic dependency tree for a justification sentence.
Number of tokens in sentence C2 Answer/query length Number of tokens within answer and query references C3 Entity order Order of appearance of query and answer references C4 #tokens left/between/right Number of tokens left/right or between entities in the sentence C5 Confidence score Score given by the relation extractor [4] Named-entity features N1 #person left/between/right Number of person left/right or between entities in the sentence [28] N2 #gpe left/between/right Number of Geo-political entities left/right or between entities in the sentence [28] N3 #orgs left/between/right Number of organizations left/right or between entities in the sentence [28] Lexical (POS) features L1 POS fractions left/between /right/sentence Fraction of nouns, verbs, adjectives and others left/right/between answer and query references or in the whole sentence [9] L2 POS subsets Most frequent subsets of POS tags between query and answer references in the sentence.(boolean feature indicating the presence of the subset) L3 Word subsets Most frequent subsets of word (excluding stop-words and named entities) in the sentence.(boolean feature indicating the presence of the subset) L4 POS bigram subsets Most frequent subsets of POS tag bigrams (excluding stop-words) between query and answer references in the sentence.(boolean feature indicating the presence of the subset) L5 Word bigram subsets Most frequent subsets of word bigrams between query and answer references in the sentence.(boolean feature indicating the presence of the subset) Syntactic features S1 Distance between entities Distance between entities at the syntactic dependency tree level S2 Entity level difference Level difference within syntactic dependency tree between query and answer references S3 Ancestors One entity is ancestor of the other at the syntactic dependency tree level S4 Syntactic dependencies subsets Most frequent subsets of syntactic dependencies between query and answer references at the syntactic dependency tree level (boolean feature indicating the presence of the subset) S5 Multilevel subsets Most frequent subsets, where each token is composed of a POS tag, syntactic dependency and direction, between query and answer references at the syntactic dependency tree level (boolean feature indicating the presence of the subset) S6 Syntactic dependencies bigram subsets Most frequent subsets of syntactic dependencies bigram between query and answer references at the syntactic dependency tree level (boolean feature indicating the presence of the subset) S7 Multilevel bigram subsets Most frequent subsets, where each token is composed of a POS tag, syntactic dependency and direction bigram, between query and answer references at the syntactic dependency tree level (boolean feature indicating the presence of the subset)
Content: filler type (Name, Value or String), Quantity: number of fillers expected for the relation (Single or List), #Correct: number of correct responses, #Wrong: number of wrong responses, Total: number of responses.

Table 2 .
Full set of generic features used for filtering.

Table 3 .
Global Evaluation using the TAC KBP 2013 Slot Filling scorer on the test dataset for all systems before and after filtering.

Table 4 .
Classifier performances on train set for Relation Factory by cross-validation (10-folds).

Table 5 .
Evaluation for Relation Factory using the TAC KBP 2013 Slot Filling scorer on the test dataset before and after filtering for each relation.

Table 6 .
Evaluation variation using the TAC KBP 2013 Slot Filling scorer on the test dataset after filtering for different relation groups (Relation Factory).Precision post-filtering − Precision pre-filtering , ∆F1: F1 post-filtering − F1 pre-filtering , #Relations: number of relations.

Table 7 .
Results obtained by Relation Factory using the TAC KBP 2013 Slot Filling scorer on the test dataset after filtering when using different feature sets.

Table 8 .
Average evaluation obtained by the participating systems using the TAC KBP 2013 Slot Filling scorer on the test dataset after filtering when using different feature sets.

Table 10 .
Average global evaluation for all systems using the TAC KBP 2013 Slot Filling scorer on the test dataset before and after filtering for F1, precision and recall oriented runs.

Table 11 .
Results obtained by Relation Factory using the TAC KBP 2013 Slot Filling scorer on the test dataset after filtering when using a confidence score threshold heuristic (Precision increase is significant for p-value < 0.1.Statistical significance is indicated with * when comparing our filter with the heuristic filter and with † when comparing the heuristic filter with the baseline.)

Table 12 .
Performance using the TAC KBP 2013 Slot Filling scorer on the test dataset when using alternatively different algorithms to filter responses for every relation (Relation Factory).