Learning from Knowledge Graphs: Neural Fine-Grained Entity Typing with Copy-Generation Networks

Fine-grained entity typing (FET) aims to identify the semantic type of an entity mentioned in plain text, which is an important task for downstream natural language processing applications. However, most existing methods neglect the rich typing information already known about these entities in knowledge graphs. To address this issue, we take advantage of knowledge graphs to improve fine-grained entity typing through the use of a copy mechanism. Specifically, we propose a novel deep neural model called CopyFet for FET via a copy-generation mechanism. CopyFet integrates two operations: (i) the regular way of making type inference from the whole type set in the generation model; (ii) a new copy mechanism which identifies the semantic type of a mention with reference to the type-copying vocabulary from a knowledge graph in the copy model. Despite its simplicity, this mechanism proves to be powerful: extensive experiments show that CopyFet outperforms state-of-the-art FET methods on two benchmark datasets (FIGER (GOLD) and BBN). For example, CopyFet achieves new state-of-the-art scores of 76.4% and 83.6% in strict accuracy on FIGER (GOLD) and BBN, respectively.

In FET, knowledge graphs (KGs) usually play an important role. For example, given large-scale KGs, FET systems resort to distant supervision [10] to generate large training corpora [9,17,18] (i.e., entity mentions in the training corpus are labeled with all types associated with the corresponding entity in the KG). Although distant supervision with KGs eliminates the high cost of labeling training data, how to efficiently encode a KG's typing knowledge into a FET model is still underexplored.
In this paper, we concentrate on how to take advantage of KGs to improve FET in the process of type inference. In fact, the correct type information about a mention can usually be found in large-scale knowledge graphs and typing resources, such as Freebase [19], YAGO [20], DBpedia [21], OntoNotes [22], and Few-nerd [23], which have typically already recorded a large number of entity-typing facts together with various contexts from corpora. Consider the following example from Wikipedia: "In 2006, Obama released The Audacity of Hope that expanded upon the themes in their convention speech." The entity mention "The Audacity of Hope" can directly be predicted as a book by copying the type information of the known entity-typing fact ("The Audacity of Hope", /things/book) from the KG to the mention. For cases that require an understanding of the entity context, a copy mechanism is also quite beneficial. In the same example, where "Obama" is the entity mention, copying all the types of "Barack Obama" in the KG (i.e., /people/person, /book/author, /person/politician, etc.) as the priority typing candidates of the mention is still informative for inferring the correct type (i.e., /book/author) that best fits the context, since these candidates substantially narrow down the list of possible type labels (referred to as the type's copying vocabulary from here on). According to the statistics in Table 1, over 49% and 64% of all the manually annotated mentions' typing facts in FIGER (GOLD) [9] and BBN [24] (testing data), respectively, are already included in existing KGs, which inspires us to improve FET by learning the rich known entity-typing information from KGs.

Table 1. The percentage of the entity mentions' typing facts that are already included in existing knowledge graphs. The statistics indicate that they exceed 49% and 64% in the two benchmark datasets FIGER (GOLD) and BBN (testing data), respectively. These observations are the key motivation for taking advantage of knowledge graphs to improve fine-grained entity typing through a copy mechanism in this paper.

To this end, we incorporate a copy mechanism into fine-grained entity typing and propose a novel deep neural model called COPYFET for FET via a copy-generation mechanism. A copy mechanism helps avoid unnecessary mistakes and improves accuracy in the type inference process. It is similar to the copy mechanism in keyphrase generation [25] and abstractive summarization [26,27], which allows a language generator to copy items directly from the source text in order to generate reliable results that keep salient information from the source.

Benchmark       FET typing facts covered by existing KGs
FIGER (GOLD)    over 49%
BBN             over 64%
Specifically, COPYFET consists of two inference submodels, as shown in Figure 1: (i) a generation model, which makes type inference in the regular way over the whole type set; and (ii) a copy model, which uses the new copy mechanism to identify the semantic type of a mention with reference to the type's copying vocabulary from KGs. The two submodels are combined to build the final type inference model. Since it is not known in advance which entity in the KG a given entity mention refers to, we perform entity linking to generate the type's copying vocabulary.
Extensive experimental results on two benchmark FET datasets demonstrate that the proposed COPYFET can effectively conduct fine-grained entity typing by incorporating a copy model with a generation model in both training and inference. For example, COPYFET achieves 76.4% and 83.6% absolute strict accuracy on the benchmark datasets FIGER (GOLD) and BBN, respectively.
The contributions of this paper are threefold: (1) We propose to take advantage of knowledge graphs to improve fine-grained entity typing through the use of a copy mechanism. (2) Specifically, we propose COPYFET, a simple but effective neural fine-grained entity-typing model that incorporates a copy mechanism into FET via a copy-generation framework. (3) We conduct empirical experiments on two benchmark datasets, which demonstrate the superiority of the proposed COPYFET over previous state-of-the-art models on the FET task.
The structure of the paper is as follows. In Section 2, we provide a brief review of related work. In Section 3, we describe the methodology of our model. In Section 4, we present the cross-entropy learning method. Section 5 presents the experimental results and their discussion. Finally, Section 6 gives the conclusion and future directions of this research.

Figure 1. Overview of COPYFET. The input feature is built from the mention representation (Equation (2)) and the context representation produced by the Bi-LSTM (Equation (3)). In COPYFET (right box), the green bar indicates the probability score calculated by the copy model (Section 3.1) and the generation model (Section 3.2). The copy model learns to predict from a much more delimited candidate space, i.e., the type's copying vocabulary, than the whole type vocabulary on which the generation model makes a prediction. The final type prediction agrees with both of them (Section 3.3).

Related Work
To make this paper self-contained, we briefly review related work on fine-grained entity typing and the copy mechanism.

Fine-Grained Entity Typing
The FET task was first introduced by [9,28]. Different from lexicon-level [29], discourse-level [30], and corpus-level [31,32] FET, most previous works consider sentence-level entity typing. Progress on FET has primarily been made in the following directions.
Neural Network Model. Different from early heuristic hand-crafted feature-based models [9,28,33] and embedding-based methods [1], neural models are expected to learn better latent representations for mention and context [8,34,35]. For instance, ref. [36] first used recurrent neural networks (RNNs) to recursively obtain a vector representation of each entity mention. Refs. [37,38] proposed to incorporate an attention mechanism with LSTM. Ref. [39] proposed a CNN-based FET model. Refs. [40,41] proposed attentive neural models that also encoded latent type representations besides mention and context. Refs. [8,41] found that the pretrained language model ELMo [42] performed better than BERT [43] as the input of a neural model. Our model builds upon this progress and takes advantage of state-of-the-art neural network architectures.
Incorporating Knowledge Graphs. Different from distant supervision that only takes advantage of KGs to build training data, a few researchers focused on incorporating KGs into FET models. Ref. [37] proposed to improve FET with knowledge attention which learns the relational information from KGs. Ref. [34] proposed to enrich the mention features by adding a KG-type representation obtained from KGs. However, they did not directly utilize the entity-typing facts in KGs for improving FET.
Denoise. Most typical FET datasets, such as FIGER (GOLD) [9], BBN [24], and OntoNotes [18,28], are labeled with KGs by distant supervision [10], which inevitably brings noise into the training data. Several studies aimed to address this problem by heterogeneous partial label embedding [17,44,45], hierarchy-aware loss normalization [38], language model enhancement [46], filtering functions [47], compact latent space clustering [48], virtual adversarial learning [49], attentive graph convolution networks [50], and automatic relabeling [51]. The noise issue is also a concern for our model; however, it is not the main focus of this paper. We believe that our model could be further boosted by adding a denoising module, which we reserve for future work.
Others. There are several other research directions in FET. Recently, some researchers have focused on FET in KGs, also known as knowledge graph entity typing (KGET), a subtask of knowledge graph completion [52], either by using external data outside KGs [53,54] or only with the structural relational information in KGs [52,55]. Others have concentrated on encoding the hierarchical characteristics of fine-grained types in their models [7,44] and on zero-shot entity typing [56,57].

Copy Mechanism
The copy mechanism is widely used in various natural language generation tasks, such as sequence-to-sequence learning [58,59], keyphrase generation [25], abstractive summarization [27], and entity prediction [60]. Ref. [58] proposed pointer networks, which used attention as a pointer to select a member of the output sequence directly from the input; this can be seen as a copy model with an attention mechanism. However, it could not make predictions with external lexemes beyond the input sequence. Ref. [59] proposed COPYNET to solve this issue in a hybrid end-to-end way, incorporating the copy mechanism with a generation model that could yield external lexemes not appearing in the input sequence. Based on [59], ref. [25] proposed to incorporate a copy mechanism with a recurrent neural network (RNN)-based generation model for deep keyphrase generation. To enhance the copy mechanism, ref. [61] proposed SeqCopyNet, which can copy not only single words but also subsequences from the input text. Ref. [27] proposed a transformer model with a copy mechanism for abstractive summarization. Inspired by these works, we utilize the copy mechanism together with the typing characteristics of knowledge graphs. To the best of our knowledge, we are the first to attempt to incorporate the copy mechanism into neural fine-grained entity typing.

Methodology
In this section, we introduce the details of the proposed model COPYFET. We first give the notations and then introduce the feature encoder and the model architecture, which includes a copy model and a generation model.
Notations. Given an entity mention m and its context c in a sentence s, and a set of type tags T, our model aims to predict the probability of each type t ∈ T for this mention. We denote the words in the sentence as $w_1, w_2, \cdots, [m_1, m_2, \cdots, m_n], \cdots, w_L$, where $m_1, m_2, \cdots, m_n$ are the words of the entity mention, and L and n denote the number of words in the sentence s and in the mention m, respectively. Boldfaced m and c represent the embedding vectors of mention m and context c, respectively. Besides containing a large number of entity facts, a knowledge graph G also provides a large number of existing entity-typing instances, i.e., $\mathcal{G} = \{(e, \tilde{t}) \mid e \in \mathcal{E}, \tilde{t} \in \widetilde{\mathcal{T}}\}$, where $\mathcal{E}$ and $\widetilde{\mathcal{T}}$ represent the sets of entities and KG types, respectively. For each entity mention m and its corresponding entity e, we build a bounded subset of T that is specific to m (namely, the type's copying vocabulary for m), denoted T_m. It consists of all the types with which entity e is labeled in the known entity-typing facts of the KG.
Since the labels in the type tag set T and the KG type set $\widetilde{\mathcal{T}}$ may not be exactly the same, a type-mapping step is required. The copying vocabulary T_m is an N-dimensional multi-hot indicator vector, where N is the size of T: the entries of the types in the copying vocabulary are set to 1, while all others are set to 0. The fine-grained entity typing task is to infer the type probability distribution over the type tag space T given the mention m and context c, i.e., p(t|m, c) = ?.

Feature Encoder. We concatenated the entity mention representation m and its context representation c in the sentence as the feature vector x, as follows:

$\mathbf{x} = [\mathbf{m}; \mathbf{c}]$    (1)

Entity Mention Representation. The entity mention representation m was simply calculated by averaging the embeddings of all words in the entity mention $[m_1, m_2, \cdots, m_n]$ [37], as follows:

$\mathbf{m} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{m}_i$    (2)

where n represents the length of the entity mention and $\mathbf{m}_i$ is the embedding of word $m_i$.

Context Representation. We used a bidirectional LSTM (BiLSTM) to encode the context representation. We first replaced the mention with a special token, shown as "[M]" in Figure 1. Then, the word embeddings of the modified context words $\{w_1, w_2, \cdots, [M], \cdots, w_L\}$ were fed into two layers of BiLSTM, and the context representation c was the sum of the BiLSTM layers' outputs at the [M] position:

$\mathbf{c} = \mathbf{h}^1_m + \mathbf{h}^2_m$    (3)

where $\mathbf{h}^1_m$ and $\mathbf{h}^2_m$ are the outputs of the first and the second BiLSTM layer for [M], respectively.
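To make the feature encoder concrete, the following is a minimal PyTorch sketch under our own assumptions (embedding and hidden sizes, batching, and how the "[M]" position is indexed); it follows Equations (1)-(3) but is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class FeatureEncoder(nn.Module):
    """Encodes one mention/context pair into the feature vector x = [m; c]."""

    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Two stacked bidirectional LSTM layers over the modified context words,
        # in which the mention is replaced by the special "[M]" token.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, context_ids, mention_ids, m_pos):
        # Equation (2): average the embeddings of the mention words.
        m = self.embedding(mention_ids).mean(dim=1)            # (batch, embed_dim)

        # Equation (3): BiLSTM states at the "[M]" position. Note that nn.LSTM only
        # exposes the top layer's outputs, so this sketch uses those rather than the
        # sum of both layers' outputs described above.
        outputs, _ = self.bilstm(self.embedding(context_ids))  # (batch, L, 2*hidden_dim)
        c = outputs[torch.arange(outputs.size(0)), m_pos]      # (batch, 2*hidden_dim)

        # Equation (1): concatenate mention and context representations.
        return torch.cat([m, c], dim=-1)
```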

Copy Model
The copy model was designed to identify types from the mention's copying vocabulary, which stems from the known entity-typing instances in existing KGs.
We first processed the training dataset to build the type's copying vocabulary T_m for each mention, which involved three steps: (1) Given a mention m, we utilized a simple entity-linking (EL) algorithm (similar to [34]) to retrieve its corresponding entity in the KG. Specifically, we directly linked the mention to the entity with the largest commonness score [62], which indicates the probability of an entity given the entity mention and is calculated from the anchor links in Wikipedia. (2) If the EL algorithm returned an entity, we obtained the types of this entity from the KG. (3) Since the types in the KG may differ from the target type set T, we mapped them to the type tags in T following the rules used in [56]. T_m was an N-dimensional multi-hot vector in which the entries of the types in the copying vocabulary were set to 1, while all others were set to 0. Note that if the EL result was NULL, we simply set T_m to a zero vector.
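As an illustration of these three steps, a copying vocabulary could be assembled as in the sketch below; `link_entity`, `kg_types`, and `kg_to_target` are hypothetical stand-ins for the commonness-based entity linker, the KG type lookup, and the type-mapping rules of [56].

```python
import torch


def build_copy_vocabulary(mention: str, type2id: dict, kg_to_target: dict,
                          link_entity, kg_types) -> torch.Tensor:
    """Return the N-dimensional multi-hot vector T_m for one mention."""
    t_m = torch.zeros(len(type2id))
    entity = link_entity(mention)           # step (1): highest-commonness entity, or None
    if entity is None:
        return t_m                          # EL returned NULL -> zero vector
    for kg_type in kg_types(entity):        # step (2): types of the entity in the KG
        target = kg_to_target.get(kg_type)  # step (3): map the KG type into T
        if target is not None:
            t_m[type2id[target]] = 1.0
    return t_m
```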
If the mention m has a type's copying vocabulary T_m, COPYFET increases the probability calculated for the candidate types that come from the copying vocabulary. Specifically, the copy model first builds an indicator vector v_T with a multilayer perceptron (MLP):

$\mathbf{v}_T = f_c(\mathbf{x})$

where f_c is a three-layer MLP with ReLU activation and v_T is an N-dimensional indicator vector, N being the size of the type tag vocabulary T. To decrease the probability of types that do not belong to the type's copying vocabulary (i.e., types the copy model is not interested in), COPYFET applies an element-wise multiplication between the indicator vector v_T and the multi-hot vector T_m. Formally, the type prediction distribution of the copy model is defined as follows:

$\mathbf{p}_c = \mathbf{v}_T \odot T_m$

where the entry p_c(t|m, c) stands for the prediction probability of type t over the type's copying vocabulary.
The maximum entry of p_c(t|m, c) indicates the type to be copied from the type's copying vocabulary. The basic idea behind the copy model is that it is more beneficial to learn to predict from a small candidate set than from the whole type vocabulary. However, some entity-typing facts may be absent from the KG; an additional generation model is therefore needed to infer such typing facts.
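A minimal sketch of the copy model follows; the sigmoid that squashes the MLP output into (0, 1) is our assumption, since the normalization of v_T is not spelled out above.

```python
import torch
import torch.nn as nn


class CopyModel(nn.Module):
    """Scores only the types contained in the copying vocabulary T_m."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_types: int):
        super().__init__()
        # Three-layer MLP f_c with ReLU activations.
        self.f_c = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_types),
        )

    def forward(self, x: torch.Tensor, t_m: torch.Tensor) -> torch.Tensor:
        v = torch.sigmoid(self.f_c(x))  # indicator vector v_T over all N types
        # Element-wise multiplication with T_m zeroes out every type outside the
        # copying vocabulary, giving p_c(t | m, c).
        return v * t_m
```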

Generation Model
With the same mention m and context c, the generation model concentrates on type inference by selecting the type from the whole type set T. The generation model treats the typing fact as a new one, without any reference to the knowledge graph. Similar to the copy model, the generation model also builds a query vector p_g over the whole type vocabulary:

$\mathbf{p}_g = f_g(\mathbf{x})$

where f_g is a three-layer MLP with ReLU activation. Similar to p_c(t|m, c) in the copy model, p_g(t|m, c) stands for the predicted probability distribution over the whole type set. The largest score in p_g(t|m, c) corresponds to the type inferred from the whole type set by the generation model.
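For completeness, here is a matching sketch of the generation model; the sigmoid is again our assumption, chosen for consistency with the copy model sketch above.

```python
import torch
import torch.nn as nn


class GenerationModel(nn.Module):
    """Scores every type in the whole type set T, without consulting the KG."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_types: int):
        super().__init__()
        # Three-layer MLP f_g with ReLU activations.
        self.f_g = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_types),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.f_g(x))  # p_g(t | m, c) over the whole type set
```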

Incorporating Copy Model with Generation Model for FET
To make a type distribution prediction for a query p(t|m, c) = ?, both the copy model and the generation model predict a type over their respective candidate type sets. As shown in Figure 1, COPYFET combines the predicted results from both models as follows:

$p(t \mid m, c) = \lambda\, p_c(t \mid m, c) + (1 - \lambda)\, p_g(t \mid m, c)$

where p_c(t|·) stands for the copy model, p_g(t|·) stands for the generation model, and λ is a hyperparameter trading off the copy model against the generation model. To cope with the overly specific issue, which usually biases the model towards popular subtypes instead of generic ones (i.e., preferring politician over person), we designed a recursive selection method. At each level, we chose the type with the maximum combined probability at that depth and then went one level deeper, repeating recursively until the maximum probability was smaller than a threshold. For the sake of simplicity, suppose the maximum depth of the type hierarchy is 2, without loss of generality. Formally, we denote $\mathcal{T}^1$ as the type set of the first level, and $\mathcal{T}^2_{t^1_i}$ as the children type set (second level) of a specific type $t^1_i \in \mathcal{T}^1$. The final output fine-grained entity type prediction $\hat{t}$ depends on a threshold β as follows:

$\hat{t} = \begin{cases} t^2_j, & \text{if } p(t^2_j \mid m, c) \geq \beta \\ t^1_i, & \text{otherwise} \end{cases}$

where $t^1_i$ indicates the type with the highest probability at the first level, and $t^2_j$ indicates its subtype with the highest probability in $\mathcal{T}^2_{t^1_i}$. Here, β ∈ (0, 1) is a hyperparameter acting as a threshold that controls how specific the hierarchical type prediction is: the higher β is, the more coarse-grained the final output type.
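One possible realization of the combination step and of the two-level recursive selection is sketched below; `top_level`, `children_of`, and `id2type` are our own illustrative bookkeeping structures, and `p_c`/`p_g` are the per-mention probability vectors produced by the two submodels.

```python
import torch


def predict_types(p_c: torch.Tensor, p_g: torch.Tensor, lam: float, beta: float,
                  top_level: list, children_of: dict, id2type: dict) -> list:
    # Combine the copy model and the generation model with the trade-off lambda.
    p = lam * p_c + (1.0 - lam) * p_g

    # Level 1: pick the most probable coarse-grained type.
    t1 = max(top_level, key=lambda i: p[i].item())
    prediction = [id2type[t1]]

    # Level 2: go deeper only if the best subtype clears the threshold beta.
    subtypes = children_of.get(t1, [])
    if subtypes:
        t2 = max(subtypes, key=lambda i: p[i].item())
        if p[t2].item() >= beta:
            prediction.append(id2type[t2])
    return prediction
```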

Cross-Entropy Loss Function for Optimization
Since the training set may contain multiple ground-truth types for one mention, the objective function was defined as the element-wise cross-entropy over all entity mentions:

$\mathcal{L} = -\sum_{(m,c)} \sum_{t \in \mathcal{T}} \left[ t^*_t \log p(t \mid m, c) + (1 - t^*_t) \log \big(1 - p(t \mid m, c)\big) \right]$

where $t^*$ indicates the (multi-hot) ground-truth types of the mention. Since the training data were automatically generated by linking the mention to all of its labels in the KG, which is exactly the copy vocabulary, the model may overfit the weakly labeled training data. This is fine for most types of entities, such as /location and /organization, which usually have the same type in different contexts; however, it is problematic for context-dependent entities, such as /person entity mentions. To alleviate this overfitting issue, we added a random fine-grained type label that did not belong to the entity when building the type copying vocabulary. During training, we applied dropout to the two LSTM layers and the MLP layers.
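A hedged sketch of this objective and of the random negative-type augmentation is given below; the binary (element-wise) form of the cross-entropy is our reading of the description above, not the authors' reference implementation.

```python
import random

import torch
import torch.nn.functional as F


def typing_loss(p: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Element-wise cross-entropy between predicted probabilities p (batch, N)
    and the multi-hot ground-truth matrix t* (batch, N)."""
    return F.binary_cross_entropy(p.clamp(1e-7, 1 - 1e-7), gold)


def add_random_negative(t_m: torch.Tensor, fine_grained_ids: list) -> torch.Tensor:
    """Add one random fine-grained type the entity does not have to T_m, so the
    copy model cannot simply memorize the weakly labeled copy vocabulary."""
    candidates = [i for i in fine_grained_ids if t_m[i] == 0]
    if candidates:
        t_m = t_m.clone()
        t_m[random.choice(candidates)] = 1.0
    return t_m
```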

Experiments
In this section, we evaluate the effectiveness of the proposed COPYFET on two public datasets.

Datasets
We used two publicly available benchmark datasets for the FET experiments: Wiki/FIGER (GOLD) [9] and BBN [24]. The percentage of the entity mentions' typing facts in the two benchmarks (testing data) that are already covered by existing KGs is shown in Table 1, and other statistics are included in Table 2.

Wiki/FIGER (GOLD). Ref. [9] extracted a dataset from Wikipedia articles and news reports to form the training, validation, and testing data, and annotated entity mentions using 113 types with a two-level hierarchy.
BBN [24] is based on a portion of the one-million-word Penn Treebank corpus of Wall Street Journal articles and is completely manually annotated using 56 types with a two-level hierarchy. Ref. [44] regenerated the training corpus via distant supervision.

Baselines
To demonstrate the effectiveness of our proposed model COPYFET, we compared its results with several state-of-the-art FET models:
• AFET [44]: one of the most widely used FET models. AFET models the samples with only one label and the samples with multiple labels separately, with a partial-label loss to handle noisy labels.
• Attentive [63]: a popular attention-based neural network model which uses an attention mechanism to focus on relevant information.
• AAA [45]: an extension of AFET which jointly encodes entity mentions and their context representation.
• NFETC [38]: a very popular model which formulates FET as a single-label classification problem with a hierarchy-aware loss.
• NFETC-CLSC [48]: an influential extension of NFETC which utilizes imperfect annotation as model regularization via compact latent space clustering to address the confirmation bias problem.
• IFETET [34]: a FET model which utilizes entity type information from a KB, obtained through entity linking, to form the final feature vector of a mention.
• NDP [7]: a random-walk-based model which weighs out noise with a loss function.
• HFET [41]: a popular ELMo-based pretrained language model which adopts a hybrid type classifier.
• HET [8]: a recent model that takes the hierarchical ontology into account with a multilevel learning-to-rank loss and gains great performance improvement.
• FGET-RR [50]: a recent model that refines the noisy mention representations by attending to corpus-level contextual clues prior to the end classification.
• Box [64]: a recent box-embedding-based model for fine-grained entity typing.

Experimental Settings
In COPYFET, we evaluated the performance by strict accuracy (Strict Acc), loose macro F-score (Macro-F1), and loose micro F-score (Micro-F1), which are the most widely used evaluation metrics for FET systems [9]. We used pretrained word embeddings from [65]. The settings are shown in Table 3. For training our model, we selected the parameters λ and β from {0.1, 0.2, 0.5, 0.7} based on the validation set. Finally, λ was set to 0.5 and 0.7 on Wiki/FIGER (GOLD) and BBN, respectively, and β was set to 0.5 on both Wiki/FIGER (GOLD) and BBN.

Table 4 shows the results of fine-grained entity typing. We can observe that COPYFET outperforms all baselines in terms of all metrics on Wiki/FIGER and BBN. Specifically, our model improves the strict accuracy on the two datasets to 76.4% and 83.6%, respectively, which confirms the capability of COPYFET to incorporate the copy mechanism via the copy-generation framework and to infer types for mentions in text. These results are in line with our intuition that the proposed model is capable of leveraging entity-typing information from existing KGs to build the type's copying vocabulary. This is the main feature that leads to the better performance of COPYFET: it can substantially narrow down the list of possible type labels for unlabeled mentions and thus significantly improve the performance of FET.
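For reference, the three evaluation metrics mentioned above can be computed as in the following sketch, which follows the standard definitions of [9]; the function names are ours.

```python
from typing import List, Set


def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def evaluate(preds: List[Set[str]], golds: List[Set[str]]):
    """Strict accuracy, loose macro-F1, and loose micro-F1 over predicted/gold type sets."""
    n = len(golds)
    strict = sum(p == g for p, g in zip(preds, golds)) / n

    # Loose macro: average per-mention precision and recall, then take the F1.
    ma_p = sum(len(p & g) / len(p) if p else 0.0 for p, g in zip(preds, golds)) / n
    ma_r = sum(len(p & g) / len(g) if g else 0.0 for p, g in zip(preds, golds)) / n

    # Loose micro: aggregate the counts over all mentions, then take the F1.
    inter = sum(len(p & g) for p, g in zip(preds, golds))
    mi_p = inter / max(sum(len(p) for p in preds), 1)
    mi_r = inter / max(sum(len(g) for g in golds), 1)

    return strict, f1(ma_p, ma_r), f1(mi_p, mi_r)
```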

Ablation Study
To evaluate the different components of our model COPYFET, we conducted an ablation study. To this end, we generated a variant of COPYFET by removing its copy component (called COPYFET-Generation-only) and compared the fine-grained entity-typing performance on Wiki/FIGER and BBN. Table 5 shows the fine-grained entity-typing results of this variant. We can observe that the copy model is significant: removing the copy module leads to drops in all metrics, including strict accuracy, on both Wiki/FIGER and BBN, which indicates that learning to infer types for unlabeled mentions by referring to the known entity-typing facts in KGs can be quite beneficial. The results demonstrate that our model COPYFET can successfully take into account the known entity-typing information from existing KGs via the copy mechanism to improve FET.

Next, we analyzed the detailed results of COPYFET compared to COPYFET-Generation-only. In Figure 2, we present the type-wise performance for the top-10 most frequent types in the FIGER testing dataset. Compared to COPYFET-Generation-only, COPYFET performs better on all of these types.

Figure 2. Performance analysis of COPYFET on the top 10 types present in the FIGER dataset. On these ten types, COPYFET performs better than COPYFET-Generation-only.

Table 6 gives two examples of fine-grained entity-typing results on the FIGER and BBN testing sets. For example, given the mention "Unitec Institute of Technology" and its context, possible fine-grained types are inferred by COPYFET-Generation-only and COPYFET, respectively. The former makes a false prediction (i.e., /location), while the latter makes a correct inference, since the known entity-typing fact ("UNITEC", /organization/educational_institution) is already included in the KG (e.g., Freebase). This is quite helpful for the prediction and illustrates the efficacy of the copy model in COPYFET.

Table 6. Example type predictions on the FIGER and BBN testing sets using COPYFET-Generation-only and COPYFET. Bold indicates the correct prediction.

Wiki/FIGER example sentence (Table 6 excerpt): "The study is from the Unitec Institute of Technology, Auckland, New Zealand."

Conclusions and Future Work
In this paper, we proposed a novel model architecture for fine-grained entity typing using KGs. The proposed model leveraged the popular copy mechanism, which "copies" an inferred fine-grained type probability distribution for the target mention from a knowledge graph. The copied distribution was then combined with the output of a regular generation model that predicts a distribution over the full set of types. This new method achieved new state-of-the-art results on FET, outperforming previous methods based on weak supervision or knowledge injection using KGs. The ablation analysis showed that the copy module contributed significantly to the improved prediction quality. Interesting directions for future work include making the mapping from the target types to the KG types soft (currently it is a hard one-hot mapping) and trainable, and adding a denoising module to the copy-generation networks.