Attention-Based Joint Entity Linking with Entity Embedding

Entity linking (also called entity disambiguation) aims to map the mentions in a given document to their corresponding entities in a target knowledge base. In order to build a high-quality entity linking system, efforts are made in three parts: Encoding of the entity, encoding of the mention context, and modeling the coherence among mentions. For the encoding of entity, we use long short term memory (LSTM) and a convolutional neural network (CNN) to encode the entity context and entity description, respectively. Then, we design a function to combine all the different entity information aspects, in order to generate unified, dense entity embeddings. For the encoding of mention context, unlike standard attention mechanisms which can only capture important individual words, we introduce a novel, attention mechanism-based LSTM model, which can effectively capture the important text spans around a given mention with a conditional random field (CRF) layer. In addition, we take the coherence among mentions into consideration with a Forward-Backward Algorithm, which is less time-consuming than previous methods. Our experimental results show that our model obtains a competitive, or even better, performance than state-of-the-art models across different datasets.


Introduction
Given a query consisting of mentions (name strings) and their background document, entity linking involves linking the mentions to their corresponding entities from a reference knowledge base, such as Wikipedia [1]. It is a fundamental task in the field of natural language processing, which can facilitate many other tasks, such as semantic search, question answering, relation extraction, and text understanding [2].
The challenge of entity linking comes from polysemous mentions and various forms of entities. For example: The mention "Jordan" could refer to the basketball player "Michael Jordan" or the actor "Michael B Jordan", and the entity "Michael Jordan" could be named as "His Airness" or "MJ". All the existing state-of-the-art methods regard entity linking as a ranking problem. They try to find the most semantically matched entity for every given mention as its predicted entity [1,[3][4][5]. The key difference among these methods is how to encode the entity and mention contexts.
In a knowledge base (e.g., Wikipedia), one can obtain certain kinds of information about an entity, such as the entity context and entity description, as shown in Figure 1. Generally, these information reflect different aspects of entity, and our experimental results show that the entity linking system could obtain better performance with an entity embedding framework which can capture a larger range of different entity information aspects. Figure 1. In Wikipedia, each entity has a canonical page and this is the canonical page for the entity "Michael Jordan". The description for the entity "Michael Jordan" consists of the words in the blue box. There are many hyperlinks in the page and the words around the hyperlink form the entity context for the corresponding entity. For example, the context for the entity "North Carolina Tar Heels men's basketball" consists of the words in the orange box. Some earlier methods [2,6] only take the entity description into consideration, with heuristic methods like BOW or TF-IDF. However, Sun [1] argues that these methods are insufficient to disentangle the underlying explanatory factors of the data and proposes a method which employs a convolutional neural network (CNN) to encode the entity description. Some other methods [3,[7][8][9] try to encode the entity, based on the idea of word embedding, which only takes entity context into consideration. However, all the methods above fail to capture different information aspects of entity, which could result in a loss of information. Inspired by Nitish Gupta's work [4], we design an entity embedding framework which can capture different information aspects of entity. Other entity information aspects may be lacking, like entity type, while the entity description and entity context are common for most knowledge bases. So, in our work, we mainly investigate how to effectively encode and combine entity description and entity context, to generate dense unified entity embeddings. It is also worth noting that our entity embedding framework could be easily extended to the case in which there are more entity aspects. We use long short term memory (LSTM) [10] and CNN [11] to encode the entity context and entity description, respectively, and afterwards we design a function to encourage the entity embedding's similarity to all of the encoded representations (e.g., representation of entity context and representation of entity description). Compared to previous methods, our approach can capture different aspects of entity information, and our experimental results show that our global model could obtain a better performance, with entity embeddings which can capture richer entity information aspects.
The semantics of mention mainly come from the mention context and other mentions in the given document. For mention context, most previous methods assume that all the words in the context have the same importance [1,4,6,8,12]. Obviously, this should be investigated carefully. Some works [8,13] try to introduce the standard attention mechanism to fix this problem. However, the standard attention mechanism can be viewed as a process of performing soft selections of individual words independently; it ignores the dependencies between words, and may make mistakes when complex expressions are involved. Inspired by Wang's work [14], we propose a novel conditional random field (CRF)-based attention mechanism, which can effectively capture the important text spans for the mention with a CRF layer. The coherence among mentions could also be helpful in entity linking. For example, when the mention "Jordan" and the mention "Bulls" both appear in a document. We can infer that the mention "Jordan" refers to the entity "Michael Jordan" and the mention "Bulls" refers to the entity "Chicago Bulls" with a high probability. In this paper, we use the Forward-Backward Algorithm to calculate the coherence among mentions, which is less time-consuming than the graph-based methods and probabilistic methods used in earlier entity linking systems [2,3].
The main contributions of this work are the following: 1. We present an entity embedding framework, which can effectively capture different information aspects. 2. We are the first ones who use a CRF-based attention mechanism to capture the important text spans in the mention context, to improve the performance of our linking system. 3. We take the coherence among mentions into consideration with the Forward-Backward algorithm, which is less time consuming than those graph-based models used in previous work. 4. Based on the above three contributions, we build our global model. Our experimental results show that our model can achieve a competitive, or better, performance than state-of-the-art models.

Related Work
There have been a lot of studies on entity linking but, in this section, we only focus on those prior works which relate to our main contributions.

Encoding of Entity
Similar to the idea of word embedding [15], the main goal of entity embedding is to compress the relevant information of an entity into a vector. In entity linking systems without entity embedding, such as the method proposed in [2], the coherence between two entities is often calculated by counting the number of the page they share, which is called a co-occurrence entity pair. However, this method can suffer from sparsity issues and large memory footprints. In addition, there are often some new entities, which need to be added into a knowledge base in practice, while all the co-occurrence entity pairs need to be recounted under this framework when new entities are added. With the help of entity embedding, new entities could be added in an incremental manner, just as with word embedding [15], which is necessary in practice. To the best of our knowledge, Stefan Zwicklbauer [3] was the first one to extend word embedding to entity embedding. Later, some works [8,13] followed Stefan Zwicklbauer's work. The framework proposed by Zwicklbauer only takes entity context into consideration. It ignores the fact that the entity description is also an important semantic aspect, which may result in loss of information. These days, some works [1,16] have been carried out to address the problem. However, they fail to obtain unified entity embeddings, like in the framework proposed by Matthew [16], which uses the CNN to encode the entity title and entity description, respectively, but both of the encoded representations are independent when linking mentions to their corresponding entities. What is special, in our model, is that we design a function to encourage the entity embedding to be similar to all the encoded representations. Our model can capture different information aspects about entities and generate unified dense entity embeddings. More importantly, entities could be added in an incremental way in our method, which was often ignored in previous methods.

Encoding of Mention Context
There have been many approaches to encoding the mention context in the literature. Z. Chen et al. [6,12] used some heuristic methods, such as bag of words (BOW) or TF-IDF , to encode mention context. However, those heuristic methods capture the semantics of mention context in a coarser way than deep learning methods. Yaming Sun et al. [1] used CNN and stack denoising auto-encoders to encode the mention context. Some other deep learning models were also used in entity linking to encode the mention context, such as LSTM and Doc2vec [15]. But, all the methods above made the same assumption: That all the words in the context have the same importance, which should be investigated carefully. To fix this problem, some attention-based methods were proposed, Octavian et al. [8,13,17] proposed a standard attention mechanism, based on LSTM, which tried to capture the individual important words. However, the standard attention mechanism was usually achieved by concatenating vectors and sending them to a multilayer perceptron (MLP), and could be viewed as a process of performing soft selections of individual words independently. It ignored the dependencies between words. Our model uses CRF to capture the dependencies and find the important text spans, rather than individual words.

Modeling Coherence among Mentions
As we mentioned earlier, the coherence among mentions could also be helpful in entity linking, especially when we can not disambiguate the mention only by its context. Han et al. [2,8,18] tried to use a graph model to calculate the coherence among metions to improve the performance of entity linking. Some probabilistic models were also used in entity linking, such as random walk [19]. However, some researchers [13,20] argued later that all those graph and probabilistic models are time-consuming. So, simplifying the methods of calculating the coherence among metions has become a research hotspot for entity linking. Some researchers have tried to use heuristic algorithms to fix the problem, such as the Hill-climbing Algorithm [20]. More recently, Phan et al. [13] indicated that assuming that all the mentions in the same context should have similar topics is not suitable. Typically, only a few of the mentions in the same context have a high coherence. Using this assumption disambiguates the two mentions which have the highest confidence in an iterative way. Although the method simplifies the calculation a lot, it is not robust enough. In our model, we use the Forward-Backward Algorithm to calculate the topic information, which is robust and less time-consuming.

Definition
Formally, given a document D containing a set of words D = {w 1 , ..., w i , ...w t } (where w i denotes the i-th word in the document), a set of mentions M = {m 1 , m 2 , ..., m n } (where m i denotes the i-th mention in the document), and each mention has context C = {w m−s ..., w m−1 , w m , w m , w m+1 , ...w m+s } (where w m denotes a word which is a part of a mention and s is the window size of the mention context), if K is a a target knowledge base, then entity linking is a task to find a N-tuple u = {e 1 , e 2 , ..., e n }, e i ∈ C m i ⊆ K (where C m i is the candidate entity set for m i and e i is a candidate entity of C m i ). The entity candidate set is often used to improve the accuracy and efficiency of the entity linking algorithm (see more details in Section 5.2). All the early methods can be formulated by Equation (1).
Here, ϕ(m i , e i ), called the local score, reflects the likelihood mapping m i → e i , based on the semantic similarity between the mention context and entity which can be computed independently. The model which only takes the local score ϕ(m i , e i ) into consideration is called the local model. ψ(u) is called the co-occurrence score, and reflects the coherence among the mentions which was often calculated by graph or probabilistic methods in the previous works. The model which makes use of both ϕ(m i , e i ) and ψ(u) is called the global model. Most of the state-of-the-art models try to find the N-tuple u * by maximizing the confidence of each assignment ϕ(m i , e i ), which can be determined by the cosine similarity between the entity embedding of the entity e i and the mention context representation of the mention m i , while enforcing the coherence among all the linked entities ψ(u). The coherence between two entities are often calculated by the cosine similarity between embeddings of the two entities. The key difference between these models is how to encode entity and mention context. In our paper, we mainly focus on the representations of entities and mentions.

Model
In this section, we first describe how to build the framework of entity embedding (Section 4.1). Then, we show the details of the mention context encoder and our local model (Section 4.2). Last, we address the global model, assuming the coherence among the mentions (Section 4.3). We show the relations among the three parts in Figure 2. Figure 2. Overview of the model. The input of the encoder for the mention context is the mention context and the inputs for entity embedding are the entity description and the entity context. Firstly, we get the representations of entity and mention context, through the entity embedding framework (Section 4.1) and mention context encoder (Section 4.2), respectively. Then, we get the local scores by computing the cosine similarity between the two representations (Section 4.2). Lastly, we take the coherence among mentions into consideration with a Forward-Backward Algorithm, and gain the global scores based on local scores (Section 4.3).

Framework of Entity Embedding
The semantics of entity may come from many different aspects such as entity description and entity context. The key to an accurate entity embedding framework is the effective encoding and combination of different information aspects. Inspired by the work of Gupta [4], we use LSTM and CNN to encode the entity context and entity description, respectively. Then, we combine all the different information aspects to get unified dense entity embeddings, based on the assumption that an entity embedding should be similar to all its related encoded representations (e.g., the representations of entity context and of entity description).

Encoder for Entity Context
As Figure 1 shows, the entity contexts are usually short and their word order can not be ignored. LSTM is a model which could effectively capture the word order so, in this case, we employ LSTM as our basic model. The processing is shown in Figure 3.
In more detail, given a hyperlink and its context C = {w 1 , ..., w m , ...w 2s }, we split the context into two parts by the anchor text w m , le f t = {w 1 , w 2 , ...w m }, right = {w 2s , w 2s−1 , ..., w m }. The left-LSTM is applied to the sequence le f t = {w 1 , w 2 , ...w m } with the output h l , while the right-LSTM is applied to the sequence right = {w 2s , w 2s−1 , ..., w m } to produce h r . After that, we concatenate the vector h l and the vector h r , and send it to a MLP layer to get the encoded entity context representation v c . Inspired by the work proposed by Yamada [9], we design a function that enables entity embeddings to contain the information of the entity context: .
(2) Figure 3. Encoding the entity context information with LSTM. The input of the model is word embeddings of the words in the context, which may come from an annotated corpus (e.g., Wikipedia hyperlinks in our case).
In this equation, e is the corresponding entity to the current mention and v e is the entity embedding of the entity e. The entity embeddings v e are randomly initialized and will converge through the course of training. C m is the entity candidate set of the current mention, and v c k is the entity embedding of a candidate entity. Here, based on the assumption that the context-encoded representation should be similar to its corresponding entity embedding and dissimilar to other entity embeddings, we maximize the cosine similarity between v c and v e , and minimize the cosine similarity between v c and the embeddings of the other entities in C m .

Encoder for Entity Description
As Figure 1 shows, the description is usually long and noisy. In our model, we use CNN to encode the entity description, which has proved efficient in handling the text-like entity description in Yoon Kim's work [21]. The framework is shown in Figure 4.
As Figure 4 shows, the input of the model is the word embeddings of the description. We feed it into the CNN and get the encoded description representation v desc . Similar to Equation (3), we encourage v desc to be similar to its corresponding entity embedding v e and dissimilar to other candidate entity embeddings, and get the expression p desc (e|v desc ).

Combine Different Information Aspects
We design a function to encourage entity embeddings to be similar to all the encoded representation and get a unified dense entity embedding.
Here, e is an entity of the target knowledge base K and {v e } are the trained entity embeddings for all the entities in the knowledge base. θ are the training parameters of the model. v c is the encoded representation of entity context and v desc is the encoded description representation of the entity e. P text (e|v c ) denotes the similarity between the current entity embedding and its context encoded representation (see Section 4.1.1). p desc (e|v desc ) denotes the similarity between the current entity embedding and its encoded description representation (see Section 4.1.2).

Attention-Based Local Model
We build our attention-based local model on the insight that only a few text spans are important for disambiguation. Focusing on those text spans can help to reduce noise and improve the performance of entity linking. This model tries to stimulate the processing of human beings by using a CRF layer. Given a mention and its context, people try to find the relevant text spans around the mention to infer its corresponding entity. The model is shown in Figure 5. Furthermore, we design two additional regularizations to guide this model. In this section, we disambiguate the mention m by its context C = {w 1 , w 2 , ..., w 2s }. Here, s is the window size of the mention context. Sometimes, once we know the mention, we can nearly infer its corresponding entity. To highlight the current mention words, we introduce a binary variable i t ∈ {0, 1}, which is called the mention indicator, to indicate whether the i-th is a part of the current mention or not. As Equation (4) shows, the input V of the model is the concatenation of the word embedding and mention indicator embedding.
Here, d is the size of the word dictionary, d 1 denotes the dimension of word embedding, W emb (w t ) denotes the mapping of a word onto a vector, d 2 denotes the dimension of the mention indicator embedding, and W mask (i t ) denotes the mapping of a mention indicator onto a vector. V is sent to a Bi-LSTM layer. For each unit of BiLSTM, we get two hidden vectors: The forward hidden vector − → h t for forward transmission of information, and the backward hidden vector ← − h t for backward transmission of information. We concatenate the two vectors − → h t and ← − h t as the final output of each unit of BiLSTM layer, shown in Equation (6) R = [r 1 , r 2 , ..., r 2s ].
As Figure 5 shows, for every word in the given context, we introduce a binary variable z ∈ {0, 1} to indicate whether the word is important or not. The importance state changes, along with the mention context. For a certain mention context, we could use a path P = {z 1 , z 2 ....z 2s } to denote a possible importance state changing, where z i is the importance state corresponding to the i-th word in the mention context. The space of the possible paths P grows exponentially, corresponding to the number of given words. Formally, the importance distribution of the given mention context is calculated by Equation (8) Pr where Pr is formed by z i , f (z i , R) is the potential function of the path P, and Z(R) is a normalization factor calculated by summing the potential functions of all the possible paths. In detail, f (z i , R) is formed by two potentials on the vertices of Equations (11) and (12) for an undirected graphical model, respectively.
We know from Equations (10)-(12) that the importance of the current word is dependent on the current word itself and its adjacent words. Here, W v ∈ R 2×2d h is the state characteristic matrix, which maps context representation to the feature score of the importance state, and d h is the dimension of r t . Further, W e ∈ R 2×2 is a transition matrix which denotes the transmission between every adjacent word's importance state. After we get the importance distribution of every word in the context, we get the mention context encoded representation v text by weighted summarization, which is shown in Equation (13) v text = ∑ z p(z)g(R, z), where g(R, z) is a feature function, which is defined based on the selection of path P. However, enumerating all the possible paths would be very time-consuming and so, in this paper, we simplify the calculation by dynamic programming with a message passing model and Equation (13) can be rewritten as Equation (15) As mentioned earlier, entity linking can be regarded as a ranking problem. The entity which has the most similar semantics to the given mention will be picked as the mention's predicted entity. Here, we maximize the local score of the corresponding entity of the current mention and minimize the local score of other entities in the candidate set of the mention. The local score is formulated in Equation (16), and the optimal object of the local model is shown in Equation (17) ϕ(m, e) = exp(v text · v e ). (16) opt(m, e) = ϕ(m, e) ∑ c k ∈C m ϕ(m, c k ) .
The key difference between our attention mechanism and the standard attention mechanism is that our model can capture important text spans, rather than individual words in the given context; which is more similar to human thought. Inspired by the framework proposed by Bailin Wang [14], we introduce two regularizations to improve the performance of our model, based on the assumptions that only a few text spans are important for entity linking: Ω 1 is designed to discourage the importance state change between every two adjacent words, which enforces our model to focus on the text spans rather than individual words. Ω 2 is designed to discourage the number of important spans, which also enforces our model to focus on the text span, which is really important for entity linking.

Global Model
The assumption that all the mentions which appear in the same context share a similar topic is often used to improve the performance of entity linking. This is unlike local models, which ignore the coherence among mentions and disambiguate all the given mentions independently. In this section,  Our global model is defined by Equation (20) score(e|m i ) = ϕ(m i , e) + M i−1,i (e) + M i+1,i (e). (20) To disambiguate all the given mentions, we need find the most semantic matched entity for each mention, while all the chosen entities should have a high coherence. Here, ϕ(m i , e) denotes the semantic matching rate between the entity e and the i-th mention m i , which can be calculated by our local model (mentioned in Section 4.2). M i−1,i (e), M i+1,i (e) denotes the coherence between a candidate entity e of m i and its adjacent entities. We calculate the coherence using dynamic programming with a Forward-Backward Algorithm. More specifically, M i−1,i (e) denotes the forward message passing and M i+1,i (e) denotes the backward message passing. The only difference between M i−1,i (e) and M i+1,i (e) is the direction of message transmission. Here, we only give the details of M i−1,i (e), which can be formulated by Equation (21). We can calculate M i+1,i (e) in a similar way.
where M i−1,i (e) denotes the message passing from the i − 1th mention to the entity e of the ith mention, and e ∈ C m i−1 denotes that e is a candidate entity for m i−1 . Here, φ(m i−1 , e ) denotes the semantic matching rate between e and the context of m i−1 . φ(e , e) denotes the coherence between two entities, which can be formulated by Equation (22) φ where A ∈ R d e ×d e is randomly initialized and needed to be trained, v e and v e are the corresponding entity embeddings for the entities e and e , and d e is the dimension of the entity embedding. Following the idea of the local model, the optimal function of our global model is designed as in Equation (23). We encourage our ground-truth entity to have a higher global score than the other candidate entities.
Our final loss function, for the whole model, is shown in Equation (24) where λ 1 , λ 2 denote the coefficient of each of the regularizers, and Ω 1 (z), Ω 2 (z) are as mentioned in Section 4.2.

Experiments
In this section, we first give the benchmark datasets that we used. Next, we introduce how to obtain the candidate sets for each mention, how to train the entity embedding, and how to train our local model and global model, based on pre-trained entity embedding and word embedding. Last, we show the performance of our entity linking system and the state-of-the-art models that we compared our model to.

Datasets
Here, we give the details of the datasets involved in our experiments: AIDA-CoNLL [22] is one of the biggest manually-annotated datasets for entity linking. It consists of three parts: AIDA-train (a training set), AIDA-A (a validation set), and AIDA-B (a test set). WNED-CWEB are automatically extracted from clue-web and cleaned by Barbosa [23]. ACE04 and AQUAINT are both cleaned and updated by Guo and Barbosa [23,24]. The statistics of all the datasets are shown in Table 1.

Candidate Generation
To disambiguate the mentions provided, computing the semantic similarity between the current mention and all the entities in the knowledge base is impractical. Generating an entity candidate set for each mention is a very important step for both efficiency and accuracy of entity linking. We pick the top 30 popular entities for each mention as its entity candidate set. The most common way of estimating popularity is by using the Wikipedia-based frequencies of particular names in link anchor texts referring to specific entities. In this paper, we build our hyper-link statistics from Wikipedia (February 2014) and a large web corpus [25]. The statistics for each of the datasets is shown in Table 2. Table 2. Statistics for candidate generation. Gold recall denotes the ratio of the mentions for which the candidate entity set includes the ground truth entity. , where e m i is the ground truth entity of m i , C m i is the candidate entity set for m i , and n is the size of the dataset. From Table 2, we can see that dividing the candidate set into a small number did unnecessary harm to the recall. Our model can obtain a better efficiency in the case of less candidate entities.

Disambiguation Step
Our entity embedding framework was implemented in the Tensorflow framework. Once we had the target knowledge base, we could train the entity embedding. It was an off-line work. Our global model and local model were both implemented in the Pytorch framework. Only when the mentions and contexts were provided, could we link the mentions to their corresponding entities through the model; thus, it was an online work.

Hyper-Parameters Setting
Parameters for entity embedding model: We used Wikipedia (2014Feb) as our training data. We used existing links in Wikipedia with anchors as mentions and links as true entities. The window size of entity context was set as 5. As for description for each entity, we used the first 150 words in its corresponding Wikipedia page as input. The hyper-parameters of our entity embedding model are summarized in Table 3. We used different combinations of parameter settings to train our model, and the parameters settings with which our model could obtain the best performance are given. Hyper-parameters for local model and global model: The window size of the mention context was set as 5. The hyper-parameters of our global model are summarized in Table 4.

Evaluation Matrix
Here, we design experiments to verify the validity of our main contributions. We give the result of our final global model, compared to other advanced entity linking systems, to show that our model could obtain a competitive, or even better, performance than other advanced entity linking system. Our models are trained on AIDA-train, validated on AIDA-A, and tested on AIDA-B and the other datasets, mentioned in Section 5.1.
Here we focus on six variations of our model: To explore the validity of our attention mechanism, we designed three variations of our local model which disambiguate the given mentions independently, only by each mention context with pretrained entity embedding, making use of both entity context and entity description: (1) LSTM-MEAN: Local model without attention mechanism, the importance scores of the words in the mention context were assumed to be the same.
We assume that all the mentions are provided, and we do not predict NIL (unlinkable) mentions. We only consider a linking to be correct when the first entity in the ranking result is equal to the ground truth. Table 5 shows F1 scores of the existing model and several variations of our model on AIDA. From the results of LSTM-MEAN, LSTM-A, and LSTM-A-R, we find that our linking system can obtain a great improvement with the help of our attention mechanism, and the regularizers designed in Section 4.2 could enforce our model's focus on the text, an important factor in disambiguating the given mention. From the results of the three variations of our global model, we can see that our global model achieved a better performance with the entity embedding framework, which could capture richer entity information aspects. By comparing LSTM-A-R and Global Model CD, we see that the coherence among mentions can be used to further improve the system, by use of our algorithm. We re-implemented some of the state-of-the-art models and tested on the other datasets. The results are shown in Table 6.
Our model is trained only on AIDA-train without optimizing on a special dataset. The results show that our model obtained better performance than the baselines across different datasets, which means that our model has a strong generalization ability.

Case Analysis
Here, we give our analysis of the results of our experiments. In the Figure 7, the x-axis denotes the words in the context and the y-axis denotes the importance score of each word. Here, the word "China" is the current mention, and its corresponding entity is "China national football team". From this case, we can see that the mention word "China" gains the highest importance among all the words in the given context. Generally, we find the mention words usually have a high importance score in almost all the cases. This is because the mention words, themselves, are strongly related to its corresponding entity. You can almost infer the most possible corresponding entity from the mention. Besides the mention words, our attention mechanism could also capture some other text spans which are helpful in entity disambiguation. In this case, according to the statistics, the two most possible entities for the mention "China" are "China (country)" and "China national football team". The attention mechanism captured the text spans, which are helpful for distinguishing the two entities. Here, we find that the words "championship", "their", and "them" gain the highest score, except for the mention word "China". The words "their" and "them" tend to refer to a group of people, rather than a country, and the word "championship" has a close relation to sports. Focusing on those words, we could infer that the mention of "China", in this case, refers to the entity "China national football team" with a higher probability than the entity "China (country)". We find that words about time often obtain a low importance score, such as "Friday" in this case. Some other words, such as prepositions like "on" and conjunctions like "But" usually have low importance scores as well. This is similar to the way that people think. People nearly do not disambiguate the given mentions by these words. From Table 5, we find that our model obtained a better performance on AIDA-A than AIDA-B. This is because there are some cases of AIDA-B in which it is hard to disambiguate the given mention only by its local context. For example, consider the context "0 0 2 1 3 Syria 1 0 0 1 1" and the mention "Syria". The ground truth of this case was the entity "Syria national football team". However, according to our statistics, the most possible entity for the mention "Syria" was the entity "Syria(country)". Unfortunately, the context of the mention "Syria" consisted of digital numbers, and it is hard to tell whether the mention refers to a country or a football team by these digital numbers. In cases like this, we may need some other information to infer the corresponding entity of the given mention, such as coherence among mentions. There were 47 similar cases in AIDA-B. So, we can see, with the help of coherence among mentions, our global model obtained a much better performance than our local model in the test set AIDA-B. We also analyzed some other wrong cases. In those cases, most of the mentions had a low frequency to map to the correct entity, according to statistics in Wikipedia. Our model tended to link the mention to the entity which gained a high frequency in the statistics.

Conclusions
In this paper, we focused on how to build a high-quality representation of mention and entity. To encode entity semantics, we built a function to incorporate different aspects of information about entities, in order to obtain dense unified embeddings. For the representation of mention, we introduced a novel attention mechanism which could capture important text spans. In addition, we calculated the topic information with a Forward-Backward Algorithm, which could be helpful to entity linking, especially when we can not predict the correct entity only by its local context. There are still some avenues for future work. Firstly, the semantics of entities could also come from their type and the relations between entities. Taking all of them into consideration could yield a higher quality entity embedding. Secondly, the assumption that all the mentions in a same context may share several topics would be more suitable, and we believe it is worth further research.