Article

An Adaptive Mixup Hard Negative Sampling for Zero-Shot Entity Linking

Shisen Cai, Xi Wu, Maihemuti Maimaiti, Yichang Chen, Zhixiang Wang and Jiong Zheng

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
2 School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
3 Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(20), 4366; https://doi.org/10.3390/math11204366
Submission received: 26 September 2023 / Revised: 15 October 2023 / Accepted: 18 October 2023 / Published: 20 October 2023

Abstract

Recently, the focus of entity linking research has centered on the zero-shot scenario, where the entity to be labeled at test time was never observed during the training phase, or may belong to a different domain than the source domain. Current studies use BERT as the base encoder, as it effectively establishes distributional links between the source and target domains. The negative sampling methods currently available all use an extractive approach, which makes it difficult for models to learn diverse and more challenging negative samples. To address this problem, we propose a generative negative sampling method, adaptive_mixup_hard, which generates more difficult negative entities by fusing the features of positive and negative samples on top of hard negative sampling, and introduces a transformable adaptive parameter, W, to increase the diversity of negative samples. Next, we fuse our method with the Biencoder architecture and evaluate its performance under three different score functions. Finally, experimental results on the standard benchmark dataset, Zeshel, demonstrate the effectiveness of our method.

1. Introduction

Entity Linking (EL) is a critical task in Natural Language Processing (NLP) whose core goal is to associate entity mentions appearing in a document (e.g., names of people, places, organizations) with their referent entities in a knowledge base (e.g., Wikipedia, Freebase). The EL task generally consists of the following main steps: entity detection, candidate entity generation, and candidate entity ranking. The specific steps and methods may vary across application scenarios and tasks: in some tasks, entity detection may not be necessary because the entities in the text are already explicitly labeled, whereas an end-to-end EL task [1,2] usually includes all of these steps. EL has received much attention due to its wide range of applications, including information retrieval [3] and content analysis [4]. While significant progress has been made in building EL systems, most existing research [5,6] is based on the assumption that entity sets are shared between the train and test sets. In practice, however, textual data may come from different domains, topics, and sources, presenting diversity and heterogeneity in the data distribution. This means that the train and test sets may come from different domain distributions, ultimately leading to disjoint entity sets across domains. This situation highlights the need for and importance of zero-shot entity linking [7,8].
The main goal of zero-shot EL is to address two problems. First, it aims to deal with unknown entities: entities never seen in the training data should be successfully linked to the correct entities in the knowledge graph or entity repository. Second, it aims to build EL models that are more general, so that they can adapt to the challenges of different domains, topics, and data distributions, thus increasing the generality and robustness of the models to satisfy the information needs of multiple domains. However, labeled data are often costly to produce or difficult to access in certain specialist areas (e.g., legal). To delve deeper into this problem, Ref. [8] constructed the Zeshel dataset, containing 16 specialized domains, divided into 8 domains for training and 4 domains each for validation and testing, which covers rich textual content for mentions and entities. Without adopting extra resources (e.g., a structured knowledge base) or assumptions (e.g., labeled mentions, a shared entity set), they expanded the scope of zero-shot EL to promote the generalizability of EL systems on unseen domains.
To date, a body of work [9,10,11,12,13] has emerged on zero-shot EL, most of which uses BERT [14] as the base encoder. These research efforts have mainly focused on the candidate generation phase, which has a crucial impact on candidate ranking in EL systems. Ref. [11] encoded mentions and entities with a Biencoder [9] architecture, followed by a Sum-Of-Max (SOM) score function to compute the similarity between them, and trained the model using either hard negative sampling or mixed-p negative sampling. Ref. [13] proposed a Transformational Biencoder, which introduced a transformation into the Biencoder to encode mentions and entities and adopted an In-Domain negative sampling strategy, in which all entities in the golden entity's domain are sorted over a training period and the top-k entities are taken as hard negatives.
Concerning current negative sampling methods in the field of zero-shot EL, we note that they generally employ an extractive strategy, resulting in a lack of diversity in the selected negative samples and thus limiting the model's ability to acquire richer knowledge. Furthermore, hard negative sampling aims to select more challenging negative samples to expose the model to more demanding tasks; however, the negative samples selected by current hard negative sampling strategies are still not challenging enough. Another problem is that the current zero-shot EL task is usually divided into two phases, candidate entity generation and ranking, which together take a significant amount of time. Therefore, in the candidate entity generation phase, we believe that we need not only to improve the recall of candidate entities, providing a richer pool of candidates for the ranking phase, but also to improve accuracy, so that the model can already match the correct entity during candidate generation. These improvements help to increase both the performance and the efficiency of a zero-shot EL system.
In this paper, we propose adaptive_mixup_hard, a generative negative sampling method built on hard negative sampling. Its main innovation is to generate more difficult negative samples by fusing the positive sample features of the current mention with negative sample features. In the generation process, a transformable adaptive parameter W is introduced, which enables the model to generate rich and diverse negative samples during training, compensating for the shortcomings of existing extractive negative sampling methods. In addition, this method inherits from hard negative sampling the property that the selected negative entities are semantically different from the golden entity yet close to it in the embedding space, which helps to improve the differentiation between the golden entity and the negative entities. We combine this negative sampling method with the Biencoder architecture to form a new model, Biencoder_AMH, in which we adopt three different score functions (DUAL, Pooling Mean, and SOM) for similarity calculation. Validated by extensive experiments, our model achieves improvements in top-64 recall and accuracy compared to previous work, which contributes to matching the final correct entity already at the candidate generation stage. Notably, our model achieves improvements of varying degrees on each of the r@k (k = 4, 8, 16, 32, 64) metrics, indicating that our approach not only improves performance but also provides strong support for research on the candidate entity ranking stage. More importantly, we show results at a finer granularity, demonstrating that the improvement in model performance is not limited to a single domain in the test set but holds across multiple domains, which is more in line with the context and goal of zero-shot entity linking.
Our contributions can be summarized as follows:
  • We propose adaptive_mixup_hard negative sampling, a variant of hard negative sampling that enables the model to cope with more demanding challenges. We then merge this method with the Biencoder [9] architecture to construct a new model, Biencoder_AMH.
  • Our negative sampling method is a generative approach that produces diverse negative samples, which helps the model learn the data distribution more comprehensively, reduces the potential risk of overfitting, and improves generalization performance.
  • After extensive experimental validation, our method achieves not only a significant improvement in top-64 recalls but also a certain degree of improvement in accuracy when compared with other negative sampling strategies (Random, Hard, Mixed-p) under three different score functions (DUAL, Pooling Mean, SOM).

2. Related Works

We discuss related work to better contextualize our contributions. The entity linking task can be divided into candidate generation and ranking. Previous work has used frequency information, alias tables, and TF-IDF-based methods to generate candidates. For candidate ranking, Refs. [5,15,16,17,18] have established state-of-the-art (SOTA) results using neural networks to model context words and spans; auxiliary information such as entity types and graph structure also helps linking [19,20,21].
In the EL domain, negative sampling strategies aim to efficiently select negative samples to optimize the performance of EL tasks. Ref. [8] proposed the zero-shot entity linking task. Recently, the strategy of negative sampling has been widely used in the candidate generation phase in the domain of zero-shot entity linking. Ref. [9] followed [22] by using hard negatives in training. They obtained hard negatives by finding the top 10 predicted entities for each training example and added these extra hard negatives to the random in-batch negatives. Ref. [11] demonstrated the results obtained with different negative sampling strategies (Random, Hard, and Mixed-p) on different architectures and showed theoretically and empirically that hard negative mining always improves performance for all architectures. Ref. [13] suggested that negatives that are lexically similar, semantically different, and close to the golden entity representation are more difficult. As a result, they considered the domain of the golden entity. They sorted all entities in the golden entity domain over a training period to take the top-k entities as hard negatives. However, the negatives generated by the above methods are not difficult enough. Therefore, we generate more difficult negatives based on hard negative sampling by incorporating the features of the golden entity, allowing the model to face more difficult challenges.
For candidate generation, Ref. [8] used BM25, a variant of TF-IDF, to measure the similarity between mentions with their contexts and candidate entities with their descriptions. Numerous applications to the Zeshel dataset have followed this work, with BERT [14] as a highly regarded encoder. Ref. [9] proposed a Biencoder architecture in which textual descriptions of mentions and entities are encoded using two independent BERT encoders; the dot product is then used as a scorer, referred to by [11] as DUAL. Thanks to BERT, the Biencoder provides a robust baseline for the task. Ref. [10] used repeated position embeddings on top of the BERT architecture to address long-range modeling in entity text descriptions. Ref. [11] used the Biencoder framework but adopted the more expressive SOM [23] score function to measure the correlation between mentions and entities and, as a result, achieved better results on the task. Ref. [13] proposed a Transformational Biencoder, which introduced a transformation into the Biencoder [9] to improve the generalization performance of zero-shot EL on unknown domains. Accordingly, we also combine our negative sampling approach with the Biencoder architecture and notably achieve improvements under three different score functions (DUAL, Pooling Mean, and SOM).

3. Methodology

In this section, we describe our adaptive_mixup_hard negative sampling strategy, a variant of hard negative sampling inspired by [24]. We then combine our negative sampling approach with the Biencoder [9] and multiple similarity calculations (DUAL, Pooling Mean, and SOM) to propose our model, Biencoder_AMH. First, we formally present the task definition in Section 3.1. Next, in Section 3.2, we introduce the Biencoder. Then, we describe our adaptive_mixup_hard negative sampling strategy in Section 3.3. Finally, we present our model Biencoder_AMH in Section 3.4. The structure of our model is shown in Figure 1.

3.1. Task Definition

The entity linking task is expressed as follows. Given a mention $m$ in a document and a set of entities $\Psi = \{e_i\}_{i=1,\dots,n}$, EL aims to identify the referent entity $e \in \Psi$ that corresponds to the mention $m$. The goal is to obtain an EL model on the training set of mention–entity pairs $D_{Train} = \{(m_i, e_i) \mid e_i \in \Psi\}_{i \in [1,n]}$ that correctly labels mentions in the test set $D_{Test}$. $D_{Train}$ and $D_{Test}$ are usually assumed to be from the same domain. We assume that the title and description of each entity are available, which is a common setting in entity linking [5,8].
In this paper, we focus on zero-shot EL [8], where both $D_{Train} = \{D_{src}^{i}\}_{i=1,\dots,n_{src}}$ and $D_{Test} = \{D_{tgt}^{i}\}_{i=1,\dots,n_{tgt}}$ contain multiple sub-datasets from different domains. At the same time, the knowledge base is separated between training and test time. Formally, denoting $\kappa_{train}$ and $\kappa_{test}$ as the knowledge bases at training and test time, we require $\kappa_{train} \cap \kappa_{test} = \emptyset$. The collections of text documents, mentions, and entity dictionaries are separated for training and testing, so linked entities are not visible during the test.
Below, we describe the three existing negative sampling methods.
  • Random: The negatives are sampled uniformly at random from all entities of a batch in training data. It can help the model to deal with unknown entities in various situations but may lead to a training process that lacks guidance for specific textual contexts.
  • Hard: This is a more challenging strategy that tries to select negatives semantically similar to the positive examples. In this way, the model faces considerably greater difficulty in learning and needs a better understanding of the meaning of an entity in different contexts. It aims to help models capture semantic information better, but it can also lead to a more strenuous training process.
  • Mixed-p: p percent of the negatives are hard; the rest are random. It maintains a degree of diversity in the training process while introducing a degree of challenge. Previous works have shown that such a combination of random and hard negatives can be effective. Ref. [11] found that performance is not sensitive to the value of p; in this paper, we choose a p-value of 50%. A minimal sketch of all three strategies follows this list.
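The following PyTorch snippet is that sketch: our own illustration of how the three extractive strategies could select negatives from a batch score matrix, not the original implementations; the function name and tensor shapes are hypothetical.

```python
import torch

def sample_negative_indices(scores: torch.Tensor, k: int,
                            strategy: str = "mixed", p: float = 0.5) -> torch.Tensor:
    """scores: [B, B] mention-entity score matrix for one batch, where entry
    (i, i) is the golden entity of mention i. Returns [B, k] negative indices."""
    B = scores.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=scores.device)
    masked = scores.masked_fill(eye, float("-inf"))   # never select the positive
    if strategy == "random":
        return torch.multinomial((~eye).float(), k)   # uniform over in-batch negatives
    if strategy == "hard":
        return masked.topk(k, dim=1).indices          # highest-scoring negatives
    # mixed-p: p percent hard, the rest random (overlaps possible in this sketch)
    n_hard = max(1, int(round(p * k)))
    hard = masked.topk(n_hard, dim=1).indices
    if n_hard == k:
        return hard
    rand = torch.multinomial((~eye).float(), k - n_hard)
    return torch.cat([hard, rand], dim=1)
```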

3.2. Biencoder

Our model is based on the Biencoder [9], which independently embeds mentions and corresponding entities into the same representation space. As shown in Figure 2, the Biencoder comprises a text encoder $E_{P_m}$ for encoding mentions, a text encoder $E_{P_e}$ for encoding entities, and a score function $f$ for calculating the relevance score of a mention–entity pair. $E_{P_m}$ and $E_{P_e}$ share the same architecture but have independent parameters, $P_m$ and $P_e$, and BERT [14] is employed to model both. This approach allows for real-time inference because candidate representations can be cached.
Given a mention–entity pair $(m, e)$, the representation of mention $m$ is composed of the left context $ctxt_l$ and right context $ctxt_r$ of the mention, as well as the mention itself. Specifically, we construct the input for each mention $m$ as:
$m = [\mathrm{CLS}]\; ctxt_l\; [\mathrm{M_s}]\; mention\; [\mathrm{M_e}]\; ctxt_r\; [\mathrm{SEP}] \quad (1)$
Likewise, the entity representation e is also composed of word pieces of the entity title and description. Therefore, the input to our entity e is:
$e = [\mathrm{CLS}]\; title\; [\mathrm{ENT}]\; description\; [\mathrm{SEP}] \quad (2)$
where [CLS], [Ms], [Me], [ENT], and [SEP] are special tokens marking the boundaries of the different pieces of information. For instance, [ENT] is a special token separating the entity title from its description. More specifically, the input of mention $m$ is represented after tokenization as $T_m = \{m_t\}_{t=1,\dots,n_m}$, and the entity $e$ is denoted as $T_e = \{e_t\}_{t=1,\dots,n_e}$. Then, both the input context $T_m$ and the candidate entity $T_e$ are encoded into vectors $V_m \in \mathbb{R}^{n_m \times d}$ and $V_e \in \mathbb{R}^{n_e \times d}$:
$V_m = E_{P_m}(T_m), \quad V_e = E_{P_e}(T_e) \quad (3)$
where d denotes the dimension of representations.
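For concreteness, here is a hedged sketch of how the inputs of Equations (1)–(3) could be built with the HuggingFace transformers library; the marker strings and the 128-token limit are illustrative assumptions, not the authors' released code.

```python
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[Ms]", "[Me]", "[ENT]"]})

mention_encoder = BertModel.from_pretrained("bert-base-uncased")  # E_{P_m}
entity_encoder = BertModel.from_pretrained("bert-base-uncased")   # E_{P_e}
for enc in (mention_encoder, entity_encoder):
    enc.resize_token_embeddings(len(tokenizer))  # account for the new markers

def encode(text: str, encoder: BertModel):
    """Return the token-level representation V in R^{n x d} (Equation (3))."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    return encoder(**inputs).last_hidden_state[0]

# [CLS]/[SEP] are added by the tokenizer, mirroring Equations (1) and (2).
V_m = encode("ctxt_l [Ms] mention [Me] ctxt_r", mention_encoder)
V_e = encode("title [ENT] description", entity_encoder)
```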
The entity linking problem is then reduced to using a score function $f$, i.e., $f(V_m, V_e)$, to quantify the similarity between $V_m$ and $V_e$. For the current mention–entity pair $(m, e)$, the score $f(V_m, V_e)$ should be high if the entity $e$ is the golden entity, and low otherwise.
As shown in Figure 3, there are three existing score functions. Ref. [9] defines a DUAL score function that uses the [CLS] representations $v_{CLS}^{m} \in \mathbb{R}^{1 \times d}$ and $v_{CLS}^{e} \in \mathbb{R}^{1 \times d}$ of $V_m$ and $V_e$ to compute the score $f(V_m, V_e)$.
DUAL:
$f(V_m, V_e) = v_{CLS}^{m} \left(v_{CLS}^{e}\right)^{T} \quad (4)$
Pooling Mean average-pools the embeddings of the mention and entity text fragments to obtain an overall representation of each fragment. This represents each text snippet as a single vector reflecting its average features. Pooling Mean computes $f(V_m, V_e)$ as follows.
Pooling Mean:
$f(V_m, V_e) = \left(\frac{1}{n_m}\sum_{t=1}^{n_m} v_t^{m}\right) \left(\frac{1}{n_e}\sum_{t=1}^{n_e} v_t^{e}\right)^{T} \quad (5)$
In addition, Ref. [11] followed the architecture of [9] and showed that the SOM scorer [23] produces better results than DUAL. However, it is worth noting that SOM comes at an increased computational cost because it considers all hidden states of $V_m$ and $V_e$, meaning that SOM takes more time than DUAL and Pooling Mean in both the training and prediction phases. SOM computes $f(V_m, V_e)$ as follows.
SOM:
$f(V_m, V_e) = \sum_{t=1}^{n_m} \max_{t'=1,\dots,n_e} v_t^{m} \left(v_{t'}^{e}\right)^{T} \quad (6)$
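The three scorers can be written compactly. Below is an illustrative PyTorch rendering of Equations (4)–(6), assuming $V_m \in \mathbb{R}^{n_m \times d}$ and $V_e \in \mathbb{R}^{n_e \times d}$ with the [CLS] vector at position 0; this is a sketch, not the reference implementation.

```python
import torch

def dual(V_m: torch.Tensor, V_e: torch.Tensor) -> torch.Tensor:
    # Equation (4): dot product of the [CLS] vectors (position 0).
    return V_m[0] @ V_e[0]

def pooling_mean(V_m: torch.Tensor, V_e: torch.Tensor) -> torch.Tensor:
    # Equation (5): dot product of the mean-pooled representations.
    return V_m.mean(dim=0) @ V_e.mean(dim=0)

def som(V_m: torch.Tensor, V_e: torch.Tensor) -> torch.Tensor:
    # Equation (6): for each mention token, take its best-matching entity
    # token and sum the maxima (ColBERT-style late interaction).
    return (V_m @ V_e.T).max(dim=1).values.sum()
```

Note how SOM touches every pair of hidden states, which is why it is the most expensive of the three.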
Eventually, the network is trained to maximize the score of the correct entity relative to the (randomly sampled) entities of the same batch [25,26]. Concretely, the total loss $\mathcal{L}$ is computed as:
$\mathcal{L}(P_m, P_e) = -\frac{1}{B}\sum_{i=1}^{B} f\left(E_{P_m}(T_{m_i}), E_{P_e}(T_{e_{i,1}})\right) + \frac{1}{B}\sum_{i=1}^{B} \log \sum_{j=1}^{B} \exp f\left(E_{P_m}(T_{m_i}), E_{P_e}(T_{e_{i,j}})\right) \quad (7)$
where $\{(m_i, e_{i,1})\}_{i=1,\dots,B}$ are the golden mention–entity pairs in the training set, and $e_{i,2}, \dots, e_{i,B}$ are the $B-1$ negative entities for the $i$-th mention in a batch.
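Equation (7) is exactly a per-row cross-entropy over the batch score matrix, as the following minimal sketch (our assumption about the implementation, not the released code) makes explicit.

```python
import torch
import torch.nn.functional as F

def in_batch_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: [B, B] matrix where row i scores mention i against all B
    entities in the batch, column i being its golden entity."""
    targets = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy per row reproduces Equation (7): it subtracts the positive
    # score and adds the log-sum-exp over all batch entities, averaged over B.
    return F.cross_entropy(scores, targets)
```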

3.3. Adaptive_Mixup_Hard

It is known that hard negative sampling makes learning more difficult for the model, allowing it to better understand the meaning of entities in different contexts. However, the negative samples obtained under this sampling are still not challenging enough for zero-shot entity linking. Therefore, as shown in Figure 4, we propose adaptive_mixup_hard (AMH) negative sampling, a variant of hard negative sampling that follows a two-stage pipeline: choosing and mixing. This method improves the robustness of the model by fusing the features of positive entities $V_e^{p}$ and negative entities $V_e^{n}$ to obtain more difficult negative samples (strong hard negatives), enabling the model to face more complicated tasks.
Below, we will describe the two-stage process of the AMH negative sampling and present its algorithmic process in Algorithm 1.
Algorithm 1 The process of AMH negative sampling

Require: mention–entity pairs $\{(m_i, e_i)\}_{i=1,\dots,B}$ in a batch; mention encoder $E_{P_m}$; entity encoder $E_{P_e}$; score function $f(\cdot)$; a parameter $\alpha$ controlling the difficulty of synthesizing new negative entities
for $i = 1, 2, \dots, B$ do
    Construct the input for mention $m_i$ and entity $e_i$ as $T_{m_i}$ and $T_{e_i}$ by (1) and (2).
    Obtain the mention and entity embeddings $V_{m_i}$ and $V_{e_i}$ using $E_{P_m}$ and $E_{P_e}$ by (3).
    For mention $m_i$, take its one positive entity $V_{e,1}^{p}$; the remaining entities in the batch are its negative entities $\{V_{e,j}^{n}\}_{j=2,\dots,B}$.
    if $f(\cdot)$ is DUAL then
        Compute the scores of the mention with its negative entities by (4).
    else if $f(\cdot)$ is Pooling Mean then
        Compute the scores by (5).
    else if $f(\cdot)$ is SOM then
        Compute the scores by (6).
    end if
    Select the top $K$ highest-scoring negative entities as the hard negative entities $\{V_{e,j}^{n}\}_{j=1,\dots,K}$ by (8).
    Calculate the adaptive parameter $W \in [0, 1]$ by (9).
    Obtain the final strong hard negative entities $V_{e,j}^{strong}$ by (10).
end for
Choosing: The mention–entity pairs $\{(m_i, e_i)\}_{i=1,\dots,B}$ in a batch are encoded into vectors $\{(V_{m_i}, V_{e_i})\}_{i=1,\dots,B}$. For each mention $m$ in the batch, there is one positive entity $V_{e,1}^{p}$, and the rest are its corresponding negative entities $\{V_{e,j}^{n}\}_{j=2,\dots,B}$. Next, we use the score function $f$ to compute the scores of the mention with its negative entities. Following hard negative sampling, we select the $K$ highest-scoring negative entities as the hard negative entities $\{V_{e,j}^{n}\}_{j=1,\dots,K}$:

$\{V_{e,j}^{n}\}_{j=1,\dots,K} = \mathrm{TopK}\left(f(V_m, V_{e,2}^{n}), \dots, f(V_m, V_{e,B}^{n})\right) \quad (8)$
Mixing: This stage is crucial to our approach and aims to synthesize strong hard negative entities that improve the robustness of the model. Since fusing positive-entity features into the negatives too aggressively at the start of training may be counterproductive, we introduce an adaptive parameter $W \in [0, 1]$, calculated as follows:
$W = \frac{\exp\left(f(V_m, V_{e,1}^{p})\right)}{\exp\left(f(V_m, V_{e,1}^{p})\right) + \sum_{j=1}^{K} \exp\left(f(V_m, V_{e,j}^{n})\right)} \quad (9)$
During training, $W$ increases, progressively raising the difficulty of the negative entities and allowing the model to learn more diverse representations. It is worth noting that $W$ will eventually increase to 1. Therefore, we introduce an additional hyper-parameter $\alpha \in (0, 1]$ to control the difficulty of synthesizing new negative entities. The strong hard negative entity $V_{e,j}^{strong} \in \mathbb{R}^{n_e \times d}$ is computed as:
$V_{e,j}^{strong} = \alpha \cdot W \cdot V_{e,1}^{p} + V_{e,j}^{n} \quad (10)$
It is worth noting that for different score functions, different α values cause the model to perform differently and that there is a critical value at which the model performs best. We will describe this in more detail in Section 4.5.
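Putting the two stages together, a hedged PyTorch sketch of AMH for a single mention might look as follows; it assumes DUAL-style scalar scores are precomputed and that all entity sequences are padded to a common length $n_e$ so that Equation (10) is shape-compatible. Names and shapes are illustrative.

```python
import torch

def amh_negatives(V_pos: torch.Tensor,      # positive entity V^p_{e,1}, [n_e, d]
                  V_negs: torch.Tensor,     # in-batch negatives, [B-1, n_e, d]
                  pos_score: torch.Tensor,  # f(V_m, V^p_{e,1}), scalar
                  neg_scores: torch.Tensor, # f(V_m, V^n_{e,j}), [B-1]
                  K: int, alpha: float) -> torch.Tensor:
    # Choosing (Equation (8)): keep the K highest-scoring (hardest) negatives.
    top = neg_scores.topk(K)
    hard = V_negs[top.indices]                                # [K, n_e, d]
    # Adaptive weight (Equation (9)): the positive's softmax share among the
    # positive and the K hard negatives; grows toward 1 during training.
    W = torch.softmax(torch.cat([pos_score.view(1), top.values]), dim=0)[0]
    # Mixing (Equation (10)): fuse positive features into each hard negative.
    return alpha * W * V_pos.unsqueeze(0) + hard              # [K, n_e, d]
```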

3.4. Biencoder_AMH

Finally, we combine the Biencoder, adaptive_mixup_hard, and a score function to form our new model, Biencoder_AMH. More specifically, using DUAL as the score function yields Biencoder_AMH_DUAL; similarly, the Pooling Mean and SOM score functions yield Biencoder_AMH_Pooling_Mean and Biencoder_AMH_SOM, respectively. This distinction matters because, depending on the score function $f$, our negative sampling strategy AMH applies different scoring rules in the first (choosing) stage.
As shown in Figure 1, we first follow the Biencoder architecture and use BERT [14] to encode mentions with their contexts and entities with their descriptions, obtaining the representations $V_m$ and $V_e$, respectively. Then, following our AMH method, we filter a batch to obtain $K$ hard negative entities $\{V_{e,j}^{n}\}_{j=1,\dots,K}$ and the one positive entity $V_{e,1}^{p}$ corresponding to the current mention $m$, and fuse the features of each hard negative entity with those of the positive entity to obtain $K$ strong hard negative entities $\{V_{e,j}^{strong}\}_{j=1,\dots,K}$. Finally, we input $V_m$, $V_{e,1}^{p}$, and $\{V_{e,j}^{strong}\}_{j=1,\dots,K}$ into the score function $f$ and use BCEWithLogitsLoss to calculate the loss $\mathcal{L}$. Concretely, for each training pair $(m_i, e_i)$ in a batch of $B$ pairs, the loss is computed as:
$\mathcal{L}(m_i, e_i) = -\log\left(\mathrm{sigmoid}\left(f(V_{m_i}, V_{e_i,1}^{p})\right)\right) - \sum_{j=1}^{K} \log\left(1 - \mathrm{sigmoid}\left(f(V_{m_i}, V_{e_i,j}^{strong})\right)\right) \quad (11)$
where $V_{m_i}$ denotes the encoding of the current mention $m_i$, and $V_{e_i,1}^{p}$ and $V_{e_i,j}^{strong}$ denote the encodings of the positive entity and the strong hard negative entities corresponding to $m_i$, respectively. $\mathrm{sigmoid}(\cdot)$ maps the model's output to probability space, facilitating probability estimation and the computation of the cross-entropy loss.
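A minimal sketch of this loss, assuming the positive and strong-hard-negative scores have already been computed per mention; PyTorch's BCEWithLogitsLoss applies the sigmoid of Equation (11) internally, and its default mean reduction matches the equation up to a constant normalization factor.

```python
import torch

bce = torch.nn.BCEWithLogitsLoss()

def biencoder_amh_loss(pos_scores: torch.Tensor,     # [B] f(V_m_i, V^p_{e_i,1})
                       strong_scores: torch.Tensor   # [B, K] f(V_m_i, V^strong_{e_i,j})
                       ) -> torch.Tensor:
    logits = torch.cat([pos_scores.unsqueeze(1), strong_scores], dim=1)  # [B, 1+K]
    labels = torch.zeros_like(logits)
    labels[:, 0] = 1.0               # the positive is labeled 1, negatives 0
    return bce(logits, labels)       # sigmoid + binary cross-entropy
```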

4. Experiments

In this section, we empirically investigate our model on Zeshel [8], a challenging dataset for zero-shot entity linking. We conduct in-depth experiments with all three similarity calculations (DUAL, Pooling Mean, and SOM) [23].

4.1. Dataset

Zeshel is a prevailing benchmark dataset for zero-shot entity linking and contains 16 specialized domains from Wikia, divided into 8 domains for training and 4 domains each for validation and testing. The train, validation, and test sets have 49 K, 10 K, and 10 K examples, respectively. Table 1 shows the details of this dataset, including the number of entities and mentions.

4.2. Evaluation Protocol

EL systems typically follow a two-stage pipeline: (1) a candidate generation stage, training an entity retriever to select the top-k candidate entities for each mention; and (2) a candidate ranking stage, training a ranker to identify the golden entity among selected candidate entities. Candidate generation is critical to the performance of candidate ranking because if no golden entity is retrieved in the top-k candidates, the model can never recover the golden entity during the candidate ranking process. So, we follow the evaluation protocol of the previous work [8,9,11] and evaluate at the candidate generation stage. We report accuracy and top-64 recall for models on the validation and testing sets.
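For clarity, the top-k recall reported at the candidate generation stage can be computed as in the small sketch below (our own helper; names are illustrative).

```python
from typing import List

def recall_at_k(ranked_candidates: List[List[str]], golds: List[str], k: int = 64) -> float:
    """Fraction of mentions whose golden entity appears in the top-k candidates."""
    hits = sum(gold in cands[:k] for cands, gold in zip(ranked_candidates, golds))
    return hits / len(golds)

# Example: r@2 for two mentions; the second gold entity is missed.
print(recall_at_k([["e3", "e1", "e7"], ["e2", "e9", "e4"]], ["e1", "e4"], k=2))  # 0.5
```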

4.3. Implementation Details

We use BERT-base [14] for our models and the preprocessed dataset provided by [9]. We tune the models over {5, 10, 15, 30} epochs with a batch size of 64 mention–entity pairs. We fix the learning rate at $2 \times 10^{-5}$. Owing to time and computational costs, we perform a grid search only over $\alpha \in \{0.1, 0.2, 0.3, \dots, 1\}$ for DUAL and Pooling Mean and $K \in \{1, 2, 4, 6, 8, 10, 12, 14, 16, 18\}$ for SOM; the optimal hyper-parameter values are shown in Table 2. Our models are implemented in PyTorch and optimized with Adam [27]. All models are trained on one NVIDIA RTX 3090 (24 GB), and the results are averaged over 3 runs with different random seeds.

4.4. Performance Comparison

In this section, we compare our model against recent work [8,9] and different negative sampling methods for candidate generation, and we additionally compare accuracy. These works use the Biencoder and generate negative entities for optimization. The DUAL, Pooling Mean, and SOM scorers are employed throughout.

4.4.1. Main Results

The model comparison results are shown in Table 3. Hard and mixed negatives always yield considerable improvements over random negatives for all architectures. It is worth noting that, in top-64 recall, our negative sampling strategy outperforms all three existing strategies. More specifically, with the DUAL scorer on the test set, our strategy improves over random negative sampling by 3.65%, over hard negative sampling by 1.31%, and over mixed negative sampling by 1.36%. With the Pooling Mean scorer on the test set, our method improves over random negative sampling by 3.69%, over hard negative sampling by 1.81%, and over mixed negative sampling by 2.76%. The improvement with the SOM scorer on the test set is particularly significant: 2.04% over random negative sampling, 1.28% over hard negative sampling, and 2.21% over mixed negative sampling. These results indicate the effectiveness of our sampling strategy. We also find that SOM yields better results than DUAL and Pooling Mean, while hard sampling leads to better optimization; however, SOM is more expensive and time-consuming to compute.
We also take accuracy into account. Our method achieves better performance than the other negative sampling strategies for DUAL and Pooling Mean; for SOM, however, hard negative sampling performs better. In addition, in terms of accuracy, DUAL and Pooling Mean yield better results than SOM, and Pooling Mean is slightly better on the test set.

4.4.2. Domain Zero-Shot Performance

Our main results show that we achieved the best performance on both validation and testing sets with respect to the top-64 recalls. To show that this improvement is true for all test domains and not a result of a specific test domain, we show more fine-grained results. Specifically, we report the domain zero-shot performance on the testing sets over different choices of architecture and negative examples. Table 4 shows the results for the different testing domains.
The results on the test domains “Forgotten Realms” and “Lego” are better than those on the other two domains. For DUAL, our method exhibits the best performance in all domains. However, for Pooling Mean, our method is 0.67% lower than hard negative sampling on the “Lego” domain, and for SOM it is 0.08% lower than random negative sampling on the “Forgotten Realms” domain. Overall, our method shows better results across the different ways of calculating similarity, and the results on the “Star Trek” and “YuGiOh” domains gain a significant boost, reaching 85.07% and 78.25%, respectively.

4.5. Impact of α

In this section, we investigate the sensitivity of DUAL and Pooling Mean to the hyper-parameter $\alpha$. Recall that $\alpha$ restricts the strong hard negative sample produced by fusing the current negative sample with the positive sample, where $\alpha$ ranges between 0 and 1. A smaller $\alpha$ means the strong hard negative sample is slightly weaker, and vice versa. However, neither stronger nor weaker is always better: for each similarity calculation structure, there exists an intermediate value in this range at which performance is optimal. Table 5 and Table 6 show the results obtained under different $\alpha$ for DUAL and Pooling Mean, which we analyze below.
In Table 5, we find that DUAL achieves the best top-64 recall with $\alpha = 0.3$ and the best accuracy with $\alpha = 0.7$ on the test set; meanwhile, as $\alpha$ increases further, top-64 recall decreases. In Table 6, we observe that Pooling Mean achieves the best top-64 recall with $\alpha = 0.3$ and the best accuracy with $\alpha = 0.8$ on the test set. After tuning $\alpha$, Pooling Mean still performs better than DUAL in accuracy, by 0.29%, while DUAL is better in top-64 recall, by 0.08%. At the same time, some values of $\alpha$ yield results about 1% below the best-performing setting, indicating the importance of controlling $\alpha$. Considering time and computational cost, we do not analyze SOM under different $\alpha$.

4.6. Impact of K

In this section, we investigate the sensitivity of SOM to the hyper-parameter $K$, which in our method denotes the number of negative samples formed after fusion with positive sample features. We selected only some values of $K$ to compare results on the test set; Table 7 shows the specific results. As with the hyper-parameter $\alpha$ introduced in the previous section, there exists an intermediate value for the number of strong negative samples $K$ at which SOM performs best, and we expect the same to hold for the other two similarity computation structures (DUAL and Pooling Mean).
In Table 7, we find that SOM achieves the best top-64 recall with $K = 10$ and the best accuracy with $K = 16$ on the test set. In addition, compared with the highest test accuracy of 32.74% shown for SOM under hard negative sampling in Table 3, the model achieves 34.15% at $K = 16$, a 1.41% improvement. Thus, by adjusting the value of $K$ for SOM, our method also outperforms the other negative sampling strategies in accuracy. It follows that controlling the value of $K$ is particularly important for our approach.

4.7. Analyzing the Number of Candidates

In a two-stage entity linking system, the number of retrieved candidates influences overall model performance. Previous work has typically used a fixed number of $k$ candidates, with $k$ ranging from 5 to 100 (for instance, Refs. [5,17] choose k = 30 and Ref. [8] chooses k = 64). According to [9] and Table 8, as $k$ grows, recall increases; however, ranking-stage accuracy is likely to decrease. Further, increasing $k$ often increases the run-time of the ranking stage.
In the following, we will analyze the importance of the number of candidate entities k in three aspects.
(i) Accuracy: A smaller number of candidate entities improves the accuracy of the system in selecting the correct entity. If there are too many candidate entities, the system may experience difficulties, because many similar entities make it hard for the model to make the correct selection.
(ii) Coverage: An increase in the number of candidate entities can increase the coverage of the system. In zero-shot entity linking, certain entities may not be covered in the candidate entity set, which can result in those entities not being linked correctly. A more comprehensive candidate entity set improves the chances of successful linking, especially for entities that are rare or absent in the training data.
(iii) Efficiency: The number of candidate entities also affects the efficiency of the linking process. Fewer candidate entities mean the system needs less time and fewer computational resources for entity linking.
However, increasing the number of candidate entities may also introduce noise and interference and reduce entity linking accuracy. Therefore, a balance needs to be found between the number of candidate entities and the accuracy of entity linking. As shown in Table 8, our method shows better performance on r@k (k = 4, 8, 16, 32, 64) than the other negative sampling strategies for SOM, which lays a solid foundation for future research on the number of candidate entities k in the entity ranking stage.

5. Discussion

Regarding current negative sampling methods in the zero-shot EL domain, we note that these methods typically employ an extractive strategy, resulting in a lack of diversity in the selected negative samples and thus limiting the model's ability to acquire richer knowledge. Furthermore, hard negative sampling aims to select more challenging negative samples, exposing the model to more demanding tasks; however, the negatives selected by current hard negative sampling strategies are not challenging enough. Therefore, we propose adaptive_mixup_hard, a generative negative sampling method built on hard negative sampling. It introduces a transformable adaptive parameter W, enabling the model to generate rich and diverse negative samples, and generates more difficult negatives by fusing the positive sample features of the current mention with negative sample features.
However, due to limited hardware resources, we could only set the batch size to 64. For reasons of time and computational cost, we did not compare our method with other negative sampling methods under the Poly-Encoder [26] and Multi-Vector Encoder [28] architectures. In addition, during the hyper-parameter experiments, we did not perform a comprehensive search for all score functions. Finally, when fusing the features of positive and negative samples, we currently control only the weight of the positive sample features, which raises the question of whether the weights of the negative samples should be controlled simultaneously. Future work therefore includes:
  • Enhance the reliability of our method by comparing it with other negative sampling methods under the Poly-Encoder and Multi-Vector Encoder architectures.
  • Analyze our method in terms of gradient and loss to make it theoretically interpretable.
  • Control the weights of positive and negative samples simultaneously for further improvement.
  • Research into end-to-end zero-shot entity linking models: joint models for mention detection and entity linking.

6. Conclusions

We introduce adaptive_mixup_hard (AMH) negative sampling, a variant of hard negative sampling, to obtain more difficult negative samples and thereby improve the robustness and performance of the model on zero-shot entity linking. Furthermore, we combine our negative sampling method with the Biencoder architecture to form our new model, Biencoder_AMH, and test its performance with three different score functions (DUAL, Pooling Mean, and SOM). Our work compares our method with other negative sampling methods (Random, Hard, and Mixed-p) under the same score functions. Our experimental analysis demonstrates that our approach generally performs better on both the validation and testing sets. More importantly, this improvement holds for essentially all test domains and is not the result of a specific test domain. In addition, we conduct extensive experiments to optimize the hyper-parameters α and K in our method and demonstrate that our method provides a solid foundation for the subsequent entity ranking stage.

Author Contributions

Conceptualization, S.C. and X.W.; methodology, S.C.; software, S.C.; validation, S.C., X.W., Y.C. and Z.W.; formal analysis, S.C.; investigation, X.W.; resources, S.C.; data curation, Y.C. and Z.W.; writing—original draft preparation, S.C.; writing—review and editing, S.C.; visualization, X.W., Y.C. and Z.W.; supervision, M.M. and J.Z.; project administration, M.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Xinjiang Uyghur Autonomous Region under Grant No. 2021D01C079.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/lajanugen/zeshel (accessed on 20 October 2022).

Acknowledgments

The authors would like to thank the anonymous reviewers for their contribution to this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, B.Z.; Min, S.; Iyer, S.; Mehdad, Y.; Yih, W.-t. Efficient One-Pass End-to-End Entity Linking for Questions. arXiv 2020, arXiv:2010.02413. [Google Scholar]
  2. Ayoola, T.; Tyagi, S.; Fisher, J.; Christodoulopoulos, C.; Pierleoni, A. ReFinED: An Efficient Zero-shot-capable Approach to End-to-End Entity Linking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, Seattle, WA, USA, 10–15 July 2022; pp. 209–220. [Google Scholar] [CrossRef]
  3. Lin, T.; Mausam; Etzioni, O. Entity linking at web scale. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-Scale Knowledge Extraction (AKBC-WEKEX) 2012, Montreal, QC, Canada, 7–8 June 2012; pp. 84–88. [Google Scholar]
  4. Weng, J.; Lim, E.P.; Jiang, J.; He, Q. TwitterRank: Finding Topic-Sensitive Influential Twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining WSDM ’10, New York, NY, USA, 4–6 February 2010; pp. 261–270. [Google Scholar] [CrossRef]
  5. Ganea, O.E.; Hofmann, T. Deep Joint Entity Disambiguation with Local Neural Attention. arXiv 2017, arXiv:1704.04920. [Google Scholar]
  6. Cao, Y.; Hou, L.; Li, J.; Liu, Z. Neural collective entity linking. arXiv 2018, arXiv:1811.08603. [Google Scholar]
  7. Sil, A.; Cronin, E.; Nie, P.; Yang, Y.; Popescu, A.M.; Yates, A. Linking Named Entities to Any Database. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Republic of Korea, 12–14 July 2012; pp. 116–127. [Google Scholar]
  8. Logeswaran, L.; Chang, M.W.; Lee, K.; Toutanova, K.; Devlin, J.; Lee, H. Zero-shot entity linking by reading entity descriptions. arXiv 2019, arXiv:1906.07348. [Google Scholar]
  9. Wu, L.; Petroni, F.; Josifoski, M.; Riedel, S.; Zettlemoyer, L. Scalable Zero-shot Entity Linking with Dense Entity Retrieval. arXiv 2020, arXiv:1911.03814. [Google Scholar]
  10. Yao, Z.; Cao, L.; Pan, H. Zero-shot Entity Linking with Efficient Long Range Sequence Modeling. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020. [Google Scholar] [CrossRef]
  11. Zhang, W.; Stratos, K. Understanding hard negatives in noise contrastive estimation. arXiv 2021, arXiv:2104.06245. [Google Scholar]
  12. Tang, H.; Sun, X.; Jin, B.; Zhang, F. A bidirectional multi-paragraph reading model for zero-shot entity linking. In Proceedings of the AAAI Conference on Artificial Intelligence 2021, Virtually, 2–9 February 2021; Volume 35, pp. 13889–13897. [Google Scholar]
  13. Sun, K.; Zhang, R.; Mensah, S.; Mao, Y.; Liu, X. A Transformational Biencoder with In-Domain Negative Sampling for Zero-Shot Entity Linking. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 1449–1458. [Google Scholar] [CrossRef]
  14. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  15. He, Z.; Liu, S.; Li, M.; Zhou, M.; Zhang, L.; Wang, H. Learning Entity Representation for Entity Disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 30–34. [Google Scholar]
  16. Sun, Y.; Lin, L.; Tang, D.; Yang, N.; Ji, Z.; Wang, X. Modeling mention, context and entity with neural networks for entity disambiguation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence 2015, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  17. Yamada, I.; Shindo, H.; Takeda, H.; Takefuji, Y. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, 11–12 August 2016; pp. 250–259. [Google Scholar] [CrossRef]
  18. Kolitsas, N.; Ganea, O.E.; Hofmann, T. End-to-end neural entity linking. arXiv 2018, arXiv:1808.07699. [Google Scholar]
  19. Raiman, J.; Raiman, O. Deeptype: Multilingual entity linking by neural type system evolution. In Proceedings of the AAAI Conference on Artificial Intelligence 2018, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  20. Onoe, Y.; Durrett, G. Fine-grained entity typing for domain independent entity linking. In Proceedings of the AAAI Conference on Artificial Intelligence 2020, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8576–8583. [Google Scholar]
  21. Khalife, S.; Vazirgiannis, M. Scalable graph-based individual named entity identification. arXiv 2018, arXiv:1811.10547. [Google Scholar]
  22. Gillick, D.; Kulkarni, S.; Lansing, L.; Presta, A.; Baldridge, J.; Ie, E.; Garcia-Olano, D. Learning Dense Representations for Entity Retrieval. arXiv 2019, arXiv:1909.10506. [Google Scholar]
  23. Khattab, O.; Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2020, Xi’an, China, 25–30 July 2020; pp. 39–48. [Google Scholar]
  24. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  25. Lerer, A.; Wu, L.; Shen, J.; Lacroix, T.; Wehrstedt, L.; Bose, A.; Peysakhovich, A. PyTorch-BigGraph: A Large-scale Graph Embedding System. arXiv 2019, arXiv:1903.12287. [Google Scholar]
  26. Humeau, S.; Shuster, K.; Lachaux, M.A.; Weston, J. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. arXiv 2020, arXiv:1905.01969. [Google Scholar]
  27. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  28. Luan, Y.; Eisenstein, J.; Toutanova, K.; Collins, M. Sparse, dense, and attentional representations for text retrieval. Trans. Assoc. Comput. Linguist. 2021, 9, 329–345. [Google Scholar] [CrossRef]
Figure 1. Biencoder_AMH consists of three main parts: Biencoder, adaptive_mixup_hard, and score function. From the bottom, the encoded representations of mention and entities are obtained separately through the Biencoder architecture. Then, K strong negative entities are obtained by AMH negative sampling and concatenated with the positive entity in the zeroth dimension. Finally, the score function is used to compute the similarity score with mention and calculate the corresponding loss.
Figure 2. Architecture of Biencoder.
Figure 3. Architecture of Score Function.
Figure 4. Architecture of adaptive_mixup_hard.
Table 1. Statistics of the Zeshel dataset.

| Domains | Entities | Mentions (Train) | Mentions (Evaluation) |
|---|---|---|---|
| Training | | | |
| American Football | 31,929 | 3898 | 743 |
| Doctor Who | 40,281 | 8334 | 1521 |
| Fallout | 16,992 | 3286 | 593 |
| Final Fantasy | 14,044 | 6041 | 1156 |
| Military | 104,520 | 13,063 | 2764 |
| Pro Wrestling | 10,133 | 1392 | 262 |
| Star Wars | 87,056 | 11,824 | 2706 |
| World of Warcraft | 27,677 | 1437 | 255 |
| Validation | | | |
| Coronation Street | 17,809 | 0 | 1464 |
| Muppets | 21,344 | 0 | 2028 |
| Ice Hockey | 28,684 | 0 | 2233 |
| Elder Scrolls | 21,712 | 0 | 4275 |
| Testing | | | |
| Forgotten Realms | 15,603 | 0 | 1200 |
| Lego | 10,076 | 0 | 1199 |
| Star Trek | 34,430 | 0 | 4227 |
| YuGiOh | 10,031 | 0 | 3374 |
Table 2. Optimal observed hyper-parameter configurations of Biencoder_AMH.

| Model | Batch Size | Learning Rate | Max Context Tokens | α | K | Epochs |
|---|---|---|---|---|---|---|
| DUAL | 64 | 2 × 10⁻⁵ | 128 | 0.3 | 1 | 30 |
| Pooling Mean | 64 | 2 × 10⁻⁵ | 128 | 0.3 | 1 | 30 |
| SOM | 64 | 2 × 10⁻⁵ | 128 | 0.3 | 10 | 30 |
Table 3. Accuracy and top-64 recalls over different choices of architecture and negative examples.

| Model | Negatives | Val r@1 | Val r@64 | Test r@1 | Test r@64 |
|---|---|---|---|---|---|
| BM25 | - | - | 76.22 | - | 69.13 |
| Wu et al. (2020) [9] | - | 42.79 | 89.36 | 40.82 | 79.13 |
| DUAL | Random | 40.47 | 87.73 | 38.20 | 77.82 |
| | Hard | 42.94 | 89.48 | 40.45 | 80.16 |
| | Mixed-50 | 43.10 | 89.29 | 40.75 | 80.11 |
| | Ours | 45.30 | 90.01 | 42.90 | 81.47 |
| Pooling Mean | Random | 36.22 | 86.04 | 35.71 | 77.70 |
| | Hard | 40.19 | 88.14 | 39.09 | 79.58 |
| | Mixed-50 | 40.42 | 88.25 | 38.65 | 78.63 |
| | Ours | 43.93 | 89.36 | 43.10 | 81.39 |
| SOM | Random | 23.19 | 90.10 | 19.45 | 83.01 |
| | Hard | 30.55 | 91.14 | 32.74 | 83.77 |
| | Mixed-50 | 15.91 | 91.34 | 24.84 | 82.84 |
| | Ours | 25.56 | 91.83 | 31.19 | 85.05 |

Bolded data indicates the best results.
Table 4. Top-64 recalls on different domains over different choices of architecture and negative examples.

| Model | Negatives | Forgotten Realms | Lego | Star Trek | YuGiOh |
|---|---|---|---|---|---|
| DUAL | Random | 89.42 | 88.82 | 80.32 | 66.66 |
| | Hard | 91.25 | 89.24 | 83.06 | 69.35 |
| | Mixed-50 | 91.08 | 88.99 | 83.04 | 69.38 |
| | Ours | 91.83 | 89.49 | 85.05 | 70.45 |
| Pooling Mean | Random | 89.50 | 88.24 | 79.87 | 67.04 |
| | Hard | 91.33 | 89.66 | 82.11 | 68.64 |
| | Mixed-50 | 91.00 | 88.49 | 81.90 | 66.63 |
| | Ours | 92.00 | 88.99 | 84.36 | 71.19 |
| SOM | Random | 94.33 | 92.99 | 82.49 | 76.08 |
| | Hard | 94.25 | 92.91 | 83.84 | 76.70 |
| | Mixed-50 | 93.67 | 92.58 | 82.38 | 76.11 |
| | Ours | 94.25 | 93.08 | 85.07 | 78.25 |

Bolded data indicates the best results.
Table 5. Accuracy and top-64 recalls for different α under adaptive_mixup_hard for DUAL.

| | | α = 0.1 | α = 0.2 | α = 0.3 | α = 0.4 | α = 0.5 | α = 0.6 | α = 0.7 | α = 0.8 | α = 0.9 | α = 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Val | r@1 | 44.37 | 43.96 | 45.30 | 45.33 | 45.70 | 45.19 | 46.57 | 45.37 | 46.59 | 44.68 |
| | r@64 | 90.18 | 89.68 | 90.01 | 89.75 | 89.76 | 89.71 | 89.61 | 89.47 | 89.90 | 89.22 |
| Test | r@1 | 42.46 | 42.56 | 42.90 | 42.60 | 43.38 | 43.24 | 43.80 | 42.76 | 43.69 | 42.43 |
| | r@64 | 80.94 | 81.06 | 81.47 | 80.95 | 81.46 | 81.13 | 80.98 | 80.55 | 80.73 | 80.57 |

Bolded data indicates the best results.
Table 6. Accuracy and top-64 recalls for different α under adaptive_mixup_hard for Pooling Mean.

| | | α = 0.1 | α = 0.2 | α = 0.3 | α = 0.4 | α = 0.5 | α = 0.6 | α = 0.7 | α = 0.8 | α = 0.9 | α = 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Val | r@1 | 43.28 | 44.74 | 43.93 | 43.67 | 43.28 | 42.82 | 44.80 | 45.79 | 45.35 | 44.16 |
| | r@64 | 89.43 | 89.40 | 89.36 | 89.30 | 89.02 | 88.92 | 89.12 | 89.28 | 89.57 | 89.00 |
| Test | r@1 | 41.41 | 42.36 | 43.10 | 42.69 | 42.23 | 42.05 | 43.02 | 44.09 | 43.47 | 42.31 |
| | r@64 | 80.70 | 80.88 | 81.39 | 80.91 | 80.75 | 80.17 | 80.19 | 80.65 | 80.37 | 80.20 |

Bolded data indicates the best results.
Table 7. Accuracy and top-64 recalls for different K under adaptive_mixup_hard for SOM.

| | | K = 1 | K = 2 | K = 4 | K = 6 | K = 8 | K = 10 | K = 12 | K = 14 | K = 16 | K = 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Test | r@1 | 29.98 | 28.89 | 31.70 | 32.47 | 31.00 | 31.19 | 32.90 | 31.36 | 34.15 | 31.81 |
| | r@64 | 82.65 | 83.39 | 84.11 | 84.70 | 83.91 | 85.05 | 83.83 | 83.28 | 84.83 | 83.59 |

Bolded data indicates the best results.
Table 8. Top-k recalls over different choices of negative examples for SOM.

| Model | Negatives | r@4 | r@8 | r@16 | r@32 | r@64 |
|---|---|---|---|---|---|---|
| SOM | Random | 50.92 | 62.28 | 71.41 | 77.69 | 83.01 |
| | Hard | 62.61 | 70.30 | 76.18 | 80.02 | 83.77 |
| | Mixed-50 | 54.22 | 65.03 | 72.97 | 78.38 | 82.84 |
| | Ours | 62.94 | 71.08 | 76.76 | 80.81 | 84.83 |

Bolded data indicates the best results.