1. Introduction
Metonymy is a common figurative language phenomenon in which the name of a thing is substituted with one of its closely associated attributes (e.g., producer-for-product, place-for-event, place-for-inhabitant). This linguistic phenomenon [1,2] is pervasive in daily life and literature. For example, named entities in text are often used metonymically to imply an irregular denotation.
Identifying ambiguities in metonymy is a fundamental process in many NLP applications such as relation extraction [3], machine comprehension [4], and other downstream tasks [5]. The following sentence shows an example of metonymy: “Last year, Ma acquired Eleme”. In this example, the meaning of “Ma” under this particular circumstance has changed: it is more appropriate to interpret “Ma” as “Ma’s company Alibaba” instead of the literal concept of “a famous entrepreneur”.
In the literature, conventional methods for MR mainly rely on features derived from lexical resources such as dictionaries, taggers, parsers, WordNet, or other handcrafted lexical resources. At present, more researchers are focusing on deep neural networks (DNN) [6,7], which are becoming the mainstream approach to handle various tasks in NLP, including metonymy resolution [8]. DNN models can effectively encode all words in a sentence in a sequential way to obtain the semantic representation of the text, easily capture contextual information throughout the whole sentence, and achieve state-of-the-art performance. Moreover, pre-trained language representations have demonstrated their efficiency in improving many NLP tasks, e.g., text classification [9], event extraction [10], and relation extraction [11]. Benefiting from contextual representations, models such as ELMo [14,15] and BERT [16] significantly surpass competitive neural models in many NLP tasks [12,13]. Among these representations, the pre-trained model BERT has an especially significant impact, having improved results on 11 NLP tasks. Compared to conventional DNN methods, BERT provides a simple way to accurately and effectively obtain sentence-level semantic representations, which greatly benefits the task of MR. However, there are still shortcomings in using pre-trained language representations such as BERT for metonymy resolution. From a linguistic point of view, BERT is relatively weak in at least two respects:
Facing the entity-targeted task of MR, BERT has no proper way to encode explicit information about the entities. To compensate for the absence of entity information in BERT, the proposed entity-perceptive method can effectively improve performance under the joint guidance of entity and context knowledge. Based on this, Entity BERT with entity awareness is constructed. The model applies BERT to jointly train the neural network on the entity and the context and integrates them into a better word embedding. In this way, it can effectively use entity knowledge to improve precision and recall.
Compared with sequence-based models, dependency-based models can capture non-local syntactic relations that are clues for the inference of the true meaning of entity nominals. Thus, selectively integrating the dependency relations to alleviate the lack of syntax in BERT is necessary.
Figure 1 shows an example of a dependency parse tree with multiple dependency relations. Given a sentence, it is convenient to obtain the dependency relations between two words by launching an off-the-shelf dependency parser from standard NLP toolkits. For example, in Figure 1, the “nsubj” relation between “Ma” and “acquired” strongly indicates the role “Ma” plays in the sentence, while the “nmod” relation between “Last year” and “acquired” contributes little. However, most MR models treat these relations as equivalent and do not distinguish them, leading to misunderstandings of the lexicon and incorrect semantic representations. Dependency trees convey rich structural information that has proven helpful for extracting relations among entities in text. Therefore, in this paper, dependency relations are used as prior knowledge to supervise the model in learning metonymies. Soft structural constraints based on dependency parse trees are imposed on state-of-the-art pre-trained language representations such as BERT. This work builds on a rich line of recent efforts on relation extraction models and graph convolutional networks (GCN). Differing from previous works that either directly use dependency parse trees or concatenate a transition vector based on dependency parse trees, this work employs an attention-guided GCN (AGCN) to integrate information. The outputs of the BERT layer are passed to the attention-guided GCN to extract syntactic information through dependency parsing. Then both syntactic and semantic information are fed into the attention-guided GCN integration component, which learns how to selectively attend to the relevant structures valid for MR. Therefore, the model takes advantage of the relevant dependency ancestors and efficiently removes the noise from irrelevant chunks. Finally, the sentence-level representations and the outputs of the attention-guided GCN are combined as the input to a multi-layer neural network for classification.
Considering entity awareness and dependency information both benefit MR, this paper presents EBAGCN (Entity BERT with Attention-guided GCN) to leverage entity constraints from sequence-based pre-trained language representations and soft dependency constraints from dependency trees at the same time.
As a result, the proposed approach is superior in capturing both the semantics of the sentence and long-distance dependency relations, which further benefits MR. To summarize, the main contributions of this paper are as follows:
Propose a novel method for metonymy resolution relying on recent advances in pre-trained language representations that integrates entity knowledge and significantly improves accuracy.
Incorporate an attention-guided GCN into MR with hard/soft dependency constraints, which imposes prior syntactic knowledge on the pre-trained language representations.
Experiments on two benchmark datasets show that the proposed model EBAGCN is significantly superior to previous works and improves over BERT-based baseline systems.
2. Related Work
Metonymy Resolution Analysis of metonymy as a linguistic phenomenon dates back to feature-based methods [17,18]. Ref. [19] use an SVM with a robust knowledge base built from Wikipedia and achieve the best result among all feature-based methods. These methods inevitably suffer from error propagation because of their dependence on manual feature extraction. Furthermore, such feature engineering takes much effort while helpful information is hard to capture. As a result, the performance is not satisfactory.
Recently, the majority of works on MR have focused on deep neural networks (DNN). Ref. [8] propose PreWin based on the Long Short-Term Memory (LSTM) architecture. This work is the first to integrate structural knowledge into MR. However, the effect is limited because of the predicate window setting, which only retains the words around the predicate and loses much important information in the process. Other approaches [20] leverage NER and POS features on LSTM to enrich the token representations. However, these methods only improve slightly since they only pay attention to independent tokens and ignore the relations between tokens in a sentence that are beneficial to MR.
Pre-trained language models have shown great success in many NLP tasks. In particular, BERT, proposed by [16], has a significant impact. Intuitively, it is natural to introduce pre-trained models to MR. These models vastly outperform conventional DNN models and reach the top of the leaderboard in MR. Nevertheless, pre-trained models encode the whole sentence to catch contextual features, ignoring the syntactic features in sentences. Ref. [21] fine-tunes BERT with target word masking and data augmentation to detect metonymy more accurately.
In addition to the context information provided by sequence-based models, ref. [22] pays attention to the entity word, disambiguating every word in a sentence by reformulating metonymy detection as a sequence labeling task and investigating the impact of entity and context on metonymy detection.
Dependency Constraints Integration Research on MR so far has made limited use of dependency trees. However, research on other NLP classification tasks widely employs dependency information. Differing from traditional sequence-based models, dependency-based models integrate dependency information [23], taking advantage of dependency relations that are obscure from the surface form alone.
As the effect of dependency information is widely recognized, more attention is paid to pruning strategies (i.e., how to distill syntactic information from dependency trees efficiently). Ref. [3] use the shortest dependency path between the entities in the full tree. Ref. [24] apply a graph convolutional network [25] model on a pruned tree with a novel pruning strategy that retains the words immediately around the shortest path between the entities among which a relation might hold. Although these hard-pruning methods efficiently remove irrelevant relations based on predefined rules, they risk wrongly eliminating useful information at the same time.
More recently, ref. [26] proposed AGGCN, which employs a soft-pruning strategy. The method assigns weights to the dependency relations with a multi-head attention mechanism to balance relevant and irrelevant information. Ref. [27] proposed a dependency-driven approach for relation extraction with attentive graph convolutional networks (A-GCN). In this approach, an attention mechanism in graph convolutional networks is applied to different contextual words in the dependency tree, obtained from an off-the-shelf dependency parser, to distinguish the importance of different word dependencies.
Pre-trained Model The idea of pre-training originated in the field of computer vision and was later adopted in NLP. Pre-trained word vectors are the most common application of pre-training in NLP. The annotated corpus in many NLP tasks is very limited and not enough to train excellent word vectors. Therefore, a large-scale unannotated corpus unrelated to the current task is usually used to pre-train word vectors. At present, many deep learning models tend to use pre-trained word vectors (such as Word2Vec [28] and GloVe [29]) for initialization to accelerate the convergence of the network.
To consider contextual information when computing word vectors, pre-trained models such as Context2Vec [30] and ELMo [15] were developed and achieved good results. BERT is a model trained directly on a deep Transformer network. It achieved the best results in many downstream NLP tasks through pre-training and fine-tuning [31]. Different from other deep learning models, BERT jointly conditions on context in all layers during pre-training to obtain a bi-directional representation of each token. BERT largely solves the representation difficulty by fine-tuning its output for a specific task. Compared with recurrent neural networks, BERT, relying on the Transformer, can capture long-distance dependencies more effectively and has a more accurate semantic understanding of each token in the current context.
Graph Convolutional Network DNN models have achieved great success in both CV and NLP. As a representative deep learning model, the convolutional neural network handles data with regular spatial structure, but much data has no such structure; the graph convolutional network (GCN) arose to address this. GCN is a widely used architecture to encode the information in a graph, where in each GCN layer the information in each node is communicated to its neighbors through the connections between them. The effectiveness of GCN models for encoding the contextual information over a graph of an input sentence has been demonstrated by many previous studies [32,33].
3. The Proposed Model
This section presents the basic components used for constructing the model. The overall architecture of the proposed model is shown in
Figure 2.
3.1. BERT Encoding Unit
The pre-trained language model BERT is a multi-layer bidirectional Transformer encoder designed to pre-train deep bidirectional representations by conditioning on both left and right context. This unit uses the BERT encoder to model sentences and output fine-tuned contextual representations. It takes a sentence $S$ as input and computes a context-aware representation for each token. Concretely, the input is packed as $X = (\mathrm{[CLS]}, x_1, \dots, x_n, \mathrm{[SEP]})$, where $\mathrm{[CLS]}$ is a special token for classification, $(x_1, \dots, x_n)$ is the token sequence of $S$ generated by a WordPiece tokenizer, and $\mathrm{[SEP]}$ is the token indicating the end of a sentence. For each hidden representation $h_i^{0}$ at index $i$, the initial token embedding $e_i^{tok}$ is concatenated with the positional embedding $e_i^{pos}$:
$$h_i^{0} = [\, e_i^{tok};\, e_i^{pos} \,]$$
After going through $N$ successive Transformer encoder blocks, the encoder generates a context-aware representation for each token as the output of this unit, represented as $H$:
$$H = (h_{\mathrm{[CLS]}}, h_1, \dots, h_n, h_{\mathrm{[SEP]}})$$
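For concreteness, the snippet below is a minimal sketch of this encoding step, assuming the Hugging Face transformers library and the bert-large-cased checkpoint (both are assumptions; the paper does not name its BERT implementation).

```python
# Minimal sketch of the BERT encoding unit using Hugging Face transformers
# (an assumed implementation, not the authors' code).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
encoder = BertModel.from_pretrained("bert-large-cased")

sentence = "Last year, Ma acquired Eleme"
# The tokenizer packs the input as [CLS] x_1 ... x_n [SEP] automatically.
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

H = outputs.last_hidden_state   # (1, seq_len, hidden_dim): one vector per token
h_cls = H[:, 0, :]              # representation of [CLS], used later as the context vector
print(H.shape, h_cls.shape)
```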
3.2. Syntactic Integration Unit
As shown in Figure 3, the Syntactic Integration Unit is designed to integrate syntax into BERT and is the most crucial component of this approach. In a multi-layer GCN, the node representation $h_i^{(l)}$ is produced by applying a graph convolution operation in layers from $1$ to $L$, described as follows:
$$h_i^{(l)} = \rho\Big(\sum_{j=1}^{n} A_{ij}\, W^{(l)} h_j^{(l-1)} + b^{(l)}\Big)$$
where $W^{(l)}$ represents the weight matrix, $b^{(l)}$ stands for the bias vector, and $\rho$ is an activation function. $h_j^{(l-1)}$ and $h_i^{(l)}$ are the hidden states in the prior and current layer, respectively.
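The graph convolution above can be sketched in a few lines of PyTorch; the class and parameter names below (GCNLayer, in_dim, out_dim) are illustrative and not taken from the authors' code.

```python
# A minimal sketch of one GCN layer as described above: node i aggregates its
# neighbours through the adjacency matrix A, followed by a linear map and an activation.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)   # holds W^(l) and b^(l)
        self.act = nn.ReLU()                       # the activation rho

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (batch, n_nodes, in_dim)  hidden states from the previous layer
        # adj: (batch, n_nodes, n_nodes) adjacency matrix (hard 0/1 or soft weights)
        agg = torch.bmm(adj, h)                    # sum_j A_ij * h_j^(l-1)
        return self.act(self.linear(agg))          # rho(W^(l) agg + b^(l))

# Usage: stack several layers, feeding the BERT outputs H as the initial node states.
layer = GCNLayer(1024, 1024)
h = torch.randn(2, 10, 1024)
adj = torch.eye(10).unsqueeze(0).repeat(2, 1, 1)
out = layer(h, adj)                                # (2, 10, 1024)
```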
Syntactic Integration Unit contains attention guided layer, densely connected layer, and linear combination layer.
Attention Guided Layer Most existing methods adopt hard dependency relations (i.e., 1 and 0 denote whether a relation exists or not) to impose syntactic constraints. However, these methods require a pre-defined pruning strategy based on expert experience and simply set the dependency relations considered “irrelevant” to zero weight (not attended). These rules may bias representations, especially toward a larger dependency graph. In contrast, the attention guided layer applies a “soft pruning” strategy. This layer generates attention guided adjacency matrices $\tilde{A}^{(t)}$ whose weights range from 0 to 1 by multi-head attention [34]. The shape of $\tilde{A}^{(t)}$ is the same as that of the original adjacency matrix $A$ for convenience. Precisely, $\tilde{A}^{(t)}$ is calculated as follows:
$$\tilde{A}^{(t)} = \mathrm{softmax}\Big(\frac{Q W_t^{Q} \times \big(K W_t^{K}\big)^{\top}}{\sqrt{d}}\Big)$$
where $Q$, $K$, $V$ are, respectively, the query, key, and value in multi-head attention; $Q$ and $K$ are both equal to the input representation $h$ (i.e., the output of the prior module); $d$ is the dimension of $h$; $W_t^{Q}$ and $W_t^{K}$ are both learnable parameters; and $\tilde{A}^{(t)}$ is the $t$-th attention guided adjacency matrix corresponding to the $t$-th head.
In this way, the attention guided layer outputs a large fully connected graph that reallocates the importance of each dependency relation rather than pruning the graph into a smaller structure as in traditional approaches.
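A hedged sketch of how such attention-guided adjacency matrices could be computed is given below; the module name and the choice of separate query/key projections per head are assumptions consistent with the equation above.

```python
# Sketch of the attention-guided adjacency matrices: each head turns the input
# representations into a soft, fully connected adjacency matrix via scaled
# dot-product attention. Shapes and names are illustrative assumptions.
import math
import torch
import torch.nn as nn

class AttentionGuidedAdjacency(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads, self.dim = num_heads, dim
        self.w_q = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_heads)])
        self.w_k = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_heads)])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_nodes, dim) output of the previous module (Q = K = h)
        adjs = []
        for t in range(self.num_heads):
            q, k = self.w_q[t](h), self.w_k[t](h)
            scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(self.dim)
            adjs.append(torch.softmax(scores, dim=-1))   # weights in (0, 1)
        return torch.stack(adjs, dim=1)                  # (batch, num_heads, n, n)

att = AttentionGuidedAdjacency(dim=1024, num_heads=2)
A_soft = att(torch.randn(1, 10, 1024))                   # (1, 2, 10, 10)
```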
Densely Connected Layer This layer helps to learn more local and non-local information and to train a deeper model using densely connected operations. Each densely connected layer has $L$ sub-layers, where $L$ is a hyper-parameter of the module. These sub-layers are placed in sequence, and each sub-layer takes the outputs of all preceding sub-layers as input. The structure of the Densely Connected Layer is shown in Figure 4. In this layer, $g_j^{(l)}$ is calculated, defined as the concatenation of the initial representation and the representations produced in each preceding sub-layer:
$$g_j^{(l)} = [\, x_j;\, h_j^{(1)};\, \dots;\, h_j^{(l-1)} \,]$$
where $x_j$ is the initial input representation and $h_j^{(1)}, \dots, h_j^{(l-1)}$ are the outputs of all preceding sub-layers. In addition, the dimension of the representations in these sub-layers is shrunk to improve parameter efficiency, i.e., $d_{hidden} = d / L$, where $L$ is the number of sub-layers and $d$ is the input dimension. For example, if the number of sub-layers is 2 and the input dimension is 1024, then $d_{hidden} = 512$. A new representation whose dimension is $L \times d_{hidden} = d$ is then formed by concatenating all these sub-layer outputs.
$N$ densely connected layers process the $N$ adjacency matrices produced by the attention guided layer. The GCN computation for each sub-layer is modified because of the application of multi-head attention:
$$h_{t_i}^{(l)} = \rho\Big(\sum_{j=1}^{n} \tilde{A}^{(t)}_{ij}\, W_t^{(l)} g_j^{(l)} + b_t^{(l)}\Big)$$
where $t$ represents the $t$-th head, and $W_t^{(l)}$ and $b_t^{(l)}$ are learnable weights and biases, which are selected by $t$ and associated with the attention guided adjacency matrix $\tilde{A}^{(t)}$.
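The following sketch illustrates one densely connected block for a single head under the assumptions above (sub-layer width d/L, concatenated inputs); it is illustrative rather than the authors' implementation.

```python
# Sketch of a densely connected GCN block for a single attention head: each
# sub-layer consumes the concatenation of the initial input and all preceding
# sub-layer outputs, and each sub-layer has width d / L.
import torch
import torch.nn as nn

class DenselyConnectedGCN(nn.Module):
    def __init__(self, input_dim: int, num_sublayers: int):
        super().__init__()
        d_hidden = input_dim // num_sublayers            # e.g. 1024 // 2 = 512
        self.sublayers = nn.ModuleList(
            [nn.Linear(input_dim + l * d_hidden, d_hidden) for l in range(num_sublayers)]
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (batch, n, input_dim) initial representations
        # adj: (batch, n, n) attention-guided adjacency matrix of this head
        outputs = []
        for layer in self.sublayers:
            g = torch.cat([x] + outputs, dim=-1)         # g = [x; h^(1); ...; h^(l-1)]
            outputs.append(self.act(layer(torch.bmm(adj, g))))
        return torch.cat(outputs, dim=-1)                # dimension L * d_hidden (= d)

block = DenselyConnectedGCN(input_dim=1024, num_sublayers=2)
out = block(torch.randn(1, 10, 1024),
            torch.softmax(torch.randn(1, 10, 10), dim=-1))   # (1, 10, 1024)
```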
Linear Combination Layer In this layer, the final output is obtained by combining the representations output by the $N$ Densely Connected Layers corresponding to the $N$ heads:
$$h_{comb} = W_{comb}\, h_{out} + b_{comb}, \qquad h_{out} = [\, h^{(1)};\, \dots;\, h^{(N)} \,]$$
where $h_{comb}$ is the combined representation of the $N$ heads and the output of the module, and $W_{comb}$ and $b_{comb}$ are learnable weights and bias.
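A minimal sketch of this combination step, with illustrative shapes:

```python
# Sketch of the linear combination layer: the N per-head outputs are concatenated
# and mapped back to the model dimension with a learnable linear transform.
import torch
import torch.nn as nn

num_heads, n_tokens, dim = 2, 10, 1024
head_outputs = [torch.randn(1, n_tokens, dim) for _ in range(num_heads)]  # from N dense blocks

combine = nn.Linear(num_heads * dim, dim)   # W_comb, b_comb
h_out = torch.cat(head_outputs, dim=-1)     # (1, n_tokens, N * dim)
h_comb = combine(h_out)                     # (1, n_tokens, dim), output of the unit
```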
3.3. Joint Unit
In the Joint Unit, the context representation and the entity representation are united to form the final joint representation for MR.
Context Representation The BERT encoder produces the final hidden state sequence $H$ corresponding to the task-oriented embedding of each token. According to the BERT mechanism, the representation $h_{\mathrm{[CLS]}}$ output at the special token “[CLS]” serves as the pooled representation of the whole sentence. Therefore, $h_{\mathrm{[CLS]}}$ serves as the context representation of the aggregate sequence.
Entity Representation To help the model capture the clues of entities and enhance the expression ability, entity indicators are inserted at the beginning and end of the entity. The entity is represented as follows: suppose that $h_m, \dots, h_n$ are the hidden states of entity $E$ output by the Syntactic Integration Unit ($m$ and $n$ represent the start and end index of the entity, respectively); an average operation is applied to obtain the final representation:
$$h_E = \frac{1}{n - m + 1} \sum_{i=m}^{n} h_i$$
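A minimal sketch of the entity averaging, with hypothetical span indices:

```python
# Sketch of the entity representation: average the Syntactic Integration Unit
# outputs over the entity span [m, n]. The indices here are illustrative.
import torch

h_syn = torch.randn(1, 12, 1024)              # token representations from the syntactic unit
m, n = 4, 6                                   # start and end index of the entity span
h_entity = h_syn[:, m:n + 1, :].mean(dim=1)   # (1, 1024) averaged entity vector
```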
Representation Integration For classification, the model concatenates $h_{\mathrm{[CLS]}}$ and $h_E$ and consecutively applies two fully connected layers with activation.
3.4. Classifier Unit
A softmax layer is applied to produce a probability distribution $p$ over the classes:
$$p = \mathrm{softmax}\big(W h_{joint} + b\big)$$
where $W \in \mathbb{R}^{r \times d}$ and $b$ refer to the learnable parameters in the network, $r$ is the number of classification types, and $d$ is the dimension of the BERT representation.
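Putting the Joint Unit and Classifier Unit together, a hedged sketch could look as follows; the hidden sizes and the Tanh activation are assumptions, since the paper does not specify them.

```python
# Sketch of the Joint and Classifier Units: concatenate the [CLS] context vector
# with the entity vector, pass through two fully connected layers, then softmax.
# Layer sizes and activation are illustrative assumptions.
import torch
import torch.nn as nn

dim, num_classes = 1024, 2            # r = 2: literal vs. metonymic
h_cls = torch.randn(1, dim)           # context representation from BERT
h_entity = torch.randn(1, dim)        # entity representation from the syntactic unit

joint = nn.Sequential(
    nn.Linear(2 * dim, dim), nn.Tanh(),
    nn.Linear(dim, dim), nn.Tanh(),
)
classifier = nn.Linear(dim, num_classes)

h_joint = joint(torch.cat([h_cls, h_entity], dim=-1))
p = torch.softmax(classifier(h_joint), dim=-1)    # probability distribution over classes
```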
4. Methodologies and Materials
4.1. Dataset
The experiments are conducted on two publicly available benchmarks: the SemEval2007 [18] and ReLocaR [8] datasets. Unlike WiMCor [35] and GWN [36], which contain huge numbers of instances, SemEval2007 and ReLocaR are relatively small. The samples fall into two classes: literal and metonymic. SemEval contains 925 training and 908 test instances, while ReLocaR comprises a training set (1026 samples) and a test set (1000 samples). The class distribution of SemEval is approximately 80% literal and 20% metonymic. To eliminate the high class bias of SemEval, the class distribution of ReLocaR is set to 50% literal and 50% metonymic.
Since the MR task is still in its infancy, there is no available MR dataset for Chinese. The English datasets SemEval and ReLocaR are employed to construct a Chinese MR dataset through text translation, manual adjustment, and labeling.
Text translation: examples from SemEval and ReLocaR are translated using an online API to obtain independent Chinese samples;
Manual adjustment: considering the limited quality of the translations obtained from the API, all examples are corrected and carefully selected to meet Chinese expression norms;
Labeling: a pair of indicators is inserted to mark the entity.
After the above steps, an MR dataset called CMR in Chinese is constructed to verify the model performance on Chinese texts. Finally, the dataset contains 1986 entity-tagged instances, of which 1192 are randomly divided as a training set and 794 as a test set. Each instance contains a sentence with the entity tag and a classification tag of literal or metonymic.
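As an illustration of the labeling step, the sketch below inserts a hypothetical pair of indicator tokens around the entity span; the marker string and span indices are assumptions.

```python
# Illustrative sketch of the "Labeling" step: insert a pair of indicator tokens
# around the target entity in a tokenized sentence.
def mark_entity(tokens, start, end, marker="[E]"):
    """Insert `marker` before and after the entity span [start, end]."""
    return tokens[:start] + [marker] + tokens[start:end + 1] + [marker] + tokens[end + 1:]

print(mark_entity("去年 马 收购 了 饿了么".split(), 1, 1))
# ['去年', '[E]', '马', '[E]', '收购', '了', '饿了么']
```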
4.2. Dependency Parsing
Given a sentence from the dataset, the sentence is first tokenized with the tokenization tool “jieba”. Then, dependency parsing is launched on the token list with Stanford CoreNLP [37]. After the dependency graph is output by the parser, the dependency relations are encoded into an adjacency matrix $A$. In particular, if there is a dependency edge between nodes $i$ and $j$, then $A_{ij} = 1$ and $A_{ji} = 1$; otherwise $A_{ij} = 0$ and $A_{ji} = 0$.
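A minimal sketch of this encoding, with the parser output replaced by a hard-coded edge list for illustration:

```python
# Turn dependency edges into the symmetric adjacency matrix A described above.
# The parsing step itself (jieba + Stanford CoreNLP) is replaced here by an
# illustrative, hard-coded edge list.
import numpy as np

tokens = ["Last_year", "Ma", "acquired", "Eleme"]
edges = [(2, 0), (2, 1), (2, 3)]      # (head, dependent) pairs: "acquired" governs the rest

n = len(tokens)
A = np.zeros((n, n), dtype=np.float32)
for i, j in edges:
    A[i, j] = 1.0                     # A_ij = 1
    A[j, i] = 1.0                     # A_ji = 1 (symmetric / undirected)
print(A)
```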
4.3. Model Construction
The proposed models are implemented in Python 3.6 with the deep learning framework PyTorch 1.1. They are trained on a Tesla V100-16GB GPU. EBAGCN requires approximately 1.5 times the GPU memory of vanilla BERT-LARGE.
Given a sentence S with an entity E, MR aims to predict whether E is a metonymic entity nominal. The key idea of EBAGCN is to enhance the BERT representation with structural knowledge from dependency trees and entities. Generally, the entire sentence first goes through the BERT Encoding Unit to obtain a deep bidirectional representation for each token. Then, dependency parsing is launched to extract the dependency relations from the sentence. Subsequently, both the deep bidirectional representations and the dependency relations are fed into the Syntactic Integration Unit. The resulting vector representations are enriched with syntactic knowledge and integrated with the context representation in the Joint Unit. Finally, the fused embedding is used to produce the final prediction distribution in the Classifier Unit.
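To make the data flow concrete, the condensed sketch below wires the units together in roughly this order; it is a simplified approximation (e.g., heads share parameters, and the hard and soft graph steps are each reduced to a single layer), not the authors' implementation.

```python
# Condensed sketch of the EBAGCN forward pass: BERT-style token encodings ->
# GCN over the parsed dependency graph -> attention-guided (soft) GCN ->
# joint [CLS] + entity representation -> classifier. Shapes are illustrative.
import math
import torch
import torch.nn as nn

class EBAGCNSketch(nn.Module):
    def __init__(self, dim=256, num_heads=2, num_classes=2):
        super().__init__()
        self.dim, self.num_heads = dim, num_heads
        self.hard_gcn = nn.Linear(dim, dim)                  # GCN over the parsed tree
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.soft_gcn = nn.Linear(dim, dim)                  # GCN over attention-guided graphs
        self.combine = nn.Linear(num_heads * dim, dim)
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                        nn.Linear(dim, num_classes))

    def forward(self, H, adj, ent_start, ent_end):
        # H: (1, n, dim) BERT token outputs; adj: (1, n, n) dependency adjacency matrix
        h = torch.relu(self.hard_gcn(torch.bmm(adj, H)))     # inject hard dependency structure
        heads = []
        for _ in range(self.num_heads):
            scores = torch.bmm(self.w_q(h), self.w_k(h).transpose(1, 2)) / math.sqrt(self.dim)
            soft_adj = torch.softmax(scores, dim=-1)         # soft dependency constraints
            heads.append(torch.relu(self.soft_gcn(torch.bmm(soft_adj, h))))
        h_syn = self.combine(torch.cat(heads, dim=-1))       # syntax-enriched token vectors
        h_cls = H[:, 0, :]                                   # context representation
        h_ent = h_syn[:, ent_start:ent_end + 1, :].mean(1)   # entity representation
        return torch.softmax(self.classifier(torch.cat([h_cls, h_ent], -1)), dim=-1)

model = EBAGCNSketch()
H, adj = torch.randn(1, 8, 256), torch.eye(8).unsqueeze(0)
print(model(H, adj, ent_start=2, ent_end=3).shape)           # (1, 2)
```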
4.4. Model Setup
For both datasets, the batch size is set to 8 and the number of training epochs to 20. The number of heads for multi-head attention N is chosen from , the max sequence length from , and the initial learning rate for AdamW from . The combinations (), (), and () give the best results on SemEval, ReLocaR, and CMR, respectively.
The evaluation metrics for the experiments are accuracy, precision, recall, and F1.
accuracy The proportion of all samples for which the model predicts the correct result;
precision The proportion of predicted positive samples that are actually positive;
recall The proportion of actually positive samples that are predicted as positive;
F1 The F1 score balances precision and recall. The expression of the F1 score is:
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
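For reference, a small sketch of computing these metrics on hypothetical predictions, assuming scikit-learn:

```python
# Evaluation metrics on illustrative labels (1 = metonymic, 0 = literal),
# computed with scikit-learn; any equivalent implementation would do.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```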
5. Results
5.1. Models
The baseline models used in the experiment are listed below.
SVM+Wikipedia: SVM+Wikipedia is the previous SOTA statistical model. It applies SVM with Wikipedia’s network of categories and articles to automatically discover new relations and their instances.
LSTM and BiLSTM: LSTM is one of the most potent dynamic classifiers publicly known [38]. Because of its memory mechanism for retaining previous hidden states, it achieves promising results and is widely used in various NLP tasks. Moreover, BiLSTM improves the token representation by being aware of the context from both directions [39], making contextual reasoning available. Additionally, two kinds of representations, GloVe [29] and ELMo [15], are used separately to ensure credible model results.
Paragraph, Immediate, and PreWin: These three models are primarily built upon BiLSTM. They simultaneously encode tokens as word embeddings and dependency tags as one-hot vectors (5–10 tokens in general). The difference between them lies in the way of picking tokens. Immediate-y selects y words on each side of the entity as input to the model [40,41]. The Paragraph model extends the Immediate model by taking the 50 words on each side of the entity as input to the classifier. The PreWin model relies on a predicate window consisting of the immediate vocabulary around the recognized predicate to eliminate noise over a long distance.
fastText: FastText is a tool that computes word vectors and classifies text without great academic novelty. However, its advantages are obvious: in text classification, fastText can achieve performance similar to that of deep networks with little training cost.
CNN: The experiment applies a classical CNN model targeted at text classification, composed of an input layer, convolution layer, pooling layer, and softmax layer. Since the model is applied to text (rather than CNN’s traditional application, images), some adjustments are made to adapt it to the NLP task.
BiLSTM+Att: BiLSTM+Att applies an attention layer on the BiLSTM to increase the representation ability.
BERT: BERT is a language model trained on deep Transformer networks which performs NLP tasks well through pre-training and fine-tuning, and achieves the best results in many downstream tasks of NLP [31,42].
5.2. Main Result on SemEval and ReLocaR
On SemEval and ReLocaR, this approach is compared with feature-based models, deep neural network models, and pre-trained language models.
Table 1 reports the results.
Three self-constructed models are compared in the experiment: Entity BERT (a BERT model integrating entity information through the joint representation but discarding the syntactic constraints), EBGCN (Entity BERT with GCN, which applies a normal GCN without attention to impose hard syntactic constraints on Entity BERT), and EBAGCN (Entity BERT with Attention-guided GCN, which applies an attention-guided GCN to impose soft syntactic constraints on Entity BERT).
The results in Table 1 show that the models in this work significantly outperform the previous SOTA model SVM+Wikipedia, which is based on feature engineering. They also surpass all the DNN models, including LSTM, BiLSTM, and PreWin, even when those incorporate POS and NER features to enrich the representation. This result illustrates that conventional features cannot provide enough contextual information. Entity BERT and BERT are both pre-trained language models; however, there are significant differences in their effect due to the incorporation of the entity constraint, which greatly improves the accuracy of the model.
Furthermore, the three models also perform differently. EBGCN produces a decent accuracy that is 0.3% and 0.2% higher than Entity BERT on SemEval and ReLocaR, respectively, which illustrates that the application of GCN helps improve performance by catching otherwise ignored information from the syntax. Moreover, EBAGCN obtains an improvement of 0.7% and 0.2% over EBGCN in terms of accuracy. This provides ample proof that the introduction of the multi-head attention mechanism assists GCNs in learning better information aggregations by simultaneously pruning irrelevant information and emphasizing dominating relations concerning indicators such as verbs in a soft manner.
The table also gives the F1-scores for the literal and metonymic classes, respectively. The results show that EBAGCN achieves the best F1-score on both SemEval (where metonymic accounts for 20%) and ReLocaR (where metonymic accounts for 50%), which suggests that EBAGCN adapts to various class distributions.
To be specific, Entity BERT uses the BERT-based neural network to aggregate information about context and entity semantics to form a better semantic vector representation. In this way, the model leverages entity words and enhances the interaction between entity words and context information, thus improving the precision and recall of MR. The improved result of Entity BERT verifies the importance of entity information and proves that Entity BERT can effectively compensate for the missing entity information in metonymy.
In the framework of cognitive linguistics, syntactic structures are considered to carry important information. Existing DNN-based models scan the information of the whole sentence sequence and compress it into a vector expression, which cannot capture the syntactic structure that plays an important role in the transmission of natural language information. In addition, all syntactic dependencies are added to the syntactic representation vector with the same weight, so it is impossible to distinguish the contribution of each dependency. Thus, EBAGCN goes a step further to leverage syntax knowledge selectively. The weight allocation scheme for syntactic dependencies in EBAGCN does not just resist noise interference but also improves the accuracy of MR. Finally, the proposed EBAGCN effectively addresses the low accuracy on long and difficult sentences as well as on key word recognition.
5.3. Main Result on CMR
Unlike the English datasets ReLocaR and SemEval, Chinese text is harder to understand.
Table 2 gives the results on CMR. As can be seen from the experimental results, fastText, CNN, and BiLSTM+Att show little performance gap on the Chinese MR task. BERT greatly improves the performance by relying on the powerful ability of the pre-trained model, and the Entity BERT proposed in this paper achieves a better result by reinforcing the entity information. Compared with Entity BERT, EBGCN is about 1.2% higher in accuracy, showing a large performance improvement and proving the significance of dependency knowledge. Furthermore, taking advantage of syntactic noise elimination, EBAGCN obtains the SOTA result on CMR, proving the validity of this work.
6. Discussion
6.1. Entity Ablation Experiment
The validity of Entity BERT was demonstrated in the main results above. This experiment digs further to understand the specific contributions of each module besides the pre-trained BERT component. Taking Entity BERT as the baseline, this work evaluates three ablated variants:
Entity BERT NO-SEP-NO-ENT discards both the entity representation and the entity indicators around the entity, i.e., only the representation corresponding to “[CLS]” is used for classification;
Entity BERT NO-SEP discards only the entity representation but retains the entity indicators around the entity;
Entity BERT NO-ENT discards only the entity indicators around the entity but retains the entity representation $h_E$.
Table 3 shows the results of the ablation experiment. From the table, all three variants perform worse than Entity BERT. Among them, Entity BERT NO-SEP-NO-ENT performs the worst, proving that both the entity indicator and the entity representation make great contributions to the model. The purpose of the entity indicator is to integrate entity location information into the BERT pre-trained model. On the other hand, the entity representation further enriches the information and helps the model achieve high accuracy.
Entity BERT enhances the influence of entity information on the discrimination results and provides a solid foundation for accurately representing the joint embedding of entity and context. In the MR task, most of the key information is focused on the entity. The integrity of the entity representation largely determines the performance of the model. Making full use of the semantic, location, and structural information of entity words can effectively reduce the influence of noise. Therefore, as entity information in metonymic text is difficult to extract and represent, a fusion perception method is applied in this paper, which can effectively improve the accuracy and efficiency of MR under the joint guidance of entity and context knowledge. Based on the above approach, this paper puts forward Entity BERT, which jointly trains BERT on the entity and the context. In that case, Entity BERT makes use of the key information of the entity to reduce the influence of noise in the sentence; thus, precision and recall are greatly improved.
6.2. Entity Contribution Verification
To further study the contribution of entity information to MR, this experiment inputs a single entity representation $h_E$ into the BERT model without using the contextual information representation $h_{\mathrm{[CLS]}}$. The semantic representations of several entities are mapped into Figure 5.
As shown in the left panel of Figure 5, before BERT is fine-tuned, the semantic representation of each entity is not far from its original representation, indicating that entity information is very sparse and weak without entity fine-tuning. However, as shown in the right panel, after BERT fine-tuning, metonymic and literal entities are divided into two clusters, which shows that the fine-tuned model can judge metonymy.
Existing deep learning models depend on the context, whose representation is formed by inputting all the tokens of the whole sentence into the fully connected layer, inevitably containing a lot of noise. This experiment proves that integrating entity information into the language model helps the task of MR and verifies the rationality of Entity BERT.
6.3. Comparison w.r.t Sentence Length
Figure 6 further compares the accuracy of EBAGCN and Entity BERT under different sentence lengths.
There may be more than one cause of poor model performance. For example, long sentences are likely to reduce classification accuracy for the following reasons:
Contextual meanings for long sentences are more difficult to capture and represent.
The position of key tokens, such as a predicate, is noisy and, therefore, difficult to determine.
Intuitively, lacking the ability to model non-local syntactic relations, sequence-based models such as Entity BERT cannot sufficiently capture long-distance dependencies. Thus, as predicted, the accuracy of Entity BERT drops sharply as the sentence length grows, as shown in Figure 6. However, such performance degradation is alleviated by EBAGCN, which suggests that capturing non-local syntactic relations helps the proposed model accurately infer the meaning of the entity, especially in longer sentences.
The proposed EBAGCN addresses the problem of complex sentence patterns and difficult syntactic understanding. By means of a GCN based on the attention mechanism, the semantic and syntactic representations of the context are jointly trained. The GCN effectively helps integrate syntactic knowledge into the vector representation, while the attention mechanism highlights the expression of key information in the dependencies, which eliminates syntactic noise to a certain extent.
6.4. Attention Visualization
This experiment provides a case study, using a motivating example that is correctly classified by EBAGCN but misclassified by Entity BERT, to vividly show the effectiveness of the proposed model. Given the sentence “He later went to report Malaysia for one year”, people can easily identify “Malaysia” as a metonymic entity nominal by extending “Malaysia” to the concept of “a big event in Malaysia”. Nevertheless, from the semantic perspective alone, the verb phrase “went to” is such a strong indicator that Entity BERT is prone to falsely recognize “Malaysia” as a literal territory (due to the regular usage of “went to someplace”) and to overlook the true predicate “report”. How EBAGCN resolves the problems mentioned above is explained by visualizing the attention weights in the model.
First, the attention matrices of the Transformer encoder blocks in the BERT Encoding Unit are compared to display the syntactic integration’s contribution to the whole model.
Figure 7a,b shows that the weights over tokens in Entity BERT are more decentralized, while EBAGCN concentrates on “report” and “Malaysia” rather than “went to” thanks to the application of syntactic features, indicating that, with the help of the syntactic component, EBAGCN can better pick valid tokens and discard irrelevant or even misleading chunks.
The Proposed Model section shows that the Attention Guided Layer transfers the hard dependency matrix into an attention guided matrix, which enables the syntactic component to select relevant syntactic constraints efficiently. Thus, this work further displays the attention guided matrix to demonstrate the superiority of soft dependency relations. As shown in Figure 7c, after launching the multi-head mechanism, despite the existence of dependency relations for the prepositional phrase “for one year”, the weights for these relations are quite small compared with those of the main clause, which includes the verb and other determining features, correctly marking the relations inside the prepositional phrase as useless to the MR task. This approach frees the model from pre-defined pruning strategies and automatically obtains high-quality relations.