Targeted Sentiment Classification Based on Attentional Encoding and Graph Convolutional Networks

Targeted sentiment classification aims to predict the sentiment polarity of a specific target. Currently, most methods (e.g., recurrent neural networks and convolutional neural networks combined with an attention mechanism) cannot fully capture the semantic information of the context, and they lack a mechanism to account for the relevant syntactic constraints and long-range word dependencies. As a result, syntactically irrelevant context words may mistakenly be recognized as clues for predicting the target sentiment. To tackle these problems, this paper considers semantic information, syntactic information, and their interaction to be crucial to targeted sentiment analysis, and proposes an attentional-encoding-based graph convolutional network (AEGCN) model. The proposed model is mainly composed of multi-head attention and an improved graph convolutional network built over the dependency tree of a sentence. Pre-trained BERT is also applied to this task, and new state-of-the-art performance is achieved. Experiments on five datasets show the effectiveness of the proposed model compared with a series of recent models.


Introduction
Natural language processing is an important part of the new generation of artificial intelligence, particularly in human-machine interaction [1]. All the major computing companies have integrated or are integrating natural language processing capacity in their systems. Targeted sentiment classification [2,3] is a basic task of natural language processing that has attracted a great deal of attention in recent years. It is a fine-grained task in sentiment analysis, and it aims to predict the emotional polarity of each target within a sentence. For example, in the sentence "the price is reasonable while the service is poor", the emotional polarities of two targets "price" and "service" are positive and negative, respectively. A specific target is usually an entity or an aspect term.
Usually, researchers use machine learning algorithms to classify the sentiment of the given targets in a sentence. Some early work used handcrafted features, such as sentiment lexicons and bag-of-words features, to train classifiers for target-specific sentiment classification [4,5]. However, these methods are highly dependent on the quality of the selected features and require a large amount of manual feature engineering. In later studies, various neural-network-based methods that need no manual feature engineering became popular [6,7]. Most of them are based on long short-term memory (LSTM) neural networks [8], and some are based on convolutional neural networks (CNNs) [9]. Many of these neural-network-based methods embed specific target information into the sentence representation via an attention mechanism [7]. Some studies have applied attention mechanisms to generate target-specific sentence representations [10,11], or to transform sentence representations according to the target words [12]. However, these studies rely on complex recurrent neural networks (RNNs) as sequence encoders to infer the hidden semantics of the context.
The first problem of the previous studies is that the semantic modeling only uses RNNs combined with the traditional attention mechanism. Each output state of RNNs depends on the previous state, while in semantic modeling, long-distance semantic information may be lost and the parallel computing of input data cannot be carried out [13]. In addition, the traditional attention mechanism is prone to introduce excessive noise because the distribution of weight values is too scattered, and thus it is difficult to accurately extract enough contextual sentiment information related to a specific target. Self-attention [14] is a novel attention mechanism. In sequence-to-sequence (Seq2Seq) tasks, experimental results have demonstrated that this method performs more satisfactorily than traditional RNNs in capturing semantic information.
Another problem in previous research is that these methods largely ignore the syntactic structure of the sentence, while in fact the syntactic structure helps to identify the emotional characteristics directly related to the specific target. When a specific target term is separated from its affective phrase, it is difficult to find related affective words in its surrounding words. The CNN-based models perceive multi-word features as continuous words by convolution of word sequences, whereas it is not sufficient to determine the sentiment expressed by multiple words that are not adjacent to each other [15]. Take the following sentence as an example: "The hotpot, though served with poor service, is actually delicious". As "delicious" and the target word "hotpot" in the word sequence are a little farther away from each other, the CNN models cannot capture the remote word dependency. On the other hand, in the syntactic dependency tree, the word "delicious" is closer to the target "hotpot" (see Figure 1). In addition, the use of syntactic dependency trees also helps to solve the potential ambiguity in word sequences [16]. In the simple sentence "nice beef terrible juice", nice and terrible can be used interchangeably. It is difficult to distinguish which word "nice" or "terrible" is related to the target word "beef" or "juice" if only the traditional attention-based method is applied. However, if a person has a good knowledge of grammar, she can easily realize that "nice" is the adjective modifier of "beef", and "terrible" is the modifier of "juice". Since the structure of the syntactic dependency tree is similar to the graph structure and the graph convolutional network (GCN) [17] is an effective convolutional neural network that is able to directly operate on graphs, this paper proposes an improved GCN to better extract and integrate the syntactic information displayed in the syntactic dependency tree of the sentences. 
Overall, semantic information and syntactic information are both crucial for determining the sentiment polarity of a specific target, and this paper tries to embed rich semantic and syntactic information into the word representation and the specific aspect representation. The main contributions of this paper are as follows:
• This paper proposes the novel attentional-encoding-based graph convolutional network (AEGCN) model, which leverages the syntactic structure of a sentence and utilizes multi-head self-attention combined with LSTM to capture context features and specific target features concurrently. The AEGCN model combines semantic information and syntactic information to predict the sentiment polarity of a targeted aspect.
• Syntactic information has not attracted enough attention in many related studies. This paper builds an improved graph convolutional network with point-wise convolution over the dependency tree of a sentence to extract syntactic information, and utilizes multi-head self-attention to obtain the syntactic information encoding.
• This paper evaluates the proposed method on five datasets. Experiments show that AEGCN achieves competitive performance against state-of-the-art approaches. This paper also applies pre-trained BERT to this task and achieves new state-of-the-art performance.
The rest of this paper is organized as follows. Section 2 gives a brief review of the related work. Section 3 describes the AEGCN model. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.

Related Works
In this part, we briefly review the specific target sentiment classification and graph convolution network.

Targeted Sentiment Classification
Targeted sentiment classification is an important research topic as well as a fine-grained task in sentiment analysis, which is also known as opinion mining [18]. Early works on targeted sentiment classification mainly focus on extracting features, such as bag-of-words features and sentiment dictionary features, to train sentiment classifiers [19]. Most of these methods are rule-based [20] or statistical [4], and all of them are extremely dependent on feature engineering, which is a labor-intensive task. In recent years, recurrent neural networks (RNNs) have achieved great success in this task because deep learning models can utilize distributed representations to automatically learn the relevant features of targets. In addition, the use of attention mechanisms makes the sentence representation focus more on the information that is important for a given target [21]. ATAE-LSTM [7] combines LSTM and an attention mechanism, and embeds specific targets into the calculation of attention weights. RAM was proposed by Chen et al. [11]; this work improves upon MemNet by representing memory with bidirectional LSTM (Bi-LSTM) and using a gated recurrent unit network to combine the multiple attention outputs for sentence representation. Ma et al. [10] designed a model with a bi-directional attention mechanism, which learns the attention weights of the contexts and the specific target words in an interactive way. AEN [22] avoids recurrence and applies multiple multi-head attention layers between the contexts and the specific targets. Li et al. [23] developed a new direction named coarse-to-fine task transfer, which uses bidirectional LSTM and multiple attention layers to accomplish this task. However, these studies do not take syntactic information into account and ignore the syntactic interdependence between words, which may lead to ambiguity when identifying the sentiment polarity of a specific target.

Application of Graph Convolution Networks in NLP
GCNs [24] are very good at processing graph data with rich relational information. Initially, many studies were dedicated to extending GCNs to image-related tasks [25,26]. Qi et al. [27] propose a 3D graph neural network (3DGNN) that builds a k-nearest neighbor graph on top of a 3D point cloud; each node in the graph corresponds to a set of points, which allows the model to directly learn its representation from 3D points. In recent years, GCNs have attracted increasing attention in NLP, in applications such as semantic role labeling [28] and relation classification [29]. In the semantic role labeling task, the GCN was applied to the NLP field for the first time, and the experimental results showed that the GCN is very suitable for this task. This has inspired many NLP scholars to explore the application of GCNs in their own research. Some researchers have explored the use of graph neural networks in text classification. Peng et al. [30] first converted texts to graphs-of-words, and then used graph convolution operations to convolve the word graphs. The graph-of-words representation of texts is a novel idea in this field, which has the advantage of capturing non-consecutive and long-distance semantics. There are also some works that have successfully applied GCNs in sentiment classification [15,31]. In [32], Zhao et al. propose a novel aspect-level sentiment classification model which can effectively capture the sentiment dependencies between multiple aspects in one sentence. They consider the sentiment dependencies between aspects in one sentence for the first time. The above studies show that GCNs can effectively capture the relationships between nodes. Inspired by [15], this paper improves the GCN and combines it with a multi-head attention mechanism for targeted sentiment classification, achieving experimental results comparable to those of state-of-the-art methods.

The Attention-Encoding-Based Graph Convolutional Network Model (AEGCN)
The overall architecture of the AEGCN model is illustrated in Figure 2. In the figure, "embedding" denotes GloVe embedding or pre-trained BERT embedding; "hidden state" represents the Bi-LSTM; MHSA refers to multi-head self-attention; MHIA refers to multi-head interactive attention; "L-layer GCN" represents the layers in the GCN; and "pool" indicates average pooling. First, a Bi-LSTM is used for preliminary semantic modeling of the context and the specific targets. After the hidden state of the context is obtained, the GCN and MHSA are combined to encode the syntactic information. Meanwhile, MHSA is applied for further attentional encoding of the hidden states of the context and the specific targets to obtain richer semantic information. Then, the contextual semantic encoding and the target-specific semantic encoding interact with the syntactic information encoding through MHIA. Average pooling is applied to the interactive information and the contextual semantic encoding, and finally the pooled vectors are concatenated to obtain the final feature representation that is used to predict the sentiment polarity.

Semantic Coding
In this part, AEGCN encodes the semantic information of an n-word sentence containing the m-word aspect term, $c = \{e^c_1, e^c_2, \ldots, e^c_{\tau}, e^c_{\tau+1}, \ldots, e^c_{\tau+m-1}, \ldots, e^c_n\}$, and the target $t = \{e^t_1, e^t_2, e^t_3, \ldots, e^t_m\}$ with the combination of Bi-LSTM and multi-head self-attention, where $\tau$ denotes the start token of the aspect term. This paper applies GloVe embedding and BERT embedding in the model. Accordingly, the models are named AEGCN-GloVe and AEGCN-BERT.

GloVe Embedding
Let $L \in \mathbb{R}^{d_e \times |V|}$ be the embedding matrix of pre-trained GloVe [33], where $d_e$ is the dimension of the word vector and $|V|$ is the vocabulary size. Then, each word $w_i \in \mathbb{R}^{|V|}$ is mapped to its corresponding embedding vector $e_i \in \mathbb{R}^{d_e \times 1}$, which is a column of the embedding matrix.
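The lookup described above can be illustrated with a minimal sketch. The vocabulary, the matrix values, and the function names `embed` and `embed_one_hot` are hypothetical and chosen only for illustration; real GloVe vectors have $d_e = 300$ in this paper.

```python
# Toy embedding matrix L with d_e = 3 rows and |V| = 4 columns (made-up values).
# Each column is the embedding vector of one vocabulary word.
vocab = {"the": 0, "price": 1, "is": 2, "reasonable": 3}  # hypothetical vocabulary
L = [
    [0.1, 0.5, 0.0, 0.9],
    [0.2, 0.6, 0.1, 0.8],
    [0.3, 0.7, 0.2, 0.7],
]

def embed(word):
    """Return the column of L that belongs to `word`."""
    i = vocab[word]
    return [row[i] for row in L]

def embed_one_hot(word):
    """Same lookup written as L multiplied by the one-hot vector w_i in R^{|V|}."""
    one_hot = [1.0 if j == vocab[word] else 0.0 for j in range(len(vocab))]
    return [sum(l * w for l, w in zip(row, one_hot)) for row in L]

sentence = ["the", "price", "is", "reasonable"]
embedded = [embed(w) for w in sentence]  # one d_e-dimensional vector per token
```

Both functions return the same vector, which shows that the mapping from a one-hot word to its embedding is simply a column selection.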

BERT Embedding
This paper uses pre-trained BERT [34] to generate the word vectors of a sequence as the BERT embedding.
To facilitate the training and fine-tuning of the BERT model, this paper transforms the given context and the given target into the forms "[CLS] + context + [SEP]" and "[CLS] + target + [SEP]", respectively.
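The input formatting can be sketched as follows. The helper name `format_bert_input` is hypothetical, and the example works on whitespace tokens only; sub-word tokenization and the conversion to vocabulary ids, which a real BERT pipeline performs, are deliberately left out.

```python
def format_bert_input(tokens):
    """Wrap a token sequence as [CLS] + tokens + [SEP]."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"]

context = "the price is reasonable while the service is poor".split()
target = ["price"]

context_input = format_bert_input(context)
target_input = format_bert_input(target)
```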

Bi-Directional LSTM
After obtaining the context sequence $c = \{e^c_1, e^c_2, e^c_3, \ldots, e^c_n\}$ and the target sequence $t = \{e^t_1, e^t_2, e^t_3, \ldots, e^t_m\}$, this paper builds a Bi-LSTM to generate the hidden state vectors of the context, $H^c$, and of the target, $H^t$.

Multi-Head Attention
Multi-head attention (MHA) [22] is an attention function that can be performed in parallel subspaces. In this paper, multi-head self-attention and multi-head interactive attention are employed for different modeling goals. This paper defines a Key sequence $k = \{k_1, k_2, \ldots, k_n\}$ and a Query sequence $q = \{q_1, q_2, \ldots, q_n\}$ according to the specific task. The attention value is obtained by calculating the attention distribution with Key and applying it to Value. Since keys and values are often the same in NLP applications, here Key = Value. An attention function then maps Key and Query to an output sequence, using a function $f_m$ that learns the semantic relevance between $q_j$ and $k_i$; $W_a \in \mathbb{R}^{2d_{hid}}$ is the learnable weight of $f_m$. MHA learns $n_{head}$ different scores in parallel subspaces, and the parameters are not shared between heads because the values of $q$ and $k$ are constantly changing. The outputs of the $n_{head}$ heads are concatenated (denoted by "⊕") and projected to the hidden dimension $d_{hid}$; here $o^h = \{o^h_1, \ldots, o^h_m\}$ is the output of the $h$-th attention head, and $h \in [1, n_{head}]$. Multi-head self-attention (MHSA) is the special case of MHA in which $q = k$. Given the context hidden state $H^c$ and the target word hidden state $H^t$, AEGCN derives the semantic encodings of the context and the target words, $H^{cs}$ and $H^{ts}$, by applying MHSA to $H^c$ and $H^t$, respectively.
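As an illustration of the mechanism described above, the following is a minimal sketch and not the authors' implementation. It assumes that the score function has the form $f_m(k_i, q_j) = \tanh([k_i; q_j] \cdot W_a)$, that the scores are normalized with a softmax over the keys, and that every head works directly on the full input vectors; the exact equations are not reproduced in this text, so these choices are assumptions. All names (`attention`, `multi_head_attention`, `w_mh`) and the random toy inputs are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(keys, queries, w_a):
    """One attention head with Key = Value.

    Assumed score: f_m(k_i, q_j) = tanh([k_i; q_j] . w_a), len(w_a) = 2 * dim.
    Returns one output vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [math.tanh(sum(x * w for x, w in zip(k + q, w_a))) for k in keys]
        weights = softmax(scores)  # attention distribution over the keys
        outputs.append([sum(wt * k[d] for wt, k in zip(weights, keys))
                        for d in range(dim)])
    return outputs

def multi_head_attention(keys, queries, head_weights, w_mh):
    """Run every head, concatenate the per-query outputs, project with w_mh."""
    heads = [attention(keys, queries, w_a) for w_a in head_weights]
    d_out = len(w_mh[0])
    result = []
    for j in range(len(queries)):
        concat = [x for head in heads for x in head[j]]
        result.append([sum(c * w_mh[r][col] for r, c in enumerate(concat))
                       for col in range(d_out)])
    return result

random.seed(0)
d_hid, n_head = 4, 3
head_weights = [[random.uniform(-1, 1) for _ in range(2 * d_hid)] for _ in range(n_head)]
w_mh = [[random.uniform(-1, 1) for _ in range(d_hid)] for _ in range(n_head * d_hid)]
H_c = [[random.uniform(-1, 1) for _ in range(d_hid)] for _ in range(5)]  # toy context states
H_t = [[random.uniform(-1, 1) for _ in range(d_hid)] for _ in range(2)]  # toy target states

# The same toy weights are reused for brevity; separate modules would be trained in practice.
H_cs = multi_head_attention(H_c, H_c, head_weights, w_mh)   # self-attention: q = k
H_ts = multi_head_attention(H_t, H_t, head_weights, w_mh)
mixed = multi_head_attention(H_c, H_t, head_weights, w_mh)  # interactive: q differs from k
```

The output always has one vector per query, so self-attention preserves the sequence length, while interactive attention returns a sequence as long as the query sequence.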

Syntactic Information Encoding
In this section, the architecture of graph convolution network is improved. The modified graph convolution network is used to better integrate syntactic information into each word representation, followed by a multi-head self-attention that encodes the final syntactic information.

Graph Convolution Network
Graph convolution networks [24] are particularly well suited to graph data with rich relational information. Given a graph with $k$ nodes, an adjacency matrix $A \in \mathbb{R}^{k \times k}$ can be obtained by enumerating its edges. A GCN has $L$ layers indexed by $l \in [1, 2, \cdots, L]$, where $h^L_i$ is the final state of node $i$. In the graph convolution of a node, $W^l$ is the linear transformation weight matrix, $b^l$ is the offset vector, and $\sigma$ is a nonlinear function, such as ReLU. An example of a GCN layer is shown in Figure 3. First, the hidden state of the previous layer $h^{l-1}_i$ undergoes a linear transformation. Then, each hidden state is shaped by the node information directly related to it to obtain the hidden state of the current layer $h^l_i$. Graph convolution over dependency trees [15] was proposed by Zhang et al. Each graph convolution can only encode the information of immediate neighbor nodes, so in a GCN with $L$ layers a node can only be influenced by nodes that are at most $L$ steps away. In view of this, syntactic constraints are imposed on the targets of the sentence by convolving over the syntactic dependency tree of the sentence, so that descriptive words can be identified by their syntactic distance. In addition, when a specific target is described by non-consecutive words, the method can aggregate the features of these words without missing their information. Therefore, this paper includes the syntactic information by combining graph convolution over dependency trees with point-wise convolution, and then encodes the resulting syntactic information with multi-head self-attention to get the final syntactic information encoding.
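The two steps of a GCN layer described above (a linear transformation followed by aggregation over directly connected nodes) can be sketched as below. The update is assumed to take the common form $h^l_i = \sigma(\sum_j A_{ij} W^l h^{l-1}_j + b^l)$ with ReLU as $\sigma$; the function name `gcn_layer` and the toy path graph are hypothetical.

```python
def gcn_layer(A, H, W, b):
    """Assumed update: h_i = ReLU(sum_j A[i][j] * (W h_j) + b)."""
    # Step 1: linear transformation of every node state from the previous layer.
    transformed = [[sum(w * x for w, x in zip(row, h)) for row in W] for h in H]
    out = []
    for i in range(len(H)):
        # Step 2: aggregate the transformed states of the direct neighbours.
        agg = [sum(A[i][j] * transformed[j][d] for j in range(len(H))) + b[d]
               for d in range(len(b))]
        out.append([max(0.0, v) for v in agg])  # ReLU
    return out

# Path graph 0 - 1 - 2 without self-loops; only node 0 carries a signal.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
H0 = [[1.0], [0.0], [0.0]]
W, b = [[1.0]], [0.0]

H1 = gcn_layer(A, H0, W, b)  # signal reaches node 1 only
H2 = gcn_layer(A, H1, W, b)  # after two layers it reaches node 2
```

In this toy run the signal of node 0 reaches node 2 only after the second layer, which illustrates why the number of layers bounds how far information can travel in the graph.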
A position-aware transformation is applied to $h^l_i$ before $H^c$ is fed into successive GCN layers, using $F(\cdot)$, a position weight function used in [11,12] that enhances the importance of the words close to a specific target; $q_i \in \mathbb{R}$ is the position weight of the $i$-th token. Next, after constructing the dependency tree of a given sentence, AEGCN obtains the adjacency matrix $A \in \mathbb{R}^{n \times n}$ over the words in the sentence. Following the idea of self-looping [17], each word is manually set to be adjacent to itself, that is, the diagonal values of $A$ are 1. The opposite direction of each dependency is also included, which means $A_{ij} = 1$ and $A_{ji} = 1$ if there is an edge going from node $i$ to node $j$; otherwise, $A_{ij} = 0$ and $A_{ji} = 0$ (see Figures 4 and 5). Finally, the representation of each node is updated with a graph convolution [17] with a normalization factor, where $s^{l-1}_j \in \mathbb{R}^{2d_h}$ is the representation of the $j$-th token produced by the previous GCN layer, $h^l_i \in \mathbb{R}^{2d_h}$ is the output of the current GCN layer, and $d_i = \sum_{j=1}^{n} A_{ij}$ is the degree of the $i$-th token in the tree. The weight $W^l$ and the bias $b^l$ are both learnable parameters.
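The construction above can be sketched as follows, under several stated assumptions. The position weights are assumed to decay linearly with the distance to the target and to be zero on the target tokens, and the normalized update is assumed to divide the aggregated state by $d_i + 1$, as in the dependency-tree GCN of [15]; neither formula is reproduced in this text. The edges use only the two modifier relations mentioned in the introduction for "nice beef terrible juice"; the rest of the parse is omitted, and all function names are hypothetical.

```python
def build_adjacency(n, edges):
    """Symmetric adjacency matrix with self-loops from (head, dependent) pairs."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1                      # self-loop
    for head, dep in edges:
        A[head][dep] = 1                 # original direction
        A[dep][head] = 1                 # opposite direction
    return A

def position_weights(n, start, m):
    """Assumed weights q_i: linear decay with distance, 0 on the m target tokens.

    `start` is the 0-based index of the first target token.
    """
    weights = []
    for i in range(n):
        if i < start:
            weights.append(1.0 - (start - i) / n)
        elif i < start + m:
            weights.append(0.0)
        else:
            weights.append(1.0 - (i - (start + m - 1)) / n)
    return weights

def normalized_gcn_layer(A, S, W, b):
    """Assumed update: h_i = ReLU(sum_j A[i][j] * (W s_j) / (d_i + 1) + b)."""
    n = len(S)
    transformed = [[sum(w * x for w, x in zip(row, s)) for row in W] for s in S]
    out = []
    for i in range(n):
        d_i = sum(A[i])                  # degree of token i
        h = [sum(A[i][j] * transformed[j][d] for j in range(n)) / (d_i + 1) + b[d]
             for d in range(len(b))]
        out.append([max(0.0, v) for v in h])
    return out

tokens = ["nice", "beef", "terrible", "juice"]
edges = [(1, 0), (3, 2)]                 # beef -> nice, juice -> terrible
A = build_adjacency(len(tokens), edges)
q = position_weights(len(tokens), start=1, m=1)   # target "beef"

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
S = [[q_i * x for x in h] for q_i, h in zip(q, identity)]  # position-aware states
H_next = normalized_gcn_layer(A, S, identity, [0.0] * 4)
```

With these toy inputs, "nice" and "beef" only exchange information with each other, and likewise "terrible" and "juice", which mirrors the disambiguation argument made for this sentence.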

Point-Wise Convolution
Once the output of the current layer $h^l = \{h^l_1, h^l_2, \ldots, h^l_n\}$ is derived, point-wise convolution (PWC) is performed. Point-wise means that the convolution kernel size is 1, so the same operation is applied to each token of the input. Formally, given the input sequence $h$, PWC is defined as $PWC(h) = \sigma(h * W_{pwc} + b_{pwc})$, where $\sigma$ represents the ReLU activation function, $*$ is the convolution operation, $W_{pwc} \in \mathbb{R}^{2d_h \times 2d_h}$ is the learnable weight of the convolution kernel, and $b_{pwc} \in \mathbb{R}^{2d_h}$ is the bias of the convolution kernel.
The output of the current GCN layer, $h^l_p$, is obtained by applying point-wise convolution to $h^l$: $h^l_p = PWC(h^l)$.
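Because the kernel size is 1, a point-wise convolution reduces to applying the same linear map and activation to every token independently. The following minimal sketch shows this; the function name `point_wise_conv` and the toy values are hypothetical, and ReLU is used as the activation as stated in the text.

```python
def point_wise_conv(H, W_pwc, b_pwc):
    """Apply ReLU(h . W_pwc + b_pwc) to every token vector h in H."""
    out = []
    for h in H:
        row = []
        for c in range(len(b_pwc)):
            value = sum(h_k * W_pwc[k][c] for k, h_k in enumerate(h)) + b_pwc[c]
            row.append(max(0.0, value))  # ReLU
        out.append(row)
    return out

H = [[1.0, -2.0],
     [0.5, 0.5]]           # two tokens, two features each
W_pwc = [[1.0, 0.0],
         [0.0, 1.0]]       # identity kernel for readability
b_pwc = [0.0, 1.0]

H_p = point_wise_conv(H, W_pwc, b_pwc)
```

Each output row depends only on the corresponding input row, so the sequence length and the token order are unchanged.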

Information Fusion
In this part, the syntactic information interacts with the semantic information, and the results are concatenated into the final representation.

Multi-Head Interactive Attention
Multi-head interactive attention (MHIA) is the general form of MHA in which $q$ is different from $k$. Given the syntactic information encoding $H^{gs}$, the context semantic encoding $H^{cs}$, and the target semantic encoding $H^{ts}$, the context-perceptive syntactic information $H^{cg}$ and the syntax-perceptive target information $H^{gt}$ are derived with MHIA.

Information Mosaic
The context-perceptive syntactic representation $H^{cg}$, the syntax-perceptive target representation $H^{gt}$, and the context semantic encoding $H^{cs}$ are each averaged by average pooling, and the pooled vectors are then concatenated into the final feature representation $u$.
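The pooling and concatenation step can be sketched as follows. The concatenation order (context-perceptive syntactic, syntax-perceptive target, context semantic) follows the order in which the text lists the three representations and is otherwise an assumption; the function names and the toy values are hypothetical.

```python
def average_pool(H):
    """Average a sequence of vectors over the token dimension."""
    n = len(H)
    return [sum(column) / n for column in zip(*H)]

def fuse(H_cg, H_gt, H_cs):
    """Concatenate the three pooled representations into the feature vector u."""
    return average_pool(H_cg) + average_pool(H_gt) + average_pool(H_cs)

H_cg = [[1.0, 3.0], [3.0, 5.0]]
H_gt = [[2.0, 2.0]]
H_cs = [[0.0, 4.0], [4.0, 0.0]]

u = fuse(H_cg, H_gt, H_cs)
```

The three sequences may have different lengths, since pooling removes the token dimension before concatenation.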

Sentiment Classification
After the final feature representation $u$ is obtained, it is fed into the softmax layer to obtain the probability distribution over the sentiment polarities of the target. Here $\hat{y} \in \mathbb{R}^c$ is the predicted distribution of the sentiment polarity, and $c$ is the number of classification categories. $W_u^T \in \mathbb{R}^{1 \times c}$ and $b_u \in \mathbb{R}^c$ are the learnable weight matrix and the bias, respectively.
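A minimal sketch of this classification layer is given below, assuming the usual form $\hat{y} = \mathrm{softmax}(W_u^T u + b_u)$, which is not reproduced in this text. The weight matrix is stored with one row per feature and one column per class, and the function names and toy values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(u, W_u, b_u):
    """Linear layer followed by softmax; returns one probability per class."""
    logits = [sum(u_k * W_u[k][j] for k, u_k in enumerate(u)) + b_u[j]
              for j in range(len(b_u))]
    return softmax(logits)

u = [1.0, 0.0]
W_u = [[2.0, 0.0, -1.0],
       [0.0, 0.0, 0.0]]     # 2 features, 3 polarity classes
b_u = [0.0, 0.0, 0.0]

y_hat = classify(u, W_u, b_u)
predicted_class = y_hat.index(max(y_hat))
```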

Model Training
In the model, the sum of the categorical cross-entropy and $L_2$ regularization is used as the loss function, and the back-propagation algorithm is employed to update the weights and parameters. In the loss, $i$ is the index of the $i$-th sample and $j$ is the index of the $j$-th sentiment category; $y$ is the true distribution of the sentiment polarity, $\hat{y}$ is the predicted distribution of the sentiment polarity, $c$ is the number of classification categories, $\theta$ denotes all trainable parameters, and $\lambda$ is the regularization coefficient.
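The objective can be sketched as follows, assuming the common form $-\sum_i \sum_j y_i^j \log \hat{y}_i^j + \lambda \lVert \theta \rVert^2$; the exact equation is not reproduced in this text, so the form of the regularization term in particular is an assumption. The function name `loss` and the toy values are hypothetical.

```python
import math

def loss(y_true, y_pred, params, lam):
    """Cross-entropy summed over samples and classes, plus lam * sum(theta^2)."""
    cross_entropy = -sum(
        y * math.log(p)
        for sample_true, sample_pred in zip(y_true, y_pred)
        for y, p in zip(sample_true, sample_pred)
        if y > 0                      # zero-probability classes contribute nothing
    )
    l2_penalty = lam * sum(theta * theta for theta in params)
    return cross_entropy + l2_penalty

y_true = [[1.0, 0.0, 0.0]]            # one sample, one-hot true polarity
y_pred = [[0.5, 0.25, 0.25]]          # predicted distribution
params = [1.0, 2.0]                   # flattened toy parameters
total = loss(y_true, y_pred, params, lam=0.1)
```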

Datasets
In order to verify the effectiveness of AEGCN, experiments were conducted on five datasets: a Twitter dataset used in [35], herein denoted "twitter"; rest14 and lap14 (SemEval 2014 Task 4 [36]); rest15 (SemEval 2015 Task 12 [37]); and rest16 (SemEval 2016 Task 5 [38]). Accuracy and macro-averaged F1 were selected as the evaluation metrics. The reported results are averages over three runs with random initialization. The experimental datasets are shown in Table 1.
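For reference, the two evaluation metrics can be computed as in the following sketch. Macro-averaged F1 is the unweighted mean of the per-class F1 scores; the function names and the toy labels are hypothetical and unrelated to the datasets above.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels=[0, 1, 2])
```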

Hyper-Parameters
In the experiments, for AEGCN-GloVe, the word embeddings were initialized from GloVe with a dimension of 300, and the learning rate was 0.001. The regularization coefficient was set to 0.00001, and the batch size was 32. To prevent over-fitting, the dropout rate was 0.5. For AEGCN-BERT, the embedding dimension was set to 768, the learning rate to $2 \times 10^{-5}$, and the regularization coefficient to 0.001; the dropout rate was 0.1 and the batch size was 16. For both AEGCN models, the number of attention heads was 3, the number of GCN layers was 2, and the model weights were initialized from a uniform distribution. The hidden layer dimension was 300. In addition, the AEGCN models utilized the Adam optimizer.
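The settings above can be collected in a small configuration sketch. The dictionary layout and the key names are hypothetical; only the values are taken from the text.

```python
# Settings shared by both model variants.
SHARED = {
    "n_heads": 3,
    "gcn_layers": 2,
    "hidden_dim": 300,
    "optimizer": "adam",
    "weight_init": "uniform",
}

# Variant-specific settings merged on top of the shared ones.
CONFIG = {
    "AEGCN-GloVe": dict(SHARED, embed_dim=300, learning_rate=1e-3,
                        l2_coefficient=1e-5, batch_size=32, dropout=0.5),
    "AEGCN-BERT": dict(SHARED, embed_dim=768, learning_rate=2e-5,
                       l2_coefficient=1e-3, batch_size=16, dropout=0.1),
}
```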

The Number of Heads in MHA
This paper explores the influence of the number of heads k on the experimental results. The results over dataset lap14 are shown in Figure 6, and similar results were achieved on the other four datasets. From Figure 6, it can be observed that the value of accuracy and F1-score fluctuated with increasing k. But when k = 3, the highest accuracy and F1 values appear. Then the values of accuracy and F1 decrease with the rising k. We speculate that due to the increase of k, the integration of semantic information from too many context words produces unnecessary interference, confusing the representation of the current words. Therefore, the performance of the model was better when k = 3.

Impact of GCN Layer Number
The number of GCN layers is also an important parameter that affects the performance of our model. We tested this on the lap14 dataset with different number of GCN layers R. The results are shown in Figure 7. From the above results, it can be seen that the model achieved the best performance when the number of GCN layers R = 2. However, when R was greater than 2, the performance of the model worsened with increasing GCN layers. The possible reason for this is that as the number of GCN layers R increases, the model parameters grow, causing the model to be more difficult to train, and leading to overfitting. In order to avoid too many training parameters and overfitting, this paper set the GCN layer number as 2.

Experimental Results
Eight baseline models are selected for comparison to evaluate the effectiveness of AEGCN, and the comparison of experimental results is shown in Table 2.
SVM [39] is a traditional support vector machine method based on complex feature engineering.
LSTM [8] obtains the hidden layer output of sentences with an LSTM, and then predicts the sentiment with a softmax classifier.
MemNet [6] regards the contexts as external memory, which makes the model benefit from a multi-hop architecture.
AOA-LSTM [40] obtains the hidden layer output of the contexts and target words through Bi-LSTM, then obtains the corresponding representation of the contexts and target words through the interactive learning of attention over attention, and finally acquires the polarity distribution of sentiment with a softmax classifier.
IAN [10] acquires the hidden output of the context and target words through LSTM, and then obtains the representations of the context and target words through the interactive learning of the interactive attention mechanism. It models the relationships between the target words and their contexts interactively. After the representations of the context and target words are concatenated, the polarity distribution of sentiment is obtained with a softmax classifier.
TNet-LF [12] proposes a method to generate target-specific representations of words in the sentence, incorporating a mechanism for preserving the original contextual information from the RNN layer.
AEN [22] eschews recurrence and uses an attentional encoder network to model the relation between the contexts and the specific targets.
ASGCN [15] puts forward a graph convolution network (GCN) on the dependency tree of sentences to take advantage of syntactic information and word dependency.
From the experimental results in Table 2, we can see that the AEGCN-GloVe model was slightly better than all the other models in the twitter and lap14 datasets. Compared with the baseline model ASGCN-DT, it obtained comparable results in the rest14 dataset. Compared with the baseline model ASGCN-DG, it still obtained comparable results on the rest15 dataset. However, in the rest16 dataset, the results were slightly inferior to those of the baseline model TNet-LF. In particular, AEGCN-BERT obtained new state-of-the-art performances on all datasets.
Being based on deep learning, the proposed model performed better than the traditional machine learning method. In Table 2, the SVM model proposed by Kiritchenko et al. [39] uses an SVM for classification, which relies on a large amount of manual feature extraction. AEGCN-GloVe requires no manual feature extraction, and its accuracy was 9.76%, 5.42%, and 0.88% higher than that of SVM on the twitter, lap14, and restaurant datasets, respectively. This shows that the deep learning model is suitable for aspect-specific sentiment analysis.
It is better to model context semantics by exploiting Bi-LSTM combined with a multi-head attention mechanism than by applying only a standard multiple-attention or multi-head attention mechanism. Taking MemNet as an example, it uses multiple hops to combine different attentions linearly, and its accuracy and F1 score were lower than those of AEGCN-GloVe on all five datasets. One possible reason is that once the traditional attention mechanism incorrectly assigns weights to words that are not related to determining the sentiment polarity of a specific target, repeating the traditional attention mechanism multiple times makes it more difficult for the model to predict the correct sentiment polarity of a specific target. In addition, AEN only uses the multi-head attention mechanism to semantically model the context and the specific targets. Without Bi-LSTM, it may not be possible to adequately consider the contextual semantics of the entire sentence from front to back and from back to front. Except for one index (rest14, F1), AEGCN-GloVe obtained higher values of accuracy and F1 score than AEN on three datasets (twitter, lap14, and rest14). This indicates that the semantic information extracted by combining Bi-LSTM and multi-head self-attention was richer.
Due to the combination of syntactic information, the effect of this model was better than the model that does not consider it. Although AOA emphasizes the influence between the contexts and the target words through attention over attention, it only achieved slightly higher accuracy and F1 score over rest16 dataset (0.11% higher) and lower values in the other datasets. Moreover, though IAN improves the interaction between the contexts and the target words through the interaction of the interactive attention mechanism, the accuracy and F1 score of the model in five datasets were lower than those for AEGCN-GloVe. TNet-LF could integrate the specific target information into word representation well, and the context-preserving mechanism could retain semantic information well. In the rest16 dataset, the performance of TNet-LF was slightly higher than AEGCN-GloVe (1.7% and 2.21% respectively). However, it was worse than AEGCN-GloVe in the other four datasets. We suppose that the sentences in the rest16 dataset are particularly dependent on the original semantic information of the text, and the application of syntactic information for targeted sentiment classification was successful.
The performance of the GCN with point-wise convolution was better than that of the GCN with an aspect-specific masking layer. In the ASGCN model, a multi-layered graph convolution structure is implemented on top of the LSTM output, followed by a masking mechanism that filters out non-aspect words and keeps only aspect-specific features. However, the contextual representations with syntactic information are then lost. It cannot be concluded that context words with syntactic information are useless in determining the sentiment polarity of a particular target, so this paper keeps all the context words with syntactic information and encodes them with multi-head self-attention. We believe that after the syntactic information is used to reshape the representation of each context word through the GCN, the words rich in syntactic information become more prominent after the parallel calculation of multi-head self-attention, which is more conducive to making full use of syntactic information to determine the sentiment polarity of specific targets. Among the five datasets, only three results of AEGCN-GloVe were lower than those of ASGCN, on rest14, rest15, and rest16 (by 0.87%, 1.02%, and 1.60%, respectively), and the other seven indexes were better than those of ASGCN, which supports the effectiveness of AEGCN-GloVe.
Compared with all the models listed in Table 2, AEGCN-BERT achieved new state-of-the-art results, which demonstrates the power of pre-trained BERT and the clear advantage of the BERT-based model over GloVe-based models.

Ablation Study
In order to further identify the level of benefit that each component of AEGCN-GloVe contributes to the model performance and the importance of each component, we conducted an ablation study on AEGCN-GloVe. The results are shown in Table 3. As we can see from Table 3, most results of AEGCN-GloVe ablations were inferior to AEGCN-GloVe in both accuracy and macro-F1 measure.

Ablated Graph Convolution Network
For AEGCN, a graph convolution network was deployed between the Bi-LSTM and multi-head self-attention in the proposed model, since it can reconstruct the representation of each word using syntactic information. We ablated the graph convolution network to examine the performance of AEGCN without it.
As a result of removing the GCN, the accuracy and F1-score on three datasets (lap14, rest14, and rest15) decreased, while the accuracy and F1-score on the twitter and rest16 datasets increased. This is because the sentences from the twitter dataset are colloquial and less grammatical. Since the syntactic structure of most sentences in the twitter dataset is not perfect, the introduction of syntactic information interferes with the prediction of the sentiment polarity of a specific target. TNet-LF is able to learn more abstract contextualized word features from deeper networks, and it achieved comparable results on the rest16 dataset in terms of both accuracy and macro-F1 (Table 2). This suggests that the twitter and rest16 datasets are less sensitive to syntactic information. Our experimental results demonstrate that syntactic information is quite helpful for targeted sentiment classification on most datasets.

Ablated Point-Wise Convolution
Point-wise convolution (PWC) is within every layer of the graph convolution network. We improved the GCN by adding PWC to every layer, aiming at better integrating the syntactic information. We ablated the PWC to see what would happen to the results if the GCN ran without PWC.
The AEGCN without PWC performed better than the baseline AEGCN on the rest16 dataset, while its performance on the lap14 dataset clearly worsened. On the twitter dataset, AEGCN without (w/o) PWC achieved almost the same performance as the baseline model. PWC is deployed to better learn and integrate the syntactic information representation of the words in the current GCN layer. Table 3 indicates that PWC was very significant for AEGCN, especially on the lap14 dataset.
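The role of PWC inside a GCN layer can be illustrated with a small sketch. This is an assumption-laden toy, not the exact AEGCN formulation: it uses a degree-normalised graph convolution with self-loops followed by a two-layer, kernel-size-1 convolution, and the dependency edges, dimensions, and random weights are all hypothetical.

```python
import numpy as np

def gcn_layer(h, adj, w, b):
    """Graph convolution over a dependency graph.

    Self-loops are added and each row is normalised by its degree,
    so every token aggregates itself and its syntactic neighbours.
    """
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1, keepdims=True)
    return np.maximum((a_hat @ h @ w) / deg + b, 0.0)

def pointwise_conv(h, w1, b1, w2, b2):
    """Kernel-size-1 convolution: the same two-layer transform is
    applied to every token independently of its neighbours."""
    return np.maximum(h @ w1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(0)
n_tokens, dim = 5, 4
h = rng.normal(size=(n_tokens, dim))

# Toy undirected dependency edges (hypothetical parse).
adj = np.zeros((n_tokens, n_tokens))
for i, j in [(0, 1), (1, 2), (1, 3), (3, 4)]:
    adj[i, j] = adj[j, i] = 1.0

w, b = rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim)
w1, b1 = rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim)
w2, b2 = rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim)

out_without_pwc = gcn_layer(h, adj, w, b)                          # ablated variant
out_with_pwc = pointwise_conv(out_without_pwc, w1, b1, w2, b2)     # GCN followed by PWC
print(out_without_pwc.shape, out_with_pwc.shape)
```

The graph convolution mixes information across tokens along dependency edges, while the point-wise step only re-projects each token's aggregated features; removing it, as in the ablation, leaves the neighbourhood aggregation unchanged.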

Ablated Multi-Head Self-Attention
Multi-head self-attention (MHSA) is mainly used to extract richer semantic information and encode the syntactic information. We ablated MHSA to examine the performance of AEGCN without it.
The removal of multi-head self-attention (MHSA) led to poor performance on the twitter dataset and a slight performance increase on the rest15 dataset, which indicates that the twitter dataset contains a great deal of semantic information and that the application of MHSA in our model could effectively capture it. Based on these results, we suppose that the rest15 dataset is more sensitive to syntactic information than the other datasets displayed in Table 3.

Ablated Multi-Head Interactive Attention
Multi-head interactive attention (MHIA) aims at assembling features and interactively learning the correlation between syntactic information and semantic information. Concatenation and pooling can replace MHIA, but then the learning process is no longer interactive.
Without MHIA, the performance of the proposed model on the five datasets was unsatisfactory. This shows that MHIA is crucial for AEGCN and its core architecture. Without the interaction of syntactic and semantic information, the results on three datasets (lap14, rest14, and rest15) were disastrous. Based on the results, we can say that the interaction of syntactic and semantic information is significant on most datasets, and especially on lap14, rest14, and rest15.
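The difference between interactive attention and the concatenation-and-pooling replacement can be sketched as follows. This is a simplified illustration under stated assumptions: learned projection matrices are omitted, keys are reused as values, the semantic encoding is arbitrarily chosen as the query side, and the names `multi_head_attention` and `concat_pool` are hypothetical rather than taken from the AEGCN code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key, n_heads):
    """Scaled dot-product attention split over n_heads.

    Learned projections are omitted and keys double as values.
    Passing the same matrix as query and key gives self-attention;
    passing two different encodings gives interactive attention.
    """
    n_q, dim = query.shape
    d_head = dim // n_heads
    q = query.reshape(n_q, n_heads, d_head).transpose(1, 0, 2)
    k = key.reshape(key.shape[0], n_heads, d_head).transpose(1, 0, 2)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = (weights @ k).transpose(1, 0, 2).reshape(n_q, dim)
    return out, weights

def concat_pool(semantic, syntactic):
    """Non-interactive replacement: mean-pool each encoding, then concatenate."""
    return np.concatenate([semantic.mean(axis=0), syntactic.mean(axis=0)])

rng = np.random.default_rng(0)
semantic = rng.normal(size=(6, 8))   # toy semantic encoding, 6 tokens
syntactic = rng.normal(size=(6, 8))  # toy syntactic encoding, 6 tokens

interactive, attn = multi_head_attention(semantic, syntactic, n_heads=2)
pooled = concat_pool(semantic, syntactic)
print(interactive.shape, attn.shape, pooled.shape)
```

In the interactive variant every output row depends on how one encoding scores against the other, whereas `concat_pool` summarises each encoding separately, so no cross-encoding weights are ever computed.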

AEGCN Ablations Analysis
According to Table 3, the performance of the AEGCN ablations was significantly reduced. Compared to the full AEGCN model, AEGCN without the GCN layer achieved limited performance on the lap14 and rest14 datasets, especially on lap14. AEGCN attained inferior performance when PWC was removed, since the model lost part of its ability to learn and integrate the syntactic information. We utilize multi-head self-attention to capture richer semantic information and to encode syntactic information. Its absence led to poor performance on three datasets (twitter, lap14, and rest14). Multi-head interactive attention is applied to interactively learn the features of the syntactic information and semantic information. The removal of MHIA was fatal, since performance dropped on all datasets, and dramatically so on some (lap14, rest14, and rest15). When AEGCN was run without the interaction of semantic and syntactic information, the performance decreased by 2%-3% on the lap14 and rest15 datasets. In conclusion, the experimental results show that each component of AEGCN is effective and contributes to achieving overall good results across the five datasets.

Conclusions and Future Work
In this paper, a model based on attention encoding and a graph convolution network is proposed for targeted sentiment classification. To address the loss of long-distance emotional words in semantic modeling and to allow parallel computation over the input data, this paper proposes a semantic encoding method that combines bidirectional LSTM with a multi-head self-attention mechanism. To make use of the syntactic information that most models ignore, we developed a graph convolution network that integrates point-wise convolution and is built over the syntactic dependency tree to encode the syntactic information. The syntactic information then interacts with the semantic information through multi-head interactive attention. The experimental results on five datasets (twitter, lap14, rest14, rest15, and rest16) show that the AEGCN model proposed in this paper was significantly better than the models based on traditional machine learning and other deep learning models, which supports its effectiveness.
Although the current model achieved good experimental results, there is still a great deal of work to be done. First, we intend to reduce the number of trainable parameters to make our model more lightweight. Second, extracting more of the original contextual semantic information will also be an important part of our future work. Finally, combining domain knowledge with syntactic information can be taken into consideration in future research.