Attention-Enhanced Graph Convolutional Networks for Aspect-Based Sentiment Classification with Multi-Head Attention

Abstract: The purpose of aspect-based sentiment classification is to identify the sentiment polarity of each aspect in a sentence. Recently, with the introduction of Graph Convolutional Networks (GCN), more and more studies have used sentence structure information to establish the connection between aspects and opinion words. However, the accuracy of these methods is limited by noise information and by dependency tree parsing performance. To solve these problems, we propose an attention-enhanced graph convolutional network (AEGCN) for aspect-based sentiment classification with multi-head attention (MHA). By combining MHA and GCN, our method makes better joint use of semantic and syntactic information, and an attention mechanism added to the GCN further enhances its performance. To verify the effectiveness of the proposed method, we conducted extensive experiments on five benchmark datasets. The experimental results show that our method makes more reasonable use of semantic and syntactic information and further improves the performance of GCN.
This paper aims to strengthen the interaction between syntactic and semantic information, make rational use of the advantages of graph convolutional networks and attention mechanisms, and use both the semantic and syntactic information of sentences for aspect-based sentiment prediction. However, we only let semantic and syntactic information interact in the last layer of the model. Future research could consider a multi-layer architecture in which semantic and syntactic information interact at every layer.


Introduction
Aspect-based sentiment classification (ABSC) [1] is a fine-grained subtask in the field of sentiment analysis. Its purpose is to identify the sentiment polarity of the aspects that explicitly appear in a sentence. For example, in the restaurant review "This restaurant has a good environment, but the price is a bit expensive", the sentiment polarities of the two aspects "environment" and "price" are positive and negative, respectively. In our research, aspects are usually nouns or noun phrases. The difficulty of the ABSC task lies in accurately finding the opinion words related to each aspect. In the example above, the opinion words corresponding to "environment" and "price" are "good" and "expensive", respectively.
Early work was mainly based on neural networks, such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) [2,3]. Since Tang et al. pointed out the importance of modeling the semantic relationship between context and aspects [4], more and more studies have considered introducing an attention mechanism on top of an RNN or CNN to establish connections between aspects and context words [5][6][7][8][9][10]. However, due to the complexity of sentences, the attention mechanism cannot always accurately capture the relationship between aspects and context words. For example, in the sentence "so delicious was the food but terrible servers", for the aspect "food" the attention mechanism may assign a higher weight to the word "terrible", which is closer to it.
Other works consider using sentence structure information to establish connections between aspects and opinion words. The main idea is to construct a dependency tree based on the syntactic structure of the sentence, and to use the dependency tree and Graph Convolutional Networks (GCN) to update the representation of the sentence [11][12][13][14]. There is no doubt that dependency trees can establish long-distance dependencies between aspects and opinion words. However, due to the limitations of the dependency tree itself, when the sentence structure is complex or the expression is colloquial, the model usually cannot correctly predict the sentiment polarity of the aspect: when the dependency tree fails to correctly connect the aspect and the opinion word, the model cannot accurately predict the aspect's sentiment polarity. In addition, in the process of using the dependency tree and GCN to update word representations, noise information is usually integrated into the new representations, causing the model to learn wrong parameters.
To solve these two problems, we propose an attention-enhanced graph convolutional network for aspect-based sentiment classification with multi-head attention. For the first problem, we combine the graph convolutional network with multi-head attention, using the strength of the multi-head attention mechanism at capturing contextual semantic information to alleviate the weakness of the graph convolutional network on data with unobvious syntactic features. For the second problem, we introduce an attention mechanism into the traditional graph convolutional network, alleviating the problem of introducing too much noise when updating node information by assigning appropriate attention weights to each adjacent node.
In this paper, we divide multi-head attention into multi-head self-attention (MHSA) and multi-head interactive attention (MHIA). The model is divided into two parts, and both parts use the same input. The first part captures contextual semantic information through the attention coding layer (ACL), and the second part integrates syntactic information through the attention-enhanced graph convolutional network. Finally, we use MHIA to integrate the above information and obtain the final feature representation. We conducted extensive experiments on five benchmark datasets. The experimental results show that our proposed method can effectively utilize syntactic and semantic information, and further improve the performance of graph convolutional networks.
Our contributions are as follows:

1. We introduced an attention mechanism into the graph convolutional network to enhance its performance.

2. We introduced multi-head self-attention to capture contextual semantic information, and used multi-head interactive attention to let semantic and syntactic information interact, obtaining a more complete feature representation.

3. To better match the dependency tree, we applied the whole-word-masking version of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to the task and achieved better performance.

4. The experimental results on five benchmark datasets show that our proposed model is effective compared with other mainstream models.

Related Work
A lot of research has shown that neural network models of word-sequence feature representations can capture context information well in ABSC, for example Convolutional Neural Networks (CNNs) [15,16], Recurrent Neural Networks (RNNs) [17], or a combination of the two, Convolutional Recurrent Neural Networks (CRNNs) [18]. Recently, in view of the good performance of the attention mechanism in modeling contextual information, more and more works have considered attention-based neural network models for ABSC. The main idea is to capture and establish the connection between aspects and opinion words through the attention mechanism. Strictly speaking, attention-based neural network models can also be viewed as a way of using sentence structure information, because the distance between an aspect and its opinion word is generally not too far. Wang et al. [5] proposed an attention-based LSTM to identify important sentiment information related to aspects. Li et al. [19] introduced a multi-layer attention mechanism to capture an aspect's long-distance opinion words. For a similar purpose, Tang et al. [20] proposed a deep memory network with multi-hop attention and explicit memory. Fan et al. [21] proposed a multi-granularity attention network. In addition, given the advantages of the multi-head attention mechanism in modeling contextual semantic relations, Song et al. [22] proposed an attention encoder network to draw the hidden states and semantic interactions between target and context words. Zhu et al. [23] proposed a novel Interactive Dual Attention Network (IDAN) model that interactively learns representations of contextual semantics and sentiment tendency information.
Sun et al. [24] utilize a transformer module to learn word-level representations of aspects and context, respectively, and a tree transformer module to obtain phrase-level representations of contexts. In addition, they adopt a dual-pooling method and a multi-grained attention network to extract high-quality aspect-context interactive representations. Zhang et al. [25] introduced multi-head interactive attention, building on the work of Song et al. [22], to enhance the interaction between aspect terms and context.
Other works consider using sentence structure information for ABSC. Aspects are usually the core of this task; therefore, using sentence structure information to establish connections between aspects and opinion words can improve the performance of sentiment classification models. Since the graph convolutional network [26] was first introduced, its excellent performance in processing graph-structured information has led to its rapid application to various tasks in natural language processing (NLP), with good results. Marcheggiani and Titov [27] proposed a GCN-based semantic role labeling model. Huang and Carley [28] proposed a novel target-dependent graph attention network, which explicitly utilizes the dependency relationships among words. Wang et al. [29] defined a unified aspect-oriented dependency tree structure rooted at a target aspect, and proposed a relational graph attention network (GAT) to encode the new tree structure for sentiment prediction. Tang et al. [30] proposed a dependency graph enhanced dual-transformer network that jointly considers the flat representations learnt from a transformer and the graph-based representations learnt from the corresponding dependency graph.
In addition, Gao et al. [31] constructed three target-dependent variants of the BERT model for target-dependent sentiment classification. Chen and Qian [32] proposed a transfer capsule network model to transfer sentence-level semantic knowledge from document-level sentiment classification to ABSC. Tang et al. [33] proposed a progressive self-supervised attention learning method to strengthen the performance of the attention mechanism. Sun et al. [34] transformed the ABSC task into a sentence-pair classification task by constructing auxiliary sentences.

Methodology
Given a sentence sequence W^c = {w^c_1, w^c_2, ..., w^c_n} composed of n words and an aspect sequence W^a = {w^a_1, w^a_2, ..., w^a_m} composed of m words, the goal of this model is to predict the sentiment polarity of sentence W^c towards aspect W^a. Figure 1 shows the network architecture of our proposed attention-enhanced graph convolutional network (AEGCN) model. We use an attention coding layer to capture semantic information (it contains a multi-head self-attention and a point-wise convolution transformation), use the syntactic dependency tree and the attention-enhanced graph convolutional network to capture syntactic information, and use the multi-head interactive attention mechanism to let the two kinds of information interact. Finally, we pool and concatenate the outputs of the multi-head interactive attention as the feature vector for sentiment prediction. Next, we introduce the components of the AEGCN model.

Input Layer
We use two methods to obtain embedding vectors and contextualized representations. The first method is pre-trained GloVe static embedding with a BiLSTM. GloVe is a popular embedding method, and we use it to embed each word token into a low-dimensional real-valued vector space. Through the pre-trained GloVe embedding matrix L ∈ R^(d_m×|V|), each word is mapped to its corresponding embedding vector e_i ∈ R^(d_m×1), where d_m is the embedding dimension of the word vectors and |V| is the size of the vocabulary. Then, we feed the word embedding matrix into the BiLSTM to obtain the hidden state output of the input layer. The BiLSTM is an extension of the RNN: it alleviates the vanishing and exploding gradient problems of standard RNNs and makes the hidden state output at any time step contain both preceding and subsequent timing information.
The second method is pre-trained BERT. It is worth noting that we use the Whole Word Masking variant of BERT-Large. The reason is that BERT uses the WordPiece tokenizer, which cuts certain words into several subwords and then randomly selects subwords to mask during pre-training; to match our word-level dependency tree, these subwords would first have to be recombined into whole words. Therefore, compared with the word vectors obtained by random subword masking, the word vectors obtained by whole word masking are more consistent with our model.
We use the output H^c = {h^c_1, h^c_2, ..., h^c_n} ∈ R^(d_hid×n) of BERT or BiLSTM as the contextual representation of the input text.

Attention Coding Layer
As shown in Figure 2, the Attention Coding Layer (ACL) includes a Multi-Head Attention (MHA) and a Point-wise Convolution Transformation (PCT). We use MHA to capture the semantic information of the sentence and obtain hidden states based on contextual semantic information, and we further transform the semantic information extracted by MHA through the PCT.

Multi-Head Self-Attention
Multi-head attention (MHA) uses multiple heads to capture the semantic information of the context in parallel; each attention head focuses on a different aspect of the input, and the outputs of all heads are finally combined to obtain the semantic representation of the input sentence. According to whether the two inputs of MHA are the same, we divide it into multi-head self-attention (MHSA) and multi-head interactive attention (MHIA). In this layer, we use MHSA to capture contextual semantic information. Formally, given two identical inputs H^c = {h^c_1, h^c_2, ..., h^c_n}, MHSA is defined as:

MHSA(H^c) = (head_1 ⊕ head_2 ⊕ ... ⊕ head_h) W^O

head_i = Attention(H^c W_i^Q, H^c W_i^K, H^c W_i^V)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where h is the number of attention heads in multi-head attention, ⊕ denotes vector concatenation, W^O ∈ R^(d_hid×d_hid) is a parameter matrix, head_i is the output of the i-th attention head, and d_k is the dimension of h^c_i.
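The MHSA computation described above can be sketched in plain NumPy as follows (a minimal sketch; the function and variable names are ours, and the per-head projection matrices would be learned parameters in the actual model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, Wo):
    """H: (n, d_hid) contextual representations. Wq/Wk/Wv are lists holding
    one (d_hid, d_k) projection per head; Wo: (h*d_k, d_hid) combines the
    concatenated heads, as in MHSA(H^c) = (head_1 ⊕ ... ⊕ head_h) W^O."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = H @ Wq_i, H @ Wk_i, H @ Wv_i
        scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n, n) weights
        heads.append(scores @ V)                          # head_i: (n, d_k)
    return np.concatenate(heads, axis=-1) @ Wo            # (n, d_hid)
```

Each head attends over all n token positions, so distant context words can contribute directly to every token's new representation.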

Point-Wise Convolution Transformation
We perform two convolution operations with kernel size 1 on the output of MHSA. The first convolution uses the ReLU activation function, and the second uses a linear activation. Formally, given the input sequence h, the PCT is defined as:

PCT(h) = ReLU(h ∗ W_c^1 + b_c^1) ∗ W_c^2 + b_c^2

where ∗ denotes the convolution operation, W_c^1 ∈ R^(d_hid×d_hid) and W_c^2 ∈ R^(d_hid×d_hid) are the weights of the two convolution kernels, and b_c^1 and b_c^2 are their biases. We denote the output of PCT as H^A.
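Since a convolution with kernel size 1 is just an affine map applied independently at each token position, the PCT reduces to two matrix products. A minimal sketch (variable names ours):

```python
import numpy as np

def pct(H, W1, b1, W2, b2):
    """Point-wise convolution transformation over H: (n, d_hid).
    Kernel size 1 means each position is transformed independently."""
    hidden = np.maximum(0.0, H @ W1 + b1)  # first 1x1 conv + ReLU
    return hidden @ W2 + b2                # second 1x1 conv, linear
```

Because no position mixes with any other, reordering the input rows simply reorders the output rows, which is the defining property of a kernel-size-1 convolution.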

AEGCN Layer
In order to use the syntactic information of a sentence when predicting its sentiment polarity, we constructed an L-layer AEGCN to capture syntactic information. First, we use the spaCy toolkit to construct a dependency tree for each sentence, and then use these dependency trees to obtain the corresponding adjacency matrix A ∈ R^(n×n), where n is the length of the sentence. Element A_ij in the i-th row and j-th column indicates whether the i-th and j-th words are adjacent in the dependency tree: 1 means adjacent and 0 means not adjacent. In particular, the diagonal elements of the adjacency matrix are all 1, that is, each word is adjacent to itself. After obtaining the adjacency matrix A, we can use it to capture the syntactic information of the sentence. Figure 3 shows an example of an AEGCN layer. We denote the output of layer l in the AEGCN as H^l = {h^l_1, h^l_2, ..., h^l_n}, l ∈ [1, L]. If the set of nodes adjacent to node i is N_i, the output of the i-th node in the l-th AEGCN layer can be expressed as:

h_i^l = ReLU( Σ_{j∈N_i} e_ij^l A_ij W^l h_j^{l-1} + b^l )

where the weight W^l and bias b^l are parameters to be learned, A_ij is the adjacency coefficient, and e_ij^l is the normalized attention coefficient of nodes i and j in the l-th AEGCN layer. The output of the last layer of the AEGCN is H^L = {h^L_1, h^L_2, ..., h^L_n}.
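The adjacency construction and the attention-enhanced layer update can be sketched as below. The text does not fully specify how the attention coefficients e_ij are scored, so this sketch assumes a dot-product score over node features, normalized over each node's neighbours; the function names and that scoring choice are our assumptions:

```python
import numpy as np

def build_adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix from dependency-tree edges,
    with self-loops on the diagonal (each word adjacent to itself)."""
    A = np.eye(n)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

def aegcn_layer(H, A, W, b):
    """One attention-enhanced GCN layer over H: (n, d_hid). Neighbours are
    weighted by normalized attention coefficients e_ij instead of being
    averaged uniformly, which damps noisy neighbours."""
    scores = np.where(A > 0, H @ H.T, -1e9)           # mask non-neighbours
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    e = e / e.sum(axis=-1, keepdims=True)             # normalize over N_i
    return np.maximum(0.0, (e * A) @ H @ W + b)       # ReLU aggregation
```

For example, for the two-word fragment "delicious food" with a single dependency edge (0, 1), `build_adjacency(2, [(0, 1)])` yields an all-ones 2×2 matrix: each word is adjacent to itself and to the other.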

Interaction Layer
To realize the interaction between syntactic and semantic information, we add an MHIA after the ACL and after the AEGCN, respectively. Writing the output of the ACL as H^A and that of the AEGCN as H^L, the outputs H^AI and H^LI of the interaction layer are calculated as follows:

H^AI = MHIA(H^A, H^La)

H^LI = MHIA(H^L, H^Aa)

where H^Aa and H^La denote the aspect representations in H^A and H^L, respectively.

Output Layer
As shown in Figure 1, we first apply average pooling to the two outputs of the interaction layer, and then concatenate the pooled outputs as the final feature representation. The final feature representation h_o is computed as follows:

h_o = pool(H^AI) ⊕ pool(H^LI)

Finally, we feed the feature representation h_o into a fully connected softmax layer to obtain the probability distribution p over sentiment polarities:

p = softmax(W_p h_o + b_p)

where W_p and b_p are learnable parameters, and d_p is the number of sentiment polarity categories.
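The pooling and prediction step can be sketched as follows, under the assumption (ours) that the pooling is an average over the token dimension and that W_p has shape (d_p, 2·d_hid):

```python
import numpy as np

def predict_polarity(H_AI, H_LI, Wp, bp):
    """Average-pool the two (n, d_hid) interaction outputs, concatenate
    them into h_o of size 2*d_hid, then apply a fully connected softmax
    layer to get the polarity distribution p."""
    h_o = np.concatenate([H_AI.mean(axis=0), H_LI.mean(axis=0)])
    logits = Wp @ h_o + bp                 # Wp: (d_p, 2*d_hid)
    e = np.exp(logits - logits.max())      # stable softmax
    return e / e.sum()                     # probability distribution p
```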

Training
The model is trained by a standard gradient descent algorithm, and the objective function is defined as minimizing the cross-entropy loss with L2 regularization:

L(Θ) = − Σ_{(x,y)∈D} Σ_{c=1}^{d_p} y_c log p_c + λ‖Θ‖_2^2

where D is the training set, y is the one-hot ground-truth polarity, p is the predicted distribution, λ is the regularization coefficient, and Θ denotes all trainable parameters.
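For a single training example, this objective can be sketched as (names ours; y is the one-hot gold polarity):

```python
import numpy as np

def objective(p, y, params, lam=1e-5):
    """Cross-entropy of predicted distribution p against one-hot y,
    plus L2 regularization over all trainable parameter matrices."""
    ce = -np.sum(y * np.log(p + 1e-12))             # cross-entropy term
    l2 = lam * sum(np.sum(w ** 2) for w in params)  # L2 penalty on Θ
    return ce + l2
```

The small epsilon inside the log guards against p containing exact zeros; the full training loss is this quantity summed over the training set.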

Experiments
In this section, we first introduce the five datasets and the experimental settings. Next, we compare our proposed model with other popular models and analyze the results. Finally, we analyze our proposed model experimentally from multiple perspectives.

Datasets and Experimental Settings
In order to make a comprehensive comparison with baseline models and state-of-the-art models, we conducted experiments on five datasets. Twitter consists of Twitter posts collected by Dong et al. [2]; the other four (Lap14, Rest14, Rest15, Rest16) come from SemEval 2014 task 4 [35], SemEval 2015 task 12 [36], and SemEval 2016 task 5 [37], with SemEval 2014 task 4 contributing the two datasets Lap14 and Rest14. Detailed statistics for each dataset are shown in Table 1. In the experiments, we used two different input methods. In AEGCN-GloVe, we use 300-dimensional pre-trained GloVe vectors as static embeddings, and the dimension of the hidden states is also set to 300. In AEGCN-BERT, we use pre-trained BERT as the embedding layer and fine-tune it on our task; both the embedding dimension and the hidden state dimension are 768. All weight parameters in the model (except BERT) are initialized from a uniform distribution. We use the Adam optimizer [38], with different learning rates for GloVe static embedding and BERT embedding: 1 × 10^−3 and 3 × 10^−5, respectively. The coefficient of the L2 regularization term is 1 × 10^−5. The dropout rate and batch size are 0.1 and 64, respectively. In addition, based on the best experimental results, the number of AEGCN layers is set to 2. We use Accuracy and Macro-F1 as the criteria for evaluating model performance. The reported results are the average of three runs with random initialization.

Model Comparisons
In order to comprehensively evaluate and analyze the performance of our proposed models, we compared them with a series of baseline and state-of-the-art models. According to their method types, we divide these models into attention-based and syntax-based models.

ATAE-LSTM: They proposed to use an LSTM with an attention mechanism to obtain a vector representation for sentiment prediction, appending the aspect embedding to each context word embedding.
MemNet: They proposed to use external memory to model the context representation, using a multi-hop attention architecture.
IAN [39]: They designed an interactive model of aspects and context, using a BiRNN and an attention mechanism to interactively learn aspect and context representations.
AOA: They proposed an attention-over-attention neural network to model aspects and sentences in a joint way and explicitly capture the interaction between aspects and context sentences.
T-MGAN: They proposed a transformer-based multi-granularity attention network (T-MGAN), which uses a tree transformer module to obtain phrase-level representations, and a dual-pooling operation with a multi-granularity attention network to extract high-quality feature representations.

IMAN: They improved on the AEN model by adding a multi-head interactive attention mechanism to the last layer, letting context information and aspect information interact to obtain the final feature representation.
AEN-GloVe: They proposed an attention encoder network to model the relationship between context and specific aspects; the embedding layer uses GloVe static embeddings.
AEN-BERT: Different from the AEN-GloVe model, a pre-trained BERT-base model is used in the embedding layer.

LSTM + SynATT [40]: They proposed a method that better captures the semantic meaning of aspects, integrating syntactic information into the attention mechanism.
CDT: They propose to use BiLSTM to obtain the feature representation of the sentence, and to further enhance the embedding by directly performing convolution operations on the dependency tree.
ASGCN: They proposed to learn aspect-specific feature representations through GCN and dependency trees to solve the long-distance multi-word dependency problem.
BiGCN [41]: They built a concept hierarchy on both the syntactic and lexical graphs for differentiating various types of dependency relations or lexical word pairs, designing a bi-level interactive graph convolution network to fully exploit these two graphs.

Results and Analysis
Table 2 shows the experimental results of our proposed model and the comparison models. From the data in the table, we can draw the following conclusions. The performance of our proposed model is stronger than all comparison models on most datasets, and the improvement is particularly obvious when pre-trained BERT is used as the embedding; the experimental results prove the effectiveness of our model. Models based on graph convolutional networks and dependency trees are significantly better than attention-based models at capturing long-distance dependency information, which reflects the superiority of graph convolutional networks in the ABSC task. Our model lets semantic and syntactic information interact through the multi-head attention mechanism and alleviates the impact of the limitations of the dependency tree.
Compared with the AEN model, our model achieves a significant improvement. The AEN model models the context and aspect words separately, extracts semantic features through a multi-head attention mechanism, and lets context information and aspect information interact to obtain feature representations. Its effectiveness depends heavily on whether the multi-head attention mechanism can accurately establish the connection between aspect words and context. However, due to the weakness of the attention mechanism at capturing long-distance dependency information and the complexity of sentence structure itself, attention alone cannot accurately model the relationship between context and aspect words.
The IMAN model is an improvement on the AEN model: a multi-head interactive attention mechanism added to the last layer lets context information and aspect information interact, with good results. Our model achieved better results on all datasets except Lap14, which shows that using sentence structure information as a supplementary signal for determining sentiment polarity can further improve model performance. On the Lap14 dataset our model achieved suboptimal results; we suspect that this dataset may be insensitive to syntactic information.
The T-MGAN model also uses sentence structure information. It uses a tree transformer module to capture phrase-level grammatical information and obtain phrase-level feature representations. However, it considers only phrase information, not the global structure of the sentence. The experimental results of our model are better than those of T-MGAN, which shows that the global structure information of sentences is positively helpful for aspect-based sentiment analysis.
Compared with the GCN-based models, our model shows an improvement on all datasets. According to our analysis, GCN-based models improve significantly over traditional attention-based neural network models in constructing long-distance multi-word dependencies, but only if the dependency tree is complete and effective. When a sentence is too complex, the dependency tree cannot accurately establish the relationship between aspects and opinion words, which degrades model performance. Secondly, in the process of introducing syntactic information via the dependency tree and graph convolutional network, noise is introduced as well; this problem becomes more obvious as the graph convolutional network gets deeper. Our model combines the multi-head attention mechanism with the graph convolutional network, adds semantic information on top of syntactic information, and lets the two kinds of information interact to obtain a more complete feature representation, thereby enhancing the accuracy of the model.

Ablation Study
In order to further study the influence of each component of AEGCN on performance, we designed several ablation experiments. The results are shown in Table 3 (using Accuracy as the evaluation metric).
First, we removed the attention mechanism in the graph convolutional network (w/o att); the results dropped slightly, indicating that using the attention mechanism to assign weights to the syntax-related neighbors of each node improves model performance. We then removed the MHIA behind the ACL and the AEGCN, respectively (w/o MHIA 1 and w/o MHIA 2). The results show that both components play a positive role in learning semantic information. Compared with MHIA 1, MHIA 2 has a greater impact on the model. We believe the reason is that the graph convolutional network also captures noise in the process of capturing syntactic information, which reduces the model's accuracy when judging sentiment polarity.

The attention visualization column in Table 4 shows the attention scores of each model, with colors running from darker to lighter according to the scores. In the first example, "delicious food but terrible environment", the sentence contains two aspects, "food" and "environment", and two opinion words, "delicious" and "terrible". The attention-based AEN model cannot capture the connection between them well, leading to an incorrect prediction. In the second example, owing to the complexity of the sentence, the AEN model again fails to correctly model the connection between the aspect words and the opinion words, and its attention focuses on the wrong tokens. In the ASGCN model, since "lovely" is also close to "son" in the sentence structure, it is updated into the sentence representation during node updating, which causes the model to assign a high weight to "lovely" when computing the attention scores. When the dependency tree contains noisy information or the syntactic signal is weak, the ASGCN model cannot correctly model the relationship between the aspect words and the opinion words, which leads to prediction errors. Our model correctly predicted the sentiment polarity of both samples, which means that it makes better use of the semantic and syntactic information of the sentence when processing complex sentences, thereby improving performance to a certain extent while maintaining good stability across datasets.
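To make the "attention-enhanced" aggregation concrete, the following is a minimal sketch of one graph-convolution step in which each node attends over its dependency-tree neighbors instead of averaging them uniformly, as a plain GCN would. The function name, the scalar `w_att` parameter, and the scaled dot-product scoring are illustrative assumptions, not the paper's exact parameterization:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_gcn_layer(h, adj, w_att):
    """One attention-enhanced graph-convolution step over a dependency graph.

    h     : list of node feature vectors (one per token)
    adj   : adjacency matrix from the dependency tree (1 = edge, self-loops included)
    w_att : scalar attention parameter (toy stand-in for learned attention weights)
    """
    n, d = len(h), len(h[0])
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        # Attention score of neighbor j w.r.t. node i: scaled dot product.
        scores = [w_att * sum(h[i][k] * h[j][k] for k in range(d)) / math.sqrt(d)
                  for j in nbrs]
        alphas = softmax(scores)
        # Weighted aggregation replaces the uniform mean of a vanilla GCN,
        # so syntactically relevant neighbors contribute more than noisy ones.
        out.append([sum(a * h[j][k] for a, j in zip(alphas, nbrs))
                    for k in range(d)])
    return out
```

With `w_att = 0` the scores are uniform and the layer degenerates to the plain-GCN mean over neighbors, which is exactly the w/o att ablation above.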


Impact of the AEGCN Layers
To verify the influence of the number of AEGCN layers on the model, we compared different numbers of AEGCN layers on the Lap14 dataset, again using accuracy as the evaluation metric; the experimental results are shown in Figure 4. The figure shows that model performance begins to decline once the number of AEGCN layers exceeds two. Owing to the limitations of the dependency tree itself, when too many AEGCN layers are stacked, a large amount of noise is also propagated into the last layer's representation, which hurts model performance.
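One way to see why depth hurts is that each aggregation round widens a node's receptive field by one hop in the dependency graph, so an L-layer stack mixes in every node within L hops, including any mis-parsed edges. A small illustrative helper (not from the paper) that computes this receptive field:

```python
def khop_reach(adj, layers):
    """For each node, return the set of nodes whose features can reach it
    after `layers` rounds of neighborhood aggregation, i.e. its receptive
    field in a stacked GCN. Self-loops are assumed present in `adj`."""
    n = len(adj)
    reach = [{i} for i in range(n)]
    for _ in range(layers):
        # Each round, a node absorbs the receptive fields of all its neighbors.
        reach = [set().union(*(reach[j] for j in range(n) if adj[i][j]))
                 for i in range(n)]
    return reach
```

On a chain-shaped parse, one layer exposes a token only to its direct syntactic neighbors, while three layers already expose it to the whole chain, so a single wrong dependency edge contaminates every node's final representation.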


Conclusions
Recently, neural network models based on dependency trees have attracted widespread attention in ABSC. However, because dependency-tree parsing is imperfect, noise is updated into the sentence representation in the process of introducing syntactic information. We therefore propose an AEGCN model with multi-head attention. The model uses a multi-head self-attention mechanism to obtain the input's semantic information and an attention-enhanced graph convolutional network over the dependency tree to obtain its syntactic information. A multi-head interactive attention mechanism then integrates the semantic and syntactic information into the final feature vector used to predict sentiment polarity. Experimental results on five datasets show that interactively integrating syntactic and semantic information can indeed effectively improve model performance. This paper aims to strengthen the interaction between syntactic and semantic information, make rational use of the respective advantages of graph convolutional networks and attention mechanisms, and use both kinds of information for aspect-based sentiment prediction. However, we only let semantic and syntactic information interact in the last layer of the model; future research could build a multi-layer architecture in which the two interact at every layer.
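The fusion step described above can be sketched as an interactive attention in which the semantic stream supplies the queries and the syntactic stream supplies the keys and values. The single-head, parameter-free version below is a toy stand-in for the multi-head interactive attention, under the assumption that both streams produce equal-dimension token vectors:

```python
import math

def interactive_attention(sem, syn):
    """Toy single-head interactive attention: each semantic vector (query)
    attends over the syntactic vectors (keys = values) and returns the
    attention-weighted fusion.  sem, syn: lists of equal-length vectors."""
    d = len(sem[0])
    fused = []
    for q in sem:
        # Scaled dot-product scores of the query against every syntactic vector.
        scores = [sum(q[k] * kv[k] for k in range(d)) / math.sqrt(d) for kv in syn]
        m = max(scores)
        es = [math.exp(s - m) for s in scores]
        z = sum(es)
        alphas = [e / z for e in es]
        # Weighted sum of the syntactic vectors gives the fused representation.
        fused.append([sum(a * kv[k] for a, kv in zip(alphas, syn)) for k in range(d)])
    return fused
```

A multi-head variant would split the dimension into several subspaces, run this attention per subspace, and concatenate the results; applying it per layer rather than only at the end is exactly the future direction noted above.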

Figure 1. Overview of the proposed model for aspect-based sentiment classification.


Figure 2. The structure of the Attention Coding Layer.


Table 1. Detailed statistics of the five datasets used in our experiments.

Table 2. Model comparison results of accuracy and macro-F1 (%) on five datasets. The best results of each category in each dataset are shown in bold; the best results over all models are bolded and underlined. "-" means not reported.

Table 3. Ablation study results of accuracy (%) on five datasets.

To better understand our model, we compared it with the AEN and ASGCN models on several test examples; the experimental results are shown in Table 4.