Transformer-Based Graph Convolutional Network for Sentiment Analysis

Abstract: Sentiment analysis is an essential research topic in the field of natural language processing (NLP) and has attracted the attention of many researchers in the last few years. Recently, deep neural network (DNN) models have been used for sentiment analysis tasks, achieving promising results. Although these models can analyze sequences of arbitrary length, utilizing them in the feature extraction layer of a DNN increases the dimensionality of the feature space. More recently, graph neural networks (GNNs) have achieved promising performance in different NLP tasks. However, previous models cannot be transferred to a large corpus and neglect the heterogeneity of textual graphs. To overcome these difficulties, we propose a new Transformer-based graph convolutional network for heterogeneous graphs called Sentiment Transformer Graph Convolutional Network (ST-GCN). To the best of our knowledge, this is the first study to model the sentiment corpus as a heterogeneous graph and learn document and word embeddings using the proposed sentiment graph transformer neural network. In addition, our model offers an easy mechanism to fuse node positional information for graph datasets using Laplacian eigenvectors. Extensive experiments on four standard datasets show that our model outperforms the existing state-of-the-art models.


Introduction
With the rapid growth of textual content on the Internet, such as on social networks and e-commerce websites, the need for contextual processing and mining of the subjective information that text holds is increasing [1]. Sentiment analysis, also called opinion mining, is an automatic technology to extract, process, judge, and summarize opinions, attitudes, and emotions from opinionated data. Nowadays, text sentiment analysis has become essential for many fields such as movie recommendation, e-commerce, and public opinion analysis [2]. For example, sentiment analysis aims to obtain the sentiment tendency of people's opinions towards products, hot events, or any specific topic, which helps human decision-making [3]. Generally, researchers have explored three types of sentiment analysis approaches: dictionary-based methods, machine learning-based methods, and deep learning-based methods.
Sentiment dictionary-based methods utilize dictionaries to determine the sentiment words in the given text and obtain the sentiment values. Then, using the sentiment calculation rules, the sentiment tendency is calculated [4,5]. The implementation of this approach is easy and does not require labeling samples. However, the quality of sentiment analysis depends on the sentiment dictionaries, which are insufficient to cover the sentiment words and lack the domain words, leading to the low quality of the sentiment analysis.
Later, to address the problem of dictionary dependency, machine learning approaches were proposed; such approaches utilize the support vector machine (SVM) algorithm, the naive Bayes algorithm, and graph-based semi-supervised classification algorithms to analyze text sentiment [6][7][8]. Despite the improvement that machine learning brought to sentiment analysis, it relies strongly on the quality of a corpus labeled with polarity.
In recent years, deep learning models have attracted the attention of many researchers as a way to address the problem of feature extraction. Various deep learning-based methods have been proposed for sentiment analysis, achieving promising results compared to machine learning methods in sentiment association and sentiment classification [3][9][10][11]. However, deep learning models have difficulty extracting more comprehensive sentimental and emotional features since a large amount of emotional information is not utilized. As a result, more researchers try to integrate emotional information [12] and language knowledge [13] into the models [14][15][16]. Despite the great success of these models, they still struggle to extract more comprehensive emotional features from text since they rely heavily on emotional resources and text information.
More recently, graph neural networks [17], or graph representation learning, have emerged as a new research field that has received much attention from researchers. In graph-based methods, the entire corpus is represented as a graph [18]. In graph embeddings, graph convolutional networks have proven to be effective at tasks involving knowledge representation and can retain the global structure information of a graph. However, most of the existing GNNs are built to learn node representations on fixed and homogeneous graphs. When learning representations on a misspecified graph or a heterogeneous graph with multiple types of nodes and edges, these restrictions become increasingly severe. In this work, we present a novel text graph transformer network to address these GNN issues. The text graph transformer network learns a new graph structure that can determine the useful connections between nodes that are not directly connected and learn the soft selection of edge types and complex relations.
To summarize, our contributions are as follows:
1. We propose a novel Sentiment Transformer Graph Convolutional Network (ST-GCN) that learns a new graph structure on a heterogeneous graph, including determining the useful connections between nodes that are not directly connected, and learning the soft selection of edge types and complex relations for learning node representations for sentiment classification. To the best of our knowledge, this is the first study to model the sentiment corpus as a heterogeneous graph and learn document and word embeddings using the proposed text graph transformer network;
2. Inspired by the widespread use of positional encoding in NLP transformer models and current research on node positional features in GNNs, our model offers an easy mechanism to fuse node positional information for graph datasets using Laplacian eigenvectors;
3. Results on several sentiment benchmark datasets demonstrate that our model outperforms the state-of-the-art sentiment classification methods.

Sentiment Analysis
Sentiment analysis has its origins in psychology, sociology, and anthropology, which focus on human emotions [19][20][21]. Scholars have conducted extensive related research because of its usefulness in online review monitoring and business competitive intelligence. To date, several methods have been used for such analysis. They can be classified into two broad groups: traditional methods based on feature engineering, which essentially use dictionaries and machine learning approaches, and modern methods based on deep learning.
Early models performed sentiment analysis based on a set of rules, relying on sets of emotion dictionaries, and a large amount of labeled data was required for feature engineering. Liu et al. [22] defined emotion as a tuple of (holder, target, polarity, time), where holder represents the opinion's author, target refers to the related subject, polarity is the category of the expressed emotion, and time is the time of the evaluation. Another method [23] classifies sentiments by combining individual word-level sentiments. Ref. [24] introduced a generative model that jointly models emotion words, subject words, and emotion polarity in a sentence as a triple. The main drawback of this method is the resulting high-dimensional feature space. To address this problem, many works have used feature selection techniques [25,26] together with various machine learning approaches. Of the various machine learning classification methods used to classify users' sentiments from text, decision trees, LDA, Naive Bayes, Support Vector Machines (SVM), and artificial neural networks are the most common and have achieved higher performance [9,22,27,28]. However, these methods need massive training data and are often slow. To address these problems, unsupervised lexicon-based methods were proposed, making use of both supervised and lexicon-based approaches [29,30]. Following this idea, many other methods [31,32] have been introduced.
In recent years, many researchers have applied deep neural networks to sentiment classification. Unlike traditional machine learning methods, they can automatically complete the feature generation step and learn richer representations. Ref. [33] used a convolutional neural network (CNN) based model and connected a max-pooling layer after each convolution to extract features from the text. The emotion polarity is determined after passing the features through a fully connected layer. Ref. [34] adopted dynamic max-pooling to capture fine-grained features. The authors learn the embedding of text regions by applying CNNs to high-dimensional text data. Later, Ref. [35] used a CNN model based on letter-level features, combining six convolutional layers and three fully connected layers for large-scale text classification datasets. Although CNN models are faster than RNNs because of parallelization, they can only extract the local features in the filter region. Recurrent neural networks (RNNs) introduce a memory unit that gives the network the ability to retain information. Hence, RNNs can consider long-distance dependencies within texts. However, original RNNs suffer from gradient dispersion and gradient disappearance, which affect the learning process [3]. To solve this problem, the long short-term memory (LSTM) model has been used [36]. LSTMs use a gate mechanism that can keep the connection within instances and capture the relationships between words. Recently, attention-based sentiment analysis models have been used and outperform previous methods. Yang et al. [37] propose an attention-based model that mirrors the hierarchical structure of documents and applies two attention mechanism layers, at the sentence level and the word level.
More recently, graph neural networks (GNNs) have become a powerful approach in both industry and academia. GNNs have been widely used in NLP tasks [38][39][40]. Ref. [18] proposed Text-GCN, which uses a heterogeneous graph whose nodes are documents and the words that appear in them. An edge between two words means the words appear in the same text, and an edge between a text and a word means the word appears in the text. Edge weights are calculated using TF-IDF for word-text edges and positive point-wise mutual information (PPMI) for word-word edges. Next, the graph representation of the data is learned using a graph convolutional network. The task, which can be seen as node classification, suffers from memory problems because a single graph has to be built for the whole dataset. Moreover, the graph is built ignoring the order information of words. To overcome the former drawback, Huang et al. [41] proposed another GNN-based method for text classification using a text-level graph for each input text. Thus, they perform graph classification instead of node classification. However, they ignore the rich word positional information, which is critical in sentiment analysis. To address the problems above, we propose a transformer-based graph convolutional network that follows up on [18], adds word positional encoding to the word features, and uses a new batching mechanism to alleviate the memory problem.

Transformer Convolutional Networks
NLP problems, such as language modeling and machine translation, have been solved by recurrent neural networks (RNNs). RNNs factor computation along the positions of the elements in the input and output sequences, which preserves the order of the sentence. This intrinsically sequential nature prevents parallel computation within training examples, which becomes a problem for long sequences because memory constraints limit batching across samples. To overcome this limitation, Refs. [42,43] proposed factorization tricks and conditional computation, respectively, which notably increase computational efficiency. However, they still make use of sequential computation. To mitigate the effect of sequential computation, many researchers have used attention mechanisms [44,45], as they allow the modeling of dependencies regardless of their distance in the input or output sequence. Attention mechanisms break the memory constraint problem and have become an indispensable part of sequence modeling, but such attention was used in conjunction with RNNs. Ref. [46] proposed the transformer architecture, which avoids recurrence and instead relies entirely on an attention mechanism to describe global dependencies between input and output. Unlike RNNs, transformers do not necessarily process data in order. Instead, the attention mechanism provides context for any position in the input sequence, so positions can be processed in parallel. This feature allows greater parallelization than RNNs and therefore reduces training times. Thus, attention mechanisms alone, without any RNN, can match the performance of RNNs with attention. In this work, we propose a sentiment transformer graph convolutional network to predict sentiment.

Method
In this section, we describe the framework of the proposed model, as shown in Figure 1. First, we describe the data preprocessing step. Next, we introduce textual graph building and the word embedding representation. Then, we introduce the transformer convolutional networks. Finally, we present the text graph transformer convolutional network.

Data Preprocessing
In this section, we describe the data preprocessing step. First, we remove irrelevant data from the reviews: punctuation, URLs, mentions, numbers, and non-English words are removed using the regular expression library in Python. Second, we define our own stop word list, which contains words that carry no sentiment, such as articles and determiners, because the commonly used stop word lists (e.g., NLTK stop words (https://www.nltk.org/nltk_data/ accessed on 14 October 2021)) contain words that have a sentiment role. We then remove the defined stop words from the long review datasets. We tokenize the text into words on white space and convert all uppercase letters to lowercase. The output tokens are used to build the text graph.
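The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the stop word set, the regular expressions, and the function name `preprocess` are assumptions, and stripping non-ASCII characters is used here as a crude stand-in for removing non-English words.

```python
import re

# Hypothetical stop-word list (articles and determiners only);
# the actual list used in the paper is not given.
STOP_WORDS = {"a", "an", "the", "this", "that", "these", "those"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(review):
    """Clean one review and return its whitespace-separated tokens."""
    text = review.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Drops punctuation, digits, and non-ASCII characters in one pass.
    text = NON_LETTER_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

URLs and mentions are removed before the generic character filter so that their fragments do not survive as stray tokens.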

Textual Graph Building
In this section, we construct the text graph from the corpus. Let G = (N, E) be a graph, where N is the node set and E is the edge set. We represent the textual graph as follows:

Node Assignment
Each review and each unique keyword is represented as a node in the text graph. The number of nodes in the textual graph is the number of reviews D plus the number of unique keywords in the vocabulary V of the entire corpus.

Edging
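Two types of edges are built between nodes. Term frequency-inverse document frequency (TF-IDF) is used to weight the edges between a review node $r_i$ and a keyword node $r_j$, and point-wise mutual information (PMI) is used to weight the edges between pairs of keyword nodes that co-occur within a fixed-size sliding window. We build an adjacency matrix that holds the edge weights, which determine the strength of the relationship between two nodes.
We build the adjacency matrix $A$ (the edge weights) as follows:
$$A_{ij} = \begin{cases} \mathrm{PMI}(i, j) & \text{if } i \text{ and } j \text{ are keywords,} \\ \text{TF-IDF}_{ij} & \text{if } i \text{ is a review and } j \text{ is a keyword,} \\ 0 & \text{otherwise.} \end{cases}$$
The PMI for a keyword pair is calculated as follows:
$$\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)}.$$
Given the total number of sliding windows $\#W$ in the entire review corpus, the number of sliding windows $\#W(i, j)$ in which keywords $i$ and $j$ appear together, and the number of sliding windows $\#W(i)$ in which keyword $i$ occurs, $p(i, j)$ and $p(i)$ are calculated as:
$$p(i, j) = \frac{\#W(i, j)}{\#W}, \qquad p(i) = \frac{\#W(i)}{\#W}.$$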
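As an illustration of the edge weighting described above, the following sketch computes PMI weights from sliding-window counts and TF-IDF weights for review-keyword pairs. It is not the authors' implementation: the function names, the window size, and the particular TF-IDF variant (relative term frequency times log(N / df)) are assumptions chosen for brevity.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(docs, window_size=3):
    """Return {(word_i, word_j): PMI} from sliding-window co-occurrence counts."""
    n_windows = 0
    single = Counter()  # #W(i): windows containing word i
    pair = Counter()    # #W(i, j): windows containing both i and j
    for tokens in docs:
        # A document shorter than the window contributes a single window.
        n_spans = max(1, len(tokens) - window_size + 1)
        for start in range(n_spans):
            window = set(tokens[start:start + window_size])
            n_windows += 1
            single.update(window)
            pair.update(combinations(sorted(window), 2))
    edges = {}
    for (i, j), n_ij in pair.items():
        p_ij = n_ij / n_windows
        p_i, p_j = single[i] / n_windows, single[j] / n_windows
        edges[(i, j)] = math.log(p_ij / (p_i * p_j))
    return edges


def tfidf_edges(docs):
    """Return {(doc_index, word): weight} using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(word for tokens in docs for word in set(tokens))
    edges = {}
    for d, tokens in enumerate(docs):
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            edges[(d, word)] = tf * math.log(n_docs / df[word])
    return edges
```

Keyword pairs are stored with their words in sorted order, so each undirected edge appears once; a symmetric adjacency matrix can be filled from the two dictionaries.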

Embedding (Word Representation)
In most natural language processing applications, words are used as features. The most popular word vector representations are the distributed representation and the one-hot representation [27,47]. However, the one-hot representation has various problems, such as an excessively large vector dimension, sparse word vectors, and ignoring semantic associations between words. Although the distributed representation addresses the problems of the one-hot representation, improving the accuracy of the word vectors and the training speed remains crucial [48]. Recently, different word vectors have been applied to sentiment analysis [49][50][51]. However, the current word representations used in sentiment analysis do not take into account the sentiment information contained in words. In our work, we address the above problems by using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model [52], which uses transformers to learn contextual information from corpora, to obtain the review node embeddings in the textual graph. To the best of our knowledge, this is the first study that utilizes the BERT model for document node embeddings in sentiment analysis tasks.

Graph Transformer Convolutional Networks
Figure 2 shows the architecture of the proposed model. The model consists of a stack of operators: positional encoding, feature transform, sampling, message computation, multi-head attention, and aggregation.

The Positional Encoding
In NLP, positional encoding (PE) encodes the position of each word in a sentence. This is difficult to carry over to a graph because symmetries in the graph make it non-trivial to define a canonical position for each node. Meanwhile, words in the text need disambiguation; that is, words with the same spelling but different meanings need to be differentiated. Ideally, each node of a graph should have a unique PE, nodes that are close in the graph should have similar PEs, and nodes that are far from each other should have different PEs. Node position embedding has been explored in recent GNN works [53][54][55][56] to learn both positional and structural features of nodes in graphs. We build on the success of recent works on positional information in GNNs [54,56] and use pre-computed Laplacian eigenvectors as positional encodings, which allow us to differentiate isomorphic nodes. The eigenvectors are defined using the factorization of the graph Laplacian matrix:
$$\Delta = I - D^{-1/2} A D^{-1/2} = \nu^{T} \Lambda \nu,$$
where $A$ is the $n \times n$ adjacency matrix, $D$ is the degree matrix, and $\nu$ and $\Lambda$ are the eigenvectors and eigenvalues, respectively. We add the pre-computed Laplacian eigenvectors to the node features, which are used as input to the first layer.
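A minimal NumPy sketch of pre-computing Laplacian eigenvector positional encodings is given below, assuming an undirected graph with a symmetric adjacency matrix. The function name, the choice to skip the first (trivial) eigenvector, and the handling of isolated nodes are assumptions, not details stated in the text.

```python
import numpy as np


def laplacian_positional_encoding(adj, k):
    """Return k eigenvectors of I - D^{-1/2} A D^{-1/2}, one row per node.

    The eigenvectors with the smallest eigenvalues after the first are kept.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    degree = adj.sum(axis=1)
    # Guard against isolated nodes (degree 0) when inverting.
    inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(np.maximum(degree, 1e-12)), 0.0)
    lap = np.eye(n) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(lap)
    # Skip the first eigenvector and keep the next k as node encodings.
    return eigvecs[:, 1:k + 1]
```

Eigenvectors are only defined up to sign, so two runs or two libraries may return encodings that differ by a sign flip per column; a model consuming them has to tolerate that ambiguity.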

Feature Transform Operator
We input the node and edge features described above into the graph transformer. For a graph $G$ with node features $x_i \in \mathbb{R}^{d_n}$ for each node $v_i$ and edge features $e_{ij} \in \mathbb{R}^{d_e}$ for each edge between nodes $v_i$ and $v_j$, where $d_n$ and $d_e$ denote the node feature size and edge feature size, respectively, the input node features $x_i$ and edge features $e_{ij}$ are passed through a linear projection to embed them into $d$-dimensional hidden features $\hat{h}_i^0$ and $e_{ij}^0$:
$$\hat{h}_i^0 = A^0 x_i + a^0, \qquad e_{ij}^0 = B^0 e_{ij} + b^0,$$
where $A^0 \in \mathbb{R}^{d \times d_n}$, $B^0 \in \mathbb{R}^{d \times d_e}$, and $a^0, b^0 \in \mathbb{R}^d$ are parameters of the linear projection layer. We then embed the pre-computed node PE of dimension $k$ using a linear projection, giving $\lambda_i^0$, and add it to the node features to obtain $h_i^0 = \hat{h}_i^0 + \lambda_i^0$. Note that the Laplacian positional encoding is added to the node features at the first layer only.
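The feature transform can be illustrated with a small pure-Python sketch. The helper names and the separate projection parameters for the positional encoding (called `pe_proj` here) are hypothetical; the text only states that the PE is embedded by a linear projection and added to the node features at the first layer.

```python
def linear(weight, bias, x):
    """Compute weight @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def embed_node(x, pe, node_proj, pe_proj):
    """First-layer node feature: projected input plus projected PE.

    node_proj and pe_proj are (weight, bias) pairs mapping to d dimensions.
    """
    h_hat = linear(*node_proj, x)   # d-dimensional projection of x
    lam = linear(*pe_proj, pe)      # d-dimensional projection of the k-dim PE
    return [a + b for a, b in zip(h_hat, lam)]
```

Both projections must map into the same hidden dimension d so that the element-wise sum is well defined.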

Message Computation Operator
Based on attention mechanisms, the message computation operator makes it possible to focus on the most relevant neighboring nodes to improve information aggregation. Our message computation operator aims to learn an importance weight $w_{ij}$ for each edge relationship $e_{ij}$ between the two corresponding nodes $v_i$ and $v_j$. We better utilize edge attribute information by designing an attention layer with edge features (see Figure 2). We maintain a node-symmetric edge feature representation pipeline for propagating edge features. The update for a layer $l$ is defined with parameters $Q^{k,l}, K^{k,l}, V^{k,l}, E^{k,l} \in \mathbb{R}^{d_k \times d}$ and $O_h^l, O_e^l \in \mathbb{R}^{d \times d}$, where $k \in \{1, 2, \ldots, H\}$ indexes the attention heads, $H$ denotes the number of heads, $L$ the number of layers, $d$ the hidden dimension, and $d_k = d/H$ the dimension of each head. Note that $h_i^l$ is the feature of the $i$-th node at the $l$-th layer.
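As a hedged sketch only, one formulation consistent with the parameters listed above follows the edge-feature graph transformer layer on which such designs are commonly based. The intermediate score $\hat{w}_{ij}^{k,l}$, the neighborhood $\mathcal{N}_i$, the concatenation operator $\Vert$, and the sum over neighbors are assumptions; the authors' exact update may differ (the Aggregation Operator section, for instance, uses max).

```latex
\begin{aligned}
\hat{w}_{ij}^{k,l} &= \left( \frac{Q^{k,l} h_i^{l} \cdot K^{k,l} h_j^{l}}{\sqrt{d_k}} \right) \cdot E^{k,l} e_{ij}^{l}, \\
w_{ij}^{k,l} &= \operatorname{softmax}_{j}\!\left( \hat{w}_{ij}^{k,l} \right), \\
\hat{h}_i^{l+1} &= O_h^{l} \, \Big\Vert_{k=1}^{H} \Big( \sum_{j \in \mathcal{N}_i} w_{ij}^{k,l} \, V^{k,l} h_j^{l} \Big), \\
\hat{e}_{ij}^{l+1} &= O_e^{l} \, \Big\Vert_{k=1}^{H} \left( \hat{w}_{ij}^{k,l} \right).
\end{aligned}
```

Under this reading, the edge features modulate the query-key score before the softmax, and the pre-softmax scores are themselves propagated as the updated edge representation.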

Multi-Head Operator
To stabilize the learning process, we follow [46] and perform multiple attention operations independently. The representations output by the attention heads for each node $v_i$ are then concatenated or averaged to generate the final representation $h_i$.
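The two ways of combining head outputs mentioned above can be sketched as follows. The function name and the `mode` argument are illustrative assumptions.

```python
def combine_heads(head_outputs, mode="concat"):
    """Combine per-head vectors for one node by concatenation or averaging."""
    if mode == "concat":
        return [value for head in head_outputs for value in head]
    if mode == "mean":
        n_heads = len(head_outputs)
        return [sum(values) / n_heads for values in zip(*head_outputs)]
    raise ValueError(f"unknown mode: {mode}")
```

Concatenation yields a vector of size H times the head dimension, whereas averaging keeps the head dimension unchanged, which is why averaging is often preferred in a final layer.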

Aggregation Operator
To combine features from multiple neighbors into the representation $h_i$, an aggregation function is required. We use the max function as the aggregator.
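A max aggregator over neighbor messages can be sketched as below, assuming the maximum is taken element-wise across the incoming message vectors; the text does not specify this detail, and the function name is hypothetical.

```python
def max_aggregate(messages):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    if not messages:
        raise ValueError("a node needs at least one incoming message")
    return [max(values) for values in zip(*messages)]
```

Unlike a sum or mean, the result does not depend on how many neighbors a node has, only on the strongest signal in each feature dimension.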

Baselines
The proposed model is compared with multiple state-of-the-art sentiment analysis models as follows:
• RGWE: Unsupervised methods, in particular neural network-based approaches, exploit unstructured data to generate and retrieve hidden sentiment information by identifying the constraints of conjunctions on the positive or negative semantic orientations [57];
• Seninfo+TF-IDF: an improved word representation method, which integrates the contribution of sentiment information into the traditional TF-IDF algorithm and generates weighted word vectors [58];
• Re(Glove): a word vector refinement model that refines pre-trained word vectors using sentiment intensity scores provided by sentiment lexicons, which improves each word vector and performs better in sentiment analysis [59];
• CHIM: a model in which the authors represent attributes as chunk-wise importance weight matrices. The authors consider four locations to inject attributes (i.e., encoding, embedding, classifier, and attention) with a simple BiLSTM [60]. In our comparison, we use the embedding injection location since it achieved the highest accuracy score;
• BERT-pair-TextCNN: a representation framework called BERT-pair-Networks (p-BERTs), in which BERT is used to encode sentences for sentiment classification of a single sentence, with an auxiliary sentence and feature extraction on top [67].

Datasets
We select four classical public datasets to evaluate the proposed ST-GCN model. The statistics of the datasets are shown in Table 1. For the datasets that have a standard train/valid/test split, such as SemEval [68] and SST-B [36], we conduct our experiments according to the standard split. For the datasets that do not have a standard split, we split the data 7:1:2 to obtain the corresponding train/valid/test sets. We also made sure that the intersection of the training and test sets was not empty to avoid technical terms influencing sentiment analysis.
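A 7:1:2 split can be produced with a short helper such as the one below. The shuffling, the fixed seed, and the function name are assumptions; the text does not say how samples were assigned to each part.

```python
import random


def split_dataset(samples, ratios=(7, 1, 2), seed=0):
    """Shuffle and split samples into train/valid/test by integer ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_valid = len(items) * ratios[1] // total
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test
```

Using a seeded generator keeps the split reproducible across runs, which matters when comparing against baselines on the same partition.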

Experiments Settings
ST-GCN is implemented using PyTorch and is optimized with the Adam optimizer. Training and experiments are run on an NVIDIA GeForce GTX 1080 Ti graphics card. We select the values of the learning parameters at which the model achieves the highest accuracy on the validation samples. The learning rate α is set to 0.0005, L2 regularization is set to $10^{-6}$, and the dropout rate is set to 0.3 for the best performance. The model is trained for 100 epochs with an early-stopping strategy. For baseline models, we either run the code provided by the authors using the parameters described in their papers or use the results reported in previous work [57].
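The training schedule described above can be sketched as a framework-independent early-stopping loop. The learning rate, L2 coefficient, dropout rate, and epoch limit come from the text; the `patience` value, the dictionary keys, and the `run_epoch` callback are hypothetical. In PyTorch, the first two values would typically be passed as the `lr` and `weight_decay` arguments of `torch.optim.Adam`.

```python
# Values reported in the text, except `patience`, which is an assumed choice.
CONFIG = {
    "lr": 5e-4,
    "weight_decay": 1e-6,  # L2 regularization
    "dropout": 0.3,
    "max_epochs": 100,
    "patience": 10,
}


def train_with_early_stopping(run_epoch, max_epochs=100, patience=10):
    """Call run_epoch(epoch) -> validation accuracy until it stops improving.

    Returns (best_epoch, best_accuracy).
    """
    best_acc, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        acc = run_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_acc
```

In practice `run_epoch` would train for one epoch and return validation accuracy, and the model weights from `best_epoch` would be the ones kept for testing.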

Evaluation Criteria
To evaluate the performance of the proposed ST-GCN model, we use two main evaluation criteria, namely Accuracy (Acc) and the F1 measure (F1). These criteria have been used extensively in text classification and sentiment analysis tasks [69]. Accuracy is computed as follows:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}.$$
To calculate the F1 measure, we first compute the Precision (Pr) and Recall (Re) as follows:
$$\mathrm{Pr} = \frac{TP}{TP + FP}, \qquad \mathrm{Re} = \frac{TP}{TP + FN}.$$
Then the F1 is calculated as follows:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}},$$
where TN, TP, FN, and FP are the numbers of true negatives, true positives, false negatives, and false positives, respectively [69].
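These criteria can be computed directly from the four confusion-matrix counts, as in the sketch below for the binary case. The function name and the convention of returning 0.0 when a denominator is zero are assumptions; how scores are averaged across classes for the multi-class datasets is not stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, with 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives, accuracy is 85/100 and recall is 40/50.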

Comparison Results
The optimal parameters that achieved the best results in our model are shown in Table 2. The proposed model is compared with 12 models on four public datasets. The main results are reported in Tables 3 and 4. From the results in Table 3, we see that the proposed model achieves better classification accuracy than the baseline state-of-the-art models on all datasets. For example, the classification performance is improved by 2.63% and 0.43% over SMART RoBERTa and BERT_pair_RCNN, respectively. The classification accuracy of the proposed model reaches 95.43% on SST-B, 94.95% on IMDB, and 72.7% on Yelp.
We also report the F1-score of the proposed model compared with five state-of-the-art models. The results in Table 4 show that our model outperforms the baseline models on the four datasets. For example, our model achieves 74.12% on the SemEval dataset, 95.11% on SST-B, 93.52% on IMDB, and 50.2% on the Yelp dataset. The F1-score is improved by 1.23% and 3.95% over RCNN and BiLSTM on SST-B, respectively.
Looking more closely, the BERT-based models achieve better classification results than the conventional deep learning models. We can also see that the neural network models perform better than the machine learning methods.

Table 3. The sentiment classification accuracy of different models over datasets. The best score on each task produced by a single model is in bold and "-" denotes the missed result.

Removing the less frequent words from tweets may affect the performance of sentiment analysis. We conduct an ablation study to test the impact of removing the less frequent words. We delete the words that occur fewer than five times in the entire corpus. The results in Tables 5 and 6 show that removing the less frequent words slightly degrades the performance. For example, the sentiment accuracy decreases by 0.21% on the SST-B dataset and by 0.42%. We also test the influence of our predefined stop words. The results in Tables 5 and 6 show that using the NLTK stop words affects the sentiment accuracy.

The number of passes over the training set is called the number of epochs. The model's generalization ability improves as the number of epochs increases. However, if the number of epochs is too large, over-fitting can easily arise, reducing the model's generalization capability. As a result, selecting an appropriate number of epochs is critical. Figure 3 shows that as the number of epochs increases, the classification performance (accuracy) of the model gradually increases. It tends to stabilize at 60 epochs.

Learning Rate
When optimizing weights and biases, identifying an appropriate learning rate is critical. If the learning rate is too high, it is easy to overshoot the optimum, causing training to become unstable. If the learning rate is too low, training takes excessively long. The model's classification performance at various learning rates is depicted in Figure 4.

Conclusions and Future Work
In this research, we propose a Transformer-based graph convolutional network for sentiment analysis. We formulate the problem as a node classification task and learn the representation of nodes on a heterogeneous graph through message passing. We show that using a transformer to aggregate local substructures with appropriate positional encoding is a very efficient node representation strategy, and that multi-head attention allows a simple interpretation of the model. The learned graph structure leads to a more efficient node representation, resulting in peak performance without any predefined metapath from domain knowledge. Comprehensive experiments illustrate the effectiveness of the proposed model. ST-GCN outperforms previous cutting-edge models on four real-world datasets: SemEval, SST-B, IMDB, and Yelp 2014. In addition to generalizing the ST-GCN design to inductive settings, interesting future directions include using dynamic neighborhood aggregation operators to improve classification performance. As several heterogeneous graph datasets have recently been studied for other network analysis tasks, such as link prediction and graph classification, applying ST-GCN to these tasks is another interesting future direction.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: