Benchmarking Deep Learning Methods for Aspect Level Sentiment Classification

Abstract: With the advancements in processing units and the easy availability of cloud-based GPU servers, many deep learning-based methods have been proposed in the Aspect Level Sentiment Classification (ALSC) literature. With this increase in the number of deep learning methods proposed for ALSC, it has become difficult to ascertain the performance difference of one method over another. To this end, our study provides a statistical comparison of the performance of 35 recent deep learning methods with respect to three performance metrics: Accuracy, Macro F1 score, and Time.


Introduction
The widespread use of e-commerce and social media in the 21st century has led to the generation of massive unstructured data which is publicly accessible. This unstructured data primarily consists of user reviews regarding products and services as well as opinions and emotions on social and political issues. The unstructured data may be in the form of text, images, audio, video, and emoticons. The automated analysis of this unstructured data is of enormous importance to successful business organizations and governments. This has led to the emergence of Affective Computing and Sentiment Analysis as a specific discipline within Artificial Intelligence [1].
Analysis of sentiments from textual opinions can be carried out at various levels: document level, sentence level, and aspect level [2]. In contrast to coarse-grained, overall sentiment analysis, Aspect Based Sentiment Analysis (ABSA) analyzes the sentiments of specific attributes or aspects of a product or service. The task of ABSA is further divided into three main sub-tasks: Aspect Extraction (AE), Aspect Category Detection (ACD), and Aspect Level Sentiment Classification (ALSC). In ABSA, first, an aspect of a product or service is extracted from the text; this task is known as Aspect Extraction (AE). Next, the extracted aspects are mapped to a specific category; this task is known as Aspect Category Detection (ACD). Finally, the sentiment polarity of each aspect is determined; this task is known as Aspect Level Sentiment Classification (ALSC).
Traditionally, ABSA is performed either with the help of statistical methods or using machine learning with efficient feature engineering techniques [2][3][4]. Feature engineering is a time-consuming task, and performing ABSA using traditional methods requires large domain-specific datasets and expert knowledge [5,6]. As an alternative, Deep Learning (DL) based methods are competent to learn continuous features from data without any feature engineering. In addition, deep learning methods are also efficient in capturing the relatedness between context and target. The easy availability of GPU and cloud-based computational resources has made it feasible to train deep neural networks efficiently at a low cost. Thus, in recent years, many deep learning methods, mainly Convolutional Neural Networks (CNN), Memory Networks, Recurrent Neural Networks (RNN), and their multiple variants, have been proposed for ALSC [7]. The difference between traditional and deep learning-based ALSC can be well understood with the help of Figure 1. With a surge in the usage of deep learning methods to perform the ALSC task [8], it is imperative to ascertain whether newly proposed methods improve over previous models in a statistically significant way. When proposing a new method, researchers typically compare it against only a small subset of the existing deep learning methods. In addition, none of the comparative studies [7,8] perform a statistical comparison of newly proposed models with existing deep learning-based models for ALSC. Furthermore, existing studies do not compare newly proposed deep learning methods with existing methods in terms of training time.
The significant contribution of our study is that it provides a statistical comparison of state-of-the-art deep learning methods available in the ALSC literature up to 2021, including RNNs, Memory Networks, CNNs, the latest Hybrid Networks, and BERT-based methods. The statistical comparison is carried out for three evaluation metrics: Accuracy, Macro F1 score, and Time. To the best of our knowledge, this is the first study in which training time is considered for evaluating the performance of deep learning methods in ALSC.
The statistical comparison performed in our paper seeks statistical evidence of the enhanced performance of recently proposed advanced deep learning methods in ALSC. The Friedman test [9] and post hoc tests, viz. the Nemenyi test [10] and the Wilcoxon test [9,11], are applied to the experimental results of various deep learning methods across eight datasets of different domains.
This study addresses the following research question (RQ): Is there any statistically significant difference in the performance of various deep learning methods proposed in ALSC literature from 2016 to 2021 in terms of Accuracy, Macro F1 score, and Time?
The rest of this paper is organized as follows: Section 2 introduces ABSA, its sub-tasks, and ALSC. Section 3 discusses the technical details of the deep learning methods studied in this work. The information related to datasets and experimental settings, along with statistical test details, is explained in Section 4. Section 5 provides an analysis of results and answers the research question addressed in this study. Finally, the conclusion and future work are presented in Section 6.

Overview of ABSA
This section is divided into two subparts. Section 2.1 discusses ABSA and its various subtasks in detail, and Section 2.2 describes the ALSC task in detail.

Aspect Based Sentiment Analysis
The fine-grained analysis of sentiments for extracting various aspects and detecting the polarities of the extracted aspects [12] is referred to as Aspect Based Sentiment Analysis (ABSA). ABSA can be further divided into three main sub-tasks [13]: (a) Aspect Extraction (AE) or Opinion Target Extraction (OTE), (b) Detecting the Aspect Category, and (c) Determining the sentiment polarity of an aspect. Many studies have been carried out in recent years dealing with one, two, or all three subtasks of ABSA.
For example, consider the sentence in Figure 2, "Awesome Thai food, price friendly but poor ambience". Here, Thai food, price, and ambience are aspects, where 'Thai food' belongs to category {food}, 'price' belongs to category {cost}, and 'ambience' belongs to category {miscellaneous}. Further, the sentiment polarity of 'Thai food', 'price', and 'ambience' is positive, positive, and negative, respectively. Since this study deals with the statistical comparison of various deep learning methods proposed in the ALSC literature, the ALSC task is explained in detail in the subsequent section.

Aspect Level Sentiment Classification (ALSC)
The aspects discussed in the ALSC task can be either implicit or explicit. An implicit aspect expresses an opinion about some feature of a product without explicitly using a target term for that feature. In contrast, an explicit aspect expresses an opinion about some feature of a product by explicitly mentioning the target term. Thus, ALSC is often interchangeably called target-dependent sentiment classification, and the terms 'aspect' and 'target' are used interchangeably by many researchers. For a better understanding of these two terms, consider the sentence in Figure 3: "The Mobile Phone is quite bulky, but the camera is great". Here we have two types of aspects of the mobile phone handset, explicit and implicit [2]. The first phrase, "The Mobile phone is bulky", talks about the implicit aspect 'weight', while the second phrase, "the camera is great", describes the explicit aspect 'camera'. So, 'camera' is an explicit aspect term as well as the target word. In short, target and aspect can be used interchangeably when dealing with explicit aspect terms. Finally, the task of ALSC is to map these aspects to suitable sentiment polarities. As per Figure 3, the polarity of the explicit aspect 'camera' is positive, while 'bulky' refers to the implicit aspect 'weight' of the mobile phone, whose polarity is negative. Most of the research in ALSC has been carried out for explicit aspects, and thus the sentiment classification of implicit aspects is out of the scope of this study.

Deep Learning Methods for Aspect Level Sentiment Classification
This study aims to provide a statistical comparison of various deep learning methods proposed in ALSC literature. Section 3.1 discusses the recent work in deep learning-based ALSC. Afterward, Section 3.2 discusses the data modelling procedure for deep learning-based ALSC.

Recent Trends in Aspect Level Sentiment Classification
ALSC can be performed using three main approaches: unsupervised, semi-supervised, and supervised. The unsupervised and semi-supervised techniques mainly follow corpus-based and lexicon-based approaches [14]. The corpus-based approach utilizes large domain-specific corpora for generating relevant information, which requires substantial manual effort, extensive training, and big data. The lexicon-based approach (also known as the knowledge-based approach) works with external sentiment knowledge bases known as lexicons. Such techniques are entirely dependent on the quality of the knowledge base and often suffer from limited vocabulary (out-of-vocabulary words) [15].
Supervised ALSC can be carried out using conventional or deep learning methods. The conventional (also known as machine learning-based) techniques require extensive feature engineering, while deep learning methods can work efficiently without feature engineering. Thus, ALSC researchers have recently been inclined more towards deep learning-based approaches. Deep learning is a learning paradigm that utilizes artificial neural networks for an efficient learning process and has recently gained importance in many tasks related to NLP, such as text summarization [16], machine translation [17], and question answering [18]. Deep learning methods utilized for sentiment analysis have also proven to perform better than traditional methods. The ensemble of deep learning methods with symbolic models has also been leveraged to create sentiment lexicons. SenticNet 5 [19] and SenticNet 6 [20] are lexicons generated by combining the symbolic and sub-symbolic paradigms of AI: the symbolic paradigm refers to the use of logic and semantic networks, and the sub-symbolic paradigm refers to the usage of deep learning methods to encode the meaning of words in the lexicon.
In ALSC, the classification relies on the semantic structure of sentences; thus, researchers have widely utilized LSTMs in the field of ALSC. LSTM, with its capability of handling non-linear data, has proven successful for this task. Tang et al. [21] used LSTM for the ALSC task for the first time and proposed two variants of LSTM called TD-LSTM and TC-LSTM. Later, Wang et al. [22] leveraged the idea of combining aspect embeddings with LSTM and proposed an attention-based variant of LSTM known as ATAE-LSTM.
The integration of attention mechanisms in deep learning methods was initially proposed for computer vision tasks [23]. It has also been successfully applied on many NLP tasks [24,25]. Attention-based methods have the capability of capturing the significance of the context words. This feature makes attention-based deep learning methods more promising for ALSC [22,[26][27][28].
Along with attention, memory networks have also been utilized for ALSC. Tang et al. [29] proposed a deep memory network by using a pre-trained word vector as memory. The authors leveraged the attention mechanism for updating memory. Chen et al. [30] proposed the Recurrent Attention Memory (RAM) method, which exploits attention mechanisms and uses hidden states generated by LSTM as memory.
Simple attention networks can generate noisy features as well. This problem can be resolved using Capsule networks [31,32]. The capsule network can dynamically route the spatial features from the lower layer to the upper layers. The hidden vectors generated from the lower layer are considered as one capsule, and the upper layer features are considered as another capsule.
Another limitation of a simple attention network is its incapability to handle long-range dependencies. The syntactical structure of a sentence plays a crucial role in resolving the issue of long-range dependencies. However, most deep learning-based methods have not leveraged the syntactical structure of the sentence. There are limited hybrid methods that have incorporated syntactical knowledge along with deep neural networks using Graph Convolutional Networks (GCN) [33][34][35].
Another way of handling the syntactical information is with the usage of Graph Attention Network (GAT). Wang et al. [36] and Bai et al. [37] have leveraged Graph Attention networks for ALSC task. The GAT method proposed by Bai et al. [37] uses typed dependencies (i.e., dependency label or relation) for enhanced performance. The authors also utilized BERT (Bidirectional Encoder Representations from Transformers) [38] model to generate contextual embeddings. BERT is a pre-trained language model that has gained popularity in many NLP tasks, including ALSC. There are few attempts in literature exploring BERT for ALSC. The incorporation of BERT with the deep learning methods has shown promising results. Jiang et al. [31] leveraged BERT in embedding layer and encoding layer to generate contextual embeddings. Song et al. [39] utilized BERT embeddings as input to their proposed deep neural network. Yang et al. [40] used a pre-trained BERT model directly for prediction. Figure 4 summarizes the 35 deep learning methods statistically tested for significant performance in this study across multiple datasets of different domains. For a fair comparison, the methods based on domain-specific corpus or domain-specific embeddings are not dealt with in this study.

Data Modelling Procedure for Deep Learning-Based ALSC
In ALSC, it is desired to detect the polarity of aspects contained in a sentence. A sentence with aspects is transformed into machine-readable vector form as an input to a deep learning network. This input to the network varies according to the architecture of the deep learning method. Table 1 provides a brief description of various deep learning methods along with the inputs required by them.
To understand the different types of inputs, consider a sentence S with an aspect A: "But the wine list is excellent". A sentence S is a sequence of words {w_1, w_2, ..., w_n}; in this example, S is denoted as {'but', 'the', 'wine', 'list', 'is', 'excellent'}. An aspect A, which may consist of one or more words, is a subsequence of the sentence; here, A is denoted by {'wine', 'list'}. The context C is the part of the sentence other than the aspect; here, C is {'but', 'the', 'is', 'excellent'}. C_l and C_r are the left and right contexts, respectively; here, C_l is {'but', 'the'} and C_r is {'is', 'excellent'}. A dependency tree D_T is a directed graph showing the relationships between the different words of a sentence; the dependency tree for the example sentence is shown in Figure 5. A dependency graph D_G is an undirected graph, similar to a dependency tree, showing the relationships between the different words of a sentence. A dependency relation D_R is the relationship between two words in a sentence based on the dependency tree; it is also known as a typed dependency or a dependency label. In Figure 5, the dependency relations are the labels on the arcs: cc, det, compound, nsubj, and acomp. The location of the aspect, LOC_Aspect, is given by the starting and ending indices of the aspect in the sentence.

Brief descriptions from Table 1 include the following. LSTM [21] utilizes a Long Short Term Memory network to generate a hidden vector for the sentence, which is fed into a softmax layer to predict the sentiment polarity of an aspect. TD-LSTM (Target-Dependent LSTM) [21] utilizes two LSTMs, LSTM_L and LSTM_R, to collectively consider the left and right contexts of the target. TC-LSTM (Target-Connection LSTM) extends TD-LSTM with a target-connection mechanism that establishes the relation between the target and each context word. IAN (Interactive Attention Network) utilizes two separate attention-based LSTMs to capture the interaction between aspect and context words using a pooling layer. BERT-SPC is a simple pre-trained BERT model designed for the ALSC task. GAT leverages dependency relations by adopting a relational graph attention network for exchanging information between words based on the dependency tree.
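As a minimal illustration of these inputs (a sketch, not the authors' preprocessing code), the decomposition of a tokenized sentence into C_l, A, and C_r can be written as:

```python
def split_on_aspect(sentence, aspect):
    """Split a tokenized sentence into left context C_l, aspect A, and
    right context C_r -- the inputs used by models such as TD-LSTM.
    Raises ValueError if the aspect subsequence is not found."""
    k = len(aspect)
    for i in range(len(sentence) - k + 1):
        if sentence[i:i + k] == aspect:
            return sentence[:i], sentence[i:i + k], sentence[i + k:]
    raise ValueError("aspect not found in sentence")

# The running example from the text: "But the wine list is excellent".
cl, a, cr = split_on_aspect(
    ['but', 'the', 'wine', 'list', 'is', 'excellent'], ['wine', 'list'])
```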
The inputs to non-BERT methods are converted to vectors using GloVe embeddings [45], whereas BERT embeddings are used to convert the input to vectors in the BERT-based methods. In addition, BERT-based methods require the [CLS] and [SEP] tokens for starting and separating the input, respectively.
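As a sketch of this input format (token strings only; an actual BERT tokenizer would additionally produce wordpiece ids and segment masks), the sentence-pair layout commonly used by BERT-SPC-style models looks like:

```python
def bert_spc_input(sentence_tokens, aspect_tokens):
    """Sentence-pair input in the style used by BERT-SPC-like models:
    [CLS] sentence [SEP] aspect [SEP]."""
    return ["[CLS]"] + sentence_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]

tokens = bert_spc_input(['but', 'the', 'wine', 'list', 'is', 'excellent'],
                        ['wine', 'list'])
```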

Experimental Setup and Datasets
In Section 4.1, the details of datasets are provided. Section 4.2 presents the experimental settings. Section 4.3 provides the details of evaluation metrics, and finally, in Section 4.4 the procedure of statistical significance testing is explained.

Characteristics of Datasets
In this study, the experimental evaluation is carried out on 8 benchmark datasets of different domains: Restaurant14, Laptop14, Restaurant15, Restaurant16, Twitter, Sentihood, Mitchell, and MAMS. All the datasets except Sentihood are of 3-way polarity, meaning each aspect term can belong to the positive, negative, or neutral category. Table 2 shows the statistics regarding the number of positive, negative, and neutral samples in each dataset (train and test separately).
In the ALSC literature, the majority of the proposed deep learning methods are evaluated on the datasets released by the International Workshop on Semantic Evaluation (SemEval). The Restaurant14 and Laptop14 datasets released in the SemEval 2014 task [46] are the most popular. In continuation of the previous workshops, two more restaurant-domain datasets were released by SemEval 2015 [47] and SemEval 2016 [13], named Restaurant15 and Restaurant16. Another popular dataset in the ALSC literature is Twitter [48], derived specifically from tweets and also known as target-dependent sentiment classification data. The other three datasets included in this study are Mitchell [49], Sentihood [50], and MAMS [31], from the Twitter, neighborhood, and restaurant domains, respectively. The Mitchell dataset consists of tweets originally released for the English and Spanish languages; in this study, only the English sentences of the dataset are evaluated. The Sentihood data is obtained from the Yahoo platform and relates to aspects discussed in the neighborhoods of London; this dataset is of two-way polarity.
MAMS (Multi-Aspect Multi-Sentiment) is the latest dataset in the ALSC literature. It is obtained from the CitySearch New York dataset [51] by manually annotating the aspect terms in the sentences with their polarity. This dataset can be called challenging because each sentence has multiple aspects with different polarities, so handling the context-aspect relationship is more critical for any method. Although the other datasets in the ALSC literature also contain such multi-aspect sentences, their number is quite low. In addition, the MAMS dataset is larger than all other 7 datasets; Table 2 clearly shows that its size is more than double the size of the other datasets. Since the performance of deep learning methods can be better evaluated on datasets of large size, it is pertinent to check the performance of different deep learning methods on MAMS data.
Our study incorporates all 8 datasets discussed above. To the best of our knowledge, no other previous study in the ALSC literature has considered 35 deep learning methods on 8 datasets.

Experimental Design
In this experimental study, GloVe embeddings [45] are used for non-BERT methods, while pre-trained BERT embeddings are utilized for BERT-based methods. Dimensions are kept at 300 for both embeddings and hidden state vectors. The learning rate is set to 0.001. L2 regularization is used along with a drop-out rate of 0.1 to avoid overfitting. Weight matrices and biases are initialized by sampling from a uniform distribution U(−0.01, 0.01). The Adam optimizer is adopted for model training. The batch size is kept at 64, with a step size of 5. The selection of hyperparameters for our experimental study is based on existing research [7,33,37]. Furthermore, the architecture-specific hyperparameters for some deep learning methods are taken from the original works: the number of graph convolution layers in ASGCN and ASTCN is kept at 2; for CapsNet, the capsule size is 300; and for RGAT, GAT, and GAT-BERT, the Deep Biaffine parser [52] is used. The average scores of Accuracy and Macro F1 on test data are reported. The implementation is carried out using the PyTorch framework.
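The settings above can be collected into a single configuration dictionary (a summary sketch of the reported values, not the authors' actual configuration file; the L2 coefficient is not reported in this section):

```python
# Hyperparameters as reported in this section (summary sketch only).
config = {
    "embedding_dim": 300,          # GloVe / BERT embedding dimension
    "hidden_dim": 300,             # hidden state vector dimension
    "learning_rate": 0.001,
    "dropout": 0.1,
    "weight_init": (-0.01, 0.01),  # uniform U(-0.01, 0.01)
    "optimizer": "adam",
    "batch_size": 64,
    "step_size": 5,
    "asgcn_astcn_gcn_layers": 2,   # architecture-specific
    "capsnet_capsule_size": 300,   # architecture-specific
}
```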

Evaluation Metrics
In this study, Accuracy, Macro-F1 score, and Time (training time per epoch) are used as metrics for evaluating the performance of different deep learning methods. The evaluation metrics are discussed next.

Accuracy
Accuracy is a widely used and most intuitive evaluation measure used by researchers for any classification problem. In simple terms, it is just a ratio of correctly predicted observations to the total number of observations in the dataset. Mathematically, Accuracy is calculated using Equation (1).
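Equation (1) amounts to the following short function (a sketch, not the authors' implementation):

```python
def accuracy(y_true, y_pred):
    """Equation (1): correctly predicted observations / total observations."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

For example, with gold labels [1, 0, 2, 1] and predictions [1, 0, 1, 1], three of four observations are correct, giving an accuracy of 0.75.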

Macro-F1 Score
The ALSC problem discussed in this study is a multi-class classification problem with three classes viz. neutral, negative, and positive. For multiclass classification settings, a Macro-F1 score computes the individual class score independently before taking the average. This ensures that all classes are treated equally. The macro-F1 score is calculated using Equation (2), where MacroPrecision and MacroRecall are calculated by taking the class-wise average of precision and recall defined in Equations (3) and (4).
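Following Equations (2)-(4), the Macro-F1 score can be sketched as below: per-class precision and recall are averaged over the three classes, and the F1 score is taken over those macro averages (a sketch, not the authors' implementation):

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Equations (2)-(4): F1 over macro-averaged precision and recall."""
    precisions, recalls = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    mp = sum(precisions) / len(labels)   # MacroPrecision, Equation (3)
    mr = sum(recalls) / len(labels)      # MacroRecall, Equation (4)
    return 2 * mp * mr / (mp + mr) if mp + mr else 0.0
```

Because every class contributes equally to the macro averages, rare classes (such as neutral) weigh as much as frequent ones.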

Time
A significant criterion for performance that is widely ignored in previous research on deep learning-based ALSC, except in the work of Xu et al. [53], is the time taken to train a model. Training a deep learning-based model is usually time-consuming, and for this reason training time has also been taken into consideration in this benchmarking study. Training is stopped once maximum accuracy is reached. Thus, instead of using total training time, we use training time per epoch to compare the different methods: Time = training time per epoch. Training time may vary with the speed of the processor, so for a fair comparison, all methods are run on the same processor, an Nvidia Tesla K80 GPU.
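Measuring this metric is straightforward; a minimal sketch (assuming a `train_step` callable and an iterable of batches, both hypothetical names) is:

```python
import time

def train_one_epoch(train_step, batches):
    """Return the wall-clock training time for one epoch -- the Time
    metric used in this study (illustrative sketch only)."""
    start = time.perf_counter()
    for batch in batches:
        train_step(batch)
    return time.perf_counter() - start

# Trivial stand-in training step, just to show the call shape.
elapsed = train_one_epoch(lambda batch: None, range(100))
```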

Statistical Tests
This study uses statistical significance testing for empirical comparison of the performance of various deep learning methods used for ALSC.
To the best of our knowledge, no previous study in ALSC has so far used statistical significance testing for comparing the performance of various deep learning methods. The statistical significance testing procedure adopted in this study is based on the seminal work of Demsar [9]. The procedure involves the Friedman test and two Post hoc tests: 1. Nemenyi Test 2. Wilcoxon Test.

Friedman Test
Parametric tests require validation of assumptions regarding data distributions, while non-parametric tests are distribution-free [11]. The Friedman test is a non-parametric counterpart of ANOVA (Analysis of Variance). Since the assumptions of parametric tests cannot be guaranteed on our datasets, the Friedman test is used in this study. The purpose of performing the Friedman test is to determine whether there are any significant differences in the performance of different deep learning methods in ALSC. The Friedman test is applied to test the following statistical hypotheses.

Hypothesis 1 (H1).
The performance of deep learning methods is not significantly different with respect to Accuracy, Macro-F1 score, and Training Time, i.e., all deep learning methods perform alike in terms of these evaluation metrics.

vs.
Hypothesis 2 (H2). At least two of the investigated deep learning methods have significant differences in their performance with respect to Accuracy, Macro-F1 score, and Training Time.

The Friedman test is explained with the help of Equations (6) and (7). Let m be the number of deep learning methods and n the number of datasets; with R_j denoting the average rank of the j-th method, the test statistic of the Friedman test is calculated as:

χ²_F = (12n / (m(m + 1))) [ Σ_{j=1}^{m} R_j² − m(m + 1)² / 4 ]    (6)

F_f = ((n − 1) χ²_F) / (n(m − 1) − χ²_F)    (7)

where F_f follows the F-distribution with (m − 1) and (m − 1)(n − 1) degrees of freedom, with the critical value available in the F-distribution table [54]. If the value of F_f exceeds the critical value, the null hypothesis is rejected, leading to the conclusion that the performance of at least two deep learning methods is significantly different.
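The Friedman chi-square and its F-distributed form, following the procedure of Demsar [9], can be sketched as follows (scores are assumed higher-is-better; ties are ignored in this simple ranking, so tied scores would need an average-rank method instead):

```python
import numpy as np

def friedman_statistics(scores):
    """Friedman chi-square (Eq. 6) and its F form F_f (Eq. 7).
    scores: (n datasets x m methods) array; higher is better.
    Ties are not handled in this sketch."""
    n, m = scores.shape
    # Rank methods on each dataset: rank 1 = best.
    ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
    R = ranks.mean(axis=0)                       # average rank per method
    chi2 = 12 * n / (m * (m + 1)) * (np.sum(R**2) - m * (m + 1)**2 / 4)
    Ff = (n - 1) * chi2 / (n * (m - 1) - chi2)
    return chi2, Ff

# Toy example: 4 datasets, 3 methods (illustrative values only).
scores = np.array([[0.9, 0.8, 0.7],
                   [0.7, 0.9, 0.8],
                   [0.9, 0.8, 0.7],
                   [0.8, 0.9, 0.7]])
chi2, Ff = friedman_statistics(scores)
```

F_f is then compared against the F-distribution critical value with (m − 1) and (m − 1)(n − 1) degrees of freedom.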

Post Hoc Tests
A multiple comparison procedure is recommended when comparing more than two methods. When the null hypothesis of equivalent performance is rejected for multiple methods, post hoc tests are performed to find the significantly different methods. In this study, two post hoc tests are performed; they are discussed as follows:

Nemenyi Test
The Nemenyi test is a post hoc test performed after the Friedman test and is applied for the relative comparison of all classifiers evaluated in the study [9]. The performance differences of the various classifiers are checked against the value of the critical distance (CD) obtained using Equation (8):

CD = q_α √( m(m + 1) / (6n) )    (8)

where m is the number of classifiers, n is the number of datasets, and the value of q_α is based on the studentized range statistic of the Nemenyi test. The Nemenyi test can be well understood with the help of the critical distance diagrams presented in Section 5.2.
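Equation (8) is a one-line computation; the sketch below reproduces the critical distance value used in Section 5.2 for 35 methods on 8 datasets with q_α = 3.82 (roughly 19.6):

```python
import math

def nemenyi_cd(q_alpha, m, n):
    """Nemenyi critical distance, Equation (8): CD = q_a * sqrt(m(m+1)/(6n))."""
    return q_alpha * math.sqrt(m * (m + 1) / (6 * n))

# The setting used in Section 5.2: 35 methods, 8 datasets, q_a = 3.82.
cd = nemenyi_cd(3.82, m=35, n=8)
```

Two methods whose average ranks differ by less than CD are not significantly different under this test.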

Wilcoxon Test
It is also recommended to perform a pairwise comparison of classifiers based on the values of evaluation metrics obtained from experiments [55]. The Wilcoxon test is a non-parametric test useful for this purpose. The null hypothesis H_W0 of the Wilcoxon test is that the median difference between pairs of experimental methods is zero. The significance level α in hypothesis testing refers to the probability of rejecting a true null hypothesis; α is generally set to 0.05 in empirical studies [11]. The observed significance level is called the p-value. The null hypothesis can be rejected if the p-value is less than or equal to α, leading to the conclusion that a given pair of deep learning methods differs significantly in performance.
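A pairwise comparison of this kind can be run with `scipy.stats.wilcoxon`; the accuracies below are hypothetical, illustrative values (not results from this study), paired per dataset:

```python
from scipy.stats import wilcoxon

# Hypothetical paired accuracies of two methods on 8 datasets.
acc_a = [0.811, 0.790, 0.832, 0.741, 0.725, 0.861, 0.840, 0.803]
acc_b = [0.780, 0.772, 0.805, 0.732, 0.703, 0.847, 0.834, 0.778]

stat, p_value = wilcoxon(acc_a, acc_b)
reject_h0 = p_value <= 0.05   # methods differ significantly at alpha = 0.05
```

Here method A beats method B on every dataset, so the signed-rank statistic is 0 and the two-sided p-value falls below 0.05.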

Experimental Results and Analysis
This section presents the experimental results and statistical analysis of results. Section 5.1 presents the experimental results along with the discussion of results. Section 5.2 presents the statistical comparison and answers the RQ (Research question) posed in this study.

Discussion of Results
Tables 3-5 report the scores obtained by different deep learning methods with respect to Accuracy, Macro-F1 score, and Time. The top 10 best performing deep learning models for each dataset are highlighted in boldface.
Some observations from Tables 3 and 4 regarding the Accuracy and Macro F1 scores of deep learning models in ALSC are as follows. The best performing method across all datasets is GAT-BERT, with an average Accuracy and Macro F1 score of 0.8478 and 0.7334, respectively. The worst performer differs for every dataset, while GRU performs worst as per the average Accuracy and LSTM is the worst performer for the Macro F1 score. The average Accuracy score varies from 0.63 to 0.84, while the average range of the Macro F1 score is from 0. Though we have top performers based on the average Accuracy and Macro F1 score, there are only small numerical differences in the scores of the top 10 deep learning methods. Thus, it is imperative to carry out statistical significance tests for the empirical comparison of multiple deep learning models across various datasets on a scientific basis. Table 5 reports the time taken by the different models.
It is visible from Table 5 that the BERT-based methods take the most time. The time range for the non-BERT methods is broadly similar (1 to 5 s per epoch), except for RGAT, which is the worst performer in terms of time with an outlier value of 260 s per epoch.
One interesting finding from the results is that the top 10 performers in terms of Accuracy and Macro F1 score also take the most time, with a few exceptions: ASGCN, ASTCN, and TD-LSTM are among the top 10 best performers while not being very time-consuming.

Statistical Comparison of Deep Learning Methods in Aspect Level Sentiment Classification
Statistical comparison of the deep learning methods is carried out by applying the Friedman test as mentioned in Section 4. The null hypothesis of the Friedman test is that all deep learning methods perform equally in ALSC. The first step of the test is to assign each deep learning method a rank in ascending order according to its performance score on each dataset. The best method is assigned the lowest rank; thus, the lower the rank of a model, the better its performance. Next, an average rank is calculated for each deep learning method as the mean of its ranks over the multiple datasets.
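This ranking step can be sketched as follows (the accuracy values are hypothetical, illustrative numbers, not results from this study; ties are not handled in this simple ranking):

```python
import numpy as np

# Hypothetical accuracies: rows = datasets, columns = methods.
acc = np.array([[0.82, 0.79, 0.85],
                [0.76, 0.78, 0.80],
                [0.71, 0.70, 0.74]])

# Rank methods on each dataset: rank 1 = best.
ranks = np.argsort(np.argsort(-acc, axis=1), axis=1) + 1
avg_rank = ranks.mean(axis=0)   # lower average rank = better method
```

In this toy example the third method wins on every dataset, so its average rank is 1.0.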
The details of the Friedman test have already been explained in Section 4 with Equations (6) and (7). If the null hypothesis of the Friedman test is rejected, it is concluded that there is statistical evidence of significant differences between at least two deep learning methods in ALSC. The rejection of the null hypothesis is followed by two post hoc tests. To answer the research question, it is imperative to perform statistical significance testing to investigate the evidence of performance differences between the deep learning methods. This testing is carried out by applying the Friedman test followed by the post hoc tests. The degrees of freedom, with m = 35 (number of deep learning methods) and n = 8 (number of datasets), are (m − 1) = 34 and (m − 1)(n − 1) = 238. As per the statistical table of the F-distribution [54], the corresponding critical value of F_f for rejecting the null hypothesis of the Friedman test is 1.47.

Hypothesis 3 (H3).
There is no significant difference in the performance of 35 different deep learning methods in terms of Accuracy.

Hypothesis 4 (H4).
At least two of the deep learning methods compared have a significant difference in their performance in terms of Accuracy.

Hypothesis 5 (H5).
There is no significant difference in the performance of 35 different deep learning methods in terms of Macro-F1 score.
Hypothesis 6 (H6). At least two of the deep learning methods compared have a significant difference in their performance in terms of Macro-F1 score.

Hypothesis 7 (H7).
There is no significant difference in the performance of 35 different deep learning methods in terms of Time.

Hypothesis 8 (H8).
At least two of the deep learning methods compared have a significant difference in their performance in terms of Time.
The Friedman test is conducted using the ranks shown in Figures 6-8. The test statistics obtained from Equations (6) and (7) exceed the critical value, leading to the rejection of all three null hypotheses (H3, H5, and H7). Thus, the alternate hypotheses H4 (corresponding to null hypothesis H3), H6 (corresponding to null hypothesis H5), and H8 (corresponding to null hypothesis H7) are accepted. This implies that there are non-random and significant differences in the performance metrics of at least two of the 35 deep learning methods for ALSC. Thus, post hoc tests are required for the relative comparison of the deep learning methods.

Nemenyi Test Results.
The critical distance value for comparison of 35 deep learning methods over eight datasets is calculated using Equation (8). For q α = 3.82, the value of critical distance turns out to be 19.59. In Figures 6-8, the deep learning methods are plotted against their mean ranks and are placed in ascending order of their ranks. As per Figures 6 and 7, the best performer in terms of Accuracy and Macro F1 score is GAT-BERT and the worst performer is the simple LSTM method.
The lines falling inside the gray region in Figures 6 and 7 indicate methods that do not have significant performance differences in terms of Accuracy and Macro F1 score. Figures 6 and 7 reveal that the top performer GAT-BERT could not significantly outperform 18 deep learning methods in Accuracy and 19 deep learning methods in Macro F1 score. It is also observed from Figure 8 that GAT-BERT is amongst the worst three performers in terms of Time.
Thus, it is difficult to conclude better performing methods across all three evaluation criteria based on the Nemenyi test. To deal with this problem Pareto approach [56] for finding non-dominated sets of deep learning methods is applied in the next section.
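Under the usual form of the Nemenyi critical distance, assumed here to match Equation (8), CD = q_α · √(m(m+1)/(6n)). A brief sketch (the computed value is close to the 19.59 reported above; the small difference comes from rounding of q_α):

```python
import math

def nemenyi_cd(q_alpha: float, m: int, n: int) -> float:
    """Nemenyi critical distance for m methods compared over n datasets."""
    return q_alpha * math.sqrt(m * (m + 1) / (6 * n))

cd = nemenyi_cd(q_alpha=3.82, m=35, n=8)
# Two methods differ significantly only if their mean ranks
# differ by more than this distance.
print(round(cd, 2))
```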

Selection of deep learning methods based on non-dominated sets.
For the selection of the best performing deep learning method across all three evaluation criteria, the Pareto dominance concept is applied in this study. As per the Pareto dominance concept, one method m1 dominates another method m2 if and only if m1 is strictly better than m2 in terms of at least one of the evaluation criteria and m1 performs no worse than m2 in terms of all the evaluation criteria. The methods that cannot be dominated by any other method are called non-dominated methods. Figure 9a,b illustrate the Pareto dominance approach as applied in this study. Figure 9a shows the plot of Accuracy vs. Time, whereas Figure 9b shows the plot of Macro F1 score vs. Time. The objective is to maximize the Accuracy and Macro F1 score and to minimize the Time. As per the figures, ASGCN dominates TNET, TD-LSTM, ASCNN, and ASTCN. GAT-BERT dominates AEN-BERT, BERT-SPC, RGAT, and CapsNet. However, ASGCN and GAT-BERT are dominated by none of the other methods. Thus, as per the Pareto dominance approach, ASGCN and GAT-BERT are the two non-dominated methods. For a statistical comparison of these two non-dominated methods, the Wilcoxon test is performed, as discussed next.
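The dominance rule described above can be encoded directly. The sketch below uses hypothetical (Accuracy, Macro F1, Time) triples, not the study's measured values, to show how a non-dominated set is extracted:

```python
def dominates(a, b):
    """True if a dominates b: (accuracy, macro_f1, time) triples,
    maximizing the first two criteria and minimizing time."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better

def non_dominated(methods):
    """Return the set of methods dominated by no other method."""
    return {m for m, pa in methods.items()
            if not any(dominates(pb, pa)
                       for o, pb in methods.items() if o != m)}

# Hypothetical scores for illustration: (Accuracy, Macro F1, Time in s)
methods = {
    "ASGCN":    (0.81, 0.72, 120.0),
    "TNET":     (0.80, 0.71, 150.0),
    "GAT-BERT": (0.84, 0.76, 900.0),
    "AEN-BERT": (0.82, 0.74, 950.0),
}
print(sorted(non_dominated(methods)))  # ['ASGCN', 'GAT-BERT']
```

Here ASGCN dominates TNET on all three criteria and GAT-BERT dominates AEN-BERT, but neither ASGCN nor GAT-BERT dominates the other: one wins on Time, the other on Accuracy and Macro F1.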

Wilcoxon Test
As per the Pareto approach, ASGCN and GAT-BERT fall under the category of non-dominated methods. To further compare these two methods, a pairwise Wilcoxon test is performed. The significance level α is set at 0.05. If the observed p-value is less than 0.05, the null hypothesis HW0 of the Wilcoxon test is rejected.
To compare the two non-dominated methods across multiple evaluation criteria, the methods are tested one by one for significant differences on each criterion, in order of the priorities assigned to each criterion. This is accomplished by conducting a Wilcoxon matched-pair test on each criterion in order of priority. In this study, we assign first priority to Macro F1 score, second priority to Accuracy, and third to Time. If no statistically significant differences are observed in the performance of the two methods for the Macro F1 score criterion, then the Wilcoxon test is carried out for Accuracy and then for Time. The process is repeated until a clear winner is found or until all the criteria are exhausted. The results of the Wilcoxon test for the pairwise comparison of the ASGCN and GAT-BERT methods are shown in Table 6. It can be observed from Table 6 that there are no significant performance differences in the Accuracy and Macro F1 scores of GAT-BERT and ASGCN. However, the Wilcoxon test for Time testifies that ASGCN performs significantly better than GAT-BERT in terms of Time.
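A minimal sketch of this pairwise procedure, with made-up per-dataset scores and SciPy's `wilcoxon` as the matched-pair test (the numbers below are illustrative, not the study's results):

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset results for the two methods over 8 datasets
asgcn_acc    = [0.814, 0.751, 0.785, 0.802, 0.763, 0.824, 0.791, 0.778]
gatbert_acc  = [0.810, 0.752, 0.780, 0.804, 0.760, 0.830, 0.784, 0.786]
asgcn_time   = [120, 95, 150, 110, 130, 88, 105, 140]      # seconds
gatbert_time = [910, 870, 1020, 905, 980, 860, 890, 1005]

# Accuracy criterion: differences are small and mixed in sign
p_acc = wilcoxon(asgcn_acc, gatbert_acc).pvalue

# Time criterion: ASGCN is faster on every dataset
p_time = wilcoxon(asgcn_time, gatbert_time).pvalue

print(p_acc > 0.05, p_time < 0.05)  # True True: no significant Accuracy
                                    # gap, but a significant Time gap
```

With eight paired observations and no ties, SciPy uses the exact Wilcoxon distribution, so the small sample size is handled correctly.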

Conclusions and Future Work
In this study, we investigated the performance differences of a wide range of 35 different deep learning methods in ALSC through a statistical comparison framework [9] utilizing eight ABSA datasets. The studied deep learning methods include RNNs, CNNs, Memory Networks, and hybrid networks. The methods utilizing BERT pre-trained models, known as BERT-based methods, are also evaluated in this study.
Although the average numerical Accuracy and Macro F1 scores of BERT-based methods are higher than those of the non-BERT-based methods, no statistically significant differences could be observed between the top-ranking GAT-BERT method and several non-BERT methods based on the post hoc tests. However, the experimental results with respect to Time reveal that BERT-based methods require substantially more training time; the GAT-BERT method is the second-worst performer in terms of Time.
Thus, the results of the post hoc tests could not lead to any concrete conclusion regarding the selection of deep learning methods with respect to multiple performance metrics. To deal with this problem, we applied the Pareto dominance approach to select methods that perform optimally with respect to various performance metrics. The Pareto dominance approach revealed that GAT-BERT and ASGCN are the only two non-dominated methods amongst the top 10 accurate methods. The reason behind the good performance of both methods is the usage of syntactical information. Both methods utilize the dependency graph of the input sentence, but their underlying architectures are different: GAT-BERT uses a Graph Attention Network, whereas ASGCN uses a Graph Convolution Network. Furthermore, GAT-BERT also leverages dependency labels along with the dependency graph. Moreover, GAT-BERT generates contextual embeddings with the help of the BERT pre-trained model. This utilization of BERT for generating contextual embeddings penalizes the GAT-BERT model in terms of Time.
To select from these two non-dominated methods, we applied the Wilcoxon test for pairwise comparison of GAT-BERT and ASGCN on the eight datasets. On the basis of the Wilcoxon test, GAT-BERT could not outperform ASGCN on the Accuracy and Macro F1 score metrics, whereas ASGCN outperformed GAT-BERT on the Time metric. This enabled the selection of ASGCN as the optimal method with respect to multiple performance metrics.
ABSA is very important for e-commerce business organizations, and deep learning technology for ALSC is evolving at a rapid pace. Selecting the right method is crucial for retaining customers, and the results of our study will aid business managers in selecting superior methods from a wide range of deep learning methods. If the performance of two methods is not significantly different in terms of Accuracy and Macro-F1 score, then excessive training time for either model is undesirable. To this end, this study evaluates the performance of the 35 deep learning methods in terms of training time as well. As per this study, ASGCN is an optimal method with better performance in terms of Accuracy and Macro F1 score without compromising on time. ASGCN leverages the syntactical information of the input sentence, which ensures its better performance across datasets of multiple domains.
The contributions of this study are as follows. After an in-depth review, it has been concluded that this is the first study to perform an extensive statistical comparison of the performance of various deep learning methods on eight datasets for ALSC.
For cost-effective research in deep learning-based ALSC, it is essential to choose a method with good performance and low training time. Motivated by this fact, this is the first study (to the best of our knowledge) in which training time has been considered for evaluating the effectiveness of deep learning methods.
Our study also establishes a framework for validating the performance of new and alternative methods in ALSC for future research in this area. It is worthwhile to note that Hidden Markov Model (HMM) and genetic algorithm hybrids have proven to perform better than baseline models in coarse-grained sentiment analysis [57], but researchers have not yet explored HMM models in the ALSC literature. Thus, it would be interesting to perform a statistical comparison of HMM hybrids with deep learning hybrid methods for ALSC as future work.
This study has, however, certain limitations. One such limitation is the small number and size of the datasets used to evaluate the various deep learning methods. Most of the existing research has used a maximum of five datasets. Although we have considered eight datasets, a larger number of datasets from different domains is desired for better evaluation. Therefore, in the future, it would be interesting to propose datasets that are large and belong to different domains.
Another limitation is related to the hyper-parameter tuning of the discussed methods. The values of the hyper-parameters in our study have been taken from the original works. However, for a better evaluation of such methods, hyper-parameter tuning is desired, which is not a trivial task: the hyper-parameter tuning of deep learning architectures incurs a huge computational cost and is time-consuming. Thus, this limitation could be investigated in future work.
While deep learning-based methods have transformed research in ALSC, the surge of such methods is hampered by the opacity of their architectures and the high computational cost of the more complex ones. Globally, researchers have started emphasizing model interpretability and model complexity as well; however, the ALSC literature lacks such contributions. Our study has shown that the performance difference between most of the methods is insignificant. Thus, rather than improving the Accuracy or F1 score by an insignificant margin, researchers should focus on building less complex and highly interpretable solutions for ALSC.