Aspect‑Based Sentiment Analysis in Hindi Language by Ensembling Pre‑Trained mBERT Models

Abstract: Sentiment Analysis is becoming an essential task for academics, as well as for commercial companies. However, most current approaches only identify the overall polarity of a sentence, instead of the polarity of each aspect mentioned in the sentence. Aspect-Based Sentiment Analysis (ABSA) identifies the aspects within a given sentence and the sentiment expressed for each aspect. Recently, the use of pre-trained models such as BERT has achieved state-of-the-art results in the field of natural language processing. In this paper, we propose two ensemble models based on multilingual BERT, namely, mBERT-E-MV and mBERT-E-AS. Using different methods, we construct an auxiliary sentence from the aspect and convert the ABSA problem into a sentence-pair classification task. We then fine-tune different pre-trained BERT models and ensemble them for a final prediction based on the proposed model; we achieve new, state-of-the-art results for datasets belonging to different domains in the Hindi language.


Introduction
Sentiment analysis (SA) [1][2][3][4] is one of the most critical tasks in the field of natural language processing (NLP). It is the process of analyzing and summarizing users' opinions and emotions as expressed in a sentence. In simple words, in SA, we determine whether the underlying sentiment in a piece of text is positive, negative, or neutral. SA has gained attention in academia and business, particularly for identifying customer satisfaction with products and services from the online reviews posted on websites such as Flipkart, Amazon, etc.
In the last decade, the popularity of e-commerce websites among consumers has increased tremendously. Users are sharing their experiences online regarding the products and services that they have used. There is a steep increase in the number of online reviews that are being posted daily. These opinions and feedback act as a measure for the goodness of the products and services. Reading all these reviews is time-consuming and analyzing them is practically challenging. Therefore, there is a need for automation to effectively maintain and analyze these reviews. Reviews are used for decision-making by the organizations as well as the consumers. Consumers decide or confirm which products to buy based on the reviews, while organizations tend to improve or develop new products, plan marketing strategies and campaigns, etc.
Early studies in this field were centered only on detecting the overall polarity of a sentence, irrespective of the entities to which they referred (e.g., mobiles) and their aspects (camera, display, etc.). The basic assumption behind this task is that there is a single overall polarity for the whole review sentence. The sentence, however, can include various aspects, e.g., "This mobile comes with 6.53-inch AMOLED display which is pretty good but the 16MP camera disappoints". The polarity of the aspect 'display' is Positive, while the polarity of the aspect 'camera' is Negative.

Table 1. An example for the ABSA task. The first subtask is to identify the two aspects present in the sentence: one is Food; the other is Service. The second subtask is to predict the polarity of both aspects. The polarity of Food is negative, while the polarity of Service is positive.
Sentence: I liked the service and the staff, but the pizza was below par.

Aspect Category    Sentiment Polarity
Food               Negative
Service            Positive

Recently, pre-trained language models, such as ELMo [5], OpenAI GPT [6], and BERT [7], have demonstrated their efficacy in solving many natural language-processing problems. BERT has performed exceptionally well on Question Answering (QA) and Natural Language Inference (NLI) tasks [7], both of which are sentence-pair classification tasks. However, direct application of the BERT model does not result in significant improvements in the ABSA task. The authors in [8] assumed that this was due to the unsuitable application of the BERT model.
Before the introduction of Transformers [9] in 2017, language models mainly used RNNs and CNNs to perform NLP tasks. The Transformer is a significant improvement as it doesn't need text to be processed in any predetermined order. Also, Transformers allow training on a massive amount of data in very little time. They are the basis for models like BERT.
Word embedding models like GloVe [10], and word2vec [11] map each word to a vector that tries to represent some aspects of the word's meaning. Word embeddings are useful for many NLP tasks, but some limitations prevent them from being used. There is a limitation to what these word models can capture, as they are not trained on deep modeling tasks, so they cannot effectively represent the negation of words and word combinations. Another significant flaw is that these models ignore the context of the words. For example, the word 'bank' has different meanings in the sentences "He opened a new account in the bank" and "A dead body was found on the bank of the river". However, embedding methods will assign the same vector for the word 'bank' in both sentences, so a single vector is forced to capture both meanings.
The above drawbacks motivated context-based language models, which train a neural network to assign a vector to a word based on either the surrounding context or the entire sentence. For example, in the sentence "He opened a new account in the bank", 'account' is represented based on the word's context. A unidirectional model represents 'account' based on "He opened a new" but not "in the bank". However, a bidirectional contextual model represents 'account' using the full context "He opened a new … in the bank".
OpenAI GPT, ELMo, and BERT are examples of transfer-learning-based models. OpenAI GPT and ELMo were previous state-of-the-art contextual pre-training methods. OpenAI GPT is unidirectional, based on a unidirectional Transformer. ELMo is shallowly bidirectional: two LSTMs are trained independently, one left-to-right and the other right-to-left, and the learned embeddings are then concatenated to generate the features used in downstream tasks. Only BERT is deeply bidirectional. In BERT, representations are learned based on both left and right contexts. ELMo is a feature-based approach, while the other two are fine-tuning approaches.
In this paper, we propose two ensemble models based on multilingual BERT, namely, mBERT-E-MV and mBERT-E-AS. As mBERT can take a single sentence or a pair of sentences as input, we transform the ABSA task into a sentence-pair classification task by constructing an auxiliary sentence using the aspect. Then, we fine-tune different pre-trained mBERT models, one for each auxiliary sentence construction method, based on the newly generated task. Finally, we ensemble the models using majority voting and average scoring for the final prediction, and achieve state-of-the-art results on datasets belonging to different domains in the Hindi language.
The main contributions of the paper are as follows:
1. To the best of our knowledge, this is the first time a transfer-learning-based method has been used for aspect-based sentiment analysis in an Indian language;
2. The proposed methodology can be treated as a baseline for solving further problems involving Indian languages.
The rest of the paper is organised as follows. Section 2 summarizes the relevant works in the field of aspect-based sentiment analysis. Section 3 discusses the methodology of the proposed framework. Section 4 presents the datasets used in the experiments and the experimental results. The paper concludes with the derived conclusions and the scope for the future presented in Section 5.

Related Work
In prior works for ABSA, methods related to machine learning were dominant [12,13]. They were primarily concentrated on the extraction of hand-crafted lexical and semantic features [14]. The authors [15] proposed sentiment-specific word embeddings. Such feature-engineering-based studies require professional-level knowledge in linguistics and have limitations regarding the achievement of the best possible performance. An SVM-based model was proposed in [16], which used word-aspect-association lexicons for sentiment classification. The authors [17] proposed a multi-kernel approach for aspect category detection. Previous aspect-based techniques did not appropriately adapt general lexicons in the context of aspect-based datasets, resulting in reduced performance. The authors in [18] presented extensions of two lexicon generation methods to handle this problem: one using a genetic algorithm and the other using statistical methods. They combined the generated lexicons with well-known static lexicons to categorize the aspects in reviews.
Neural networks can dynamically extract features without feature engineering and can transform the original features into continuous, low-dimensional vectors; because of this ability, they have been gaining huge popularity in ABSA. The sentences and aspects were independently modeled using two separate LSTM models in [19]. Then, a pooling operation was performed to measure the attention given to the sentences and aspects. In recent years, with the increased use of attention mechanisms in deep learning models, many researchers have incorporated them into RNNs [20][21][22], CNNs [23], and memory networks [22,24]. This enables the model to learn various attention distribution levels for different aspects, as well as create attention-based embeddings. The authors [22] proposed the use of delayed, context-aware updates with a memory network. Context-aware embeddings were generated using interaction-based embedding layers in [25]. To handle the complications and increase the expressive power of LSTM, several attention layers were used with LSTM in [20,26]. In [21], Attention-Based LSTM with Aspect Embedding (ATAE-LSTM) was proposed, which focused on identifying the sentiment-carrying words that were relatively correlated with the entity or target.
Most recently, authors have used transfer-learning-based models. BERT has been used in various papers [27,28] to produce contextualized embeddings for input sentences, which were subsequently used to identify the sentiment for target-aspect pairs. The authors in [29,30] used BERT as the embedding layer, while the authors in [31] used a fine-tuning approach for BERT, with an additional layer acting as the classification layer. BERT was fine-tuned for Targeted Aspect-Based Sentiment Analysis (TABSA) in recent works [32,33] by altering the top-most classification layer to include the targets and aspects. Instead of utilizing the top-most classification layer of BERT, the authors in [34] investigated the possibility of using the semantic knowledge present in BERT's intermediate layers to improve BERT's fine-tuning performance. The authors [8] proposed the construction of sentences from the target-aspect pairs, before feeding them to BERT, to fully utilize the power of BERT models. However, BERT's input format is limited to a sequence of words that cannot provide more contextual information. To overcome this issue, the authors in [35] introduced GBCN, a new method that enhances and controls the BERT representation for ABSA by combining a gating mechanism with context-aware aspect embeddings. The input texts are first fed into the BERT and context-aware embedding layers, resulting in independent BERT representations and refined context-aware embeddings. These refined embeddings contain the most relevant information selected from the context. The flow of sentiment information between these context-aware vectors and the output of the BERT encoding layer is then dynamically controlled by employing gating units.
However, these works were mainly carried out in the English language. For Indian languages, most of the existing works aim to classify sentiment at either the sentence or the document level. ABSA in Indian languages is still an open challenge, as minimal resources are available, and hardly any significant work has been performed in this field. The authors [36] used different models such as Decision Tree, Naive Bayes, and a sequential minimal optimization implementation of SVM (SMO) to solve the ABSA problem in the Hindi language. They used lexical features such as n-grams, non-contiguous n-grams, and character n-grams, together with PoS tags and a semantic orientation (SO) score [37], for polarity classification.
The author [38] showed the relationship between affective computing and sentiment analysis, whose primary tasks are emotion recognition and polarity detection. These can enhance customer relationship management and recommendation systems, for example, by revealing which features customers enjoy, or which negatively reviewed items should be excluded from recommendations. In [39], the authors surveyed the current approaches and tools for multilingual sentiment analysis. In addition to the challenge of understanding the formal textual content, it is also essential to consider informal language, which is often coupled with localized slang, to express 'true' feelings.
The authors proposed BabelSenticNet [40], the first multilingual concept-level knowledge base for sentiment analysis. The system was tested on 40 languages, proving the method's robustness and its potential for utility in future research.
The authors [41] proposed an attention-based bidirectional CNN-RNN deep model for sentiment analysis (ABCDM). The effectiveness of ABCDM was evaluated on five review datasets and three Twitter datasets, showing that ABCDM achieves state-of-the-art results for both long-review and short-tweet polarity classification.
In [42], the authors proposed a multi-task ensemble [43] framework of three deep learning models (i.e., CNN, LSTM, and GRU) and a hand-crafted feature representation for the predictions. The experimental results suggest that the proposed multi-task framework outperformed the single-task frameworks in all experiments.

Proposed System
We propose two ensemble models, namely, mBERT-Ensemble-Majority Vote (mBERT-E-MV) and mBERT-Ensemble-Average Score (mBERT-E-AS). Figure 1 shows the workflow for both the ensemble methods. At first, auxiliary sentences are constructed from the aspect information using four auxiliary sentence construction methods, namely Natural Language Inference-Multi (NLI-M), Question Answering-Multi (QA-M), Natural Language Inference-Binary (NLI-B), and Question Answering-Binary (QA-B), which are discussed in the following subsection. Then, the constructed auxiliary sentence and the input review sentence are fed to the WordPiece tokenizer, which breaks the two sentences into a stream of tokens. Segment embeddings and position embeddings are then added to the token embeddings. Then, the generated token sequences are fed to four different mBERT models for fine-tuning. After fine-tuning, each mBERT model predicts the output label and softmax scores for every label. The final step is to aggregate the four predictions and make a final prediction. The two ensemble models are similar to each other except for this aggregation step. The mBERT-E-MV model makes the final label prediction based on the majority of the output labels. In contrast, the mBERT-E-AS model averages the output softmax scores of the four mBERT models and outputs the label's highest softmax score.

Auxiliary Sentence Construction
To obtain better results from the mBERT model, we transform the ABSA task into a sentence-pair classification task. Apart from the original input text review, we need to add an auxiliary sentence for each input sentence. We use the following four methods, proposed in [8] to construct an auxiliary sentence.

QA-M
In this method, we generate the auxiliary sentence as a question about the aspect. The format of the question should be the same for all the auxiliary sentences. For example, if the aspect is price, then the question generated can be "what do you think of the price?".

NLI-M
In this method, we do not generate a full standard sentence but a simple pseudo-sentence. The format for this is much simpler. For example, if the aspect is price, the auxiliary sentence is: "price".

QA-B
In this method, we also use the label information while creating the auxiliary sentence. The ABSA problem is thereby temporarily transformed into a binary classification problem with the labels {yes, no}. For each aspect, three sequences have to be generated. For example, suppose the aspect is price; then, the three sequences are "the polarity of the aspect price is positive", "the polarity of the aspect price is negative", and "the polarity of the aspect price is none". The sequence for which we obtain the highest probability for 'yes' determines the predicted category.

NLI-B
This method is similar to QA-B, with the only difference being that we generate a pseudo-sentence instead of a standard sentence. For example, if the aspect is price, the auxiliary sentences are: "price-positive", "price-negative", and "price-none".
All four methods are summarized in Table 2.

Table 2. The form of auxiliary sentences generated using the auxiliary sentence construction methods and the expected output labels for the newly generated sentence-pair classification task.
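The four construction methods can be sketched as follows. The function names are illustrative, and the English templates are stand-ins taken from the examples above; the actual templates for the Hindi datasets would be phrased in Hindi.

```python
# Sketch of the four auxiliary-sentence construction methods from [8].
# Each function returns the list of auxiliary sentences generated for one aspect.

POLARITIES = ["positive", "negative", "none"]

def qa_m(aspect):
    # QA-M: one question per aspect; the label is the sentiment class
    return [f"what do you think of the {aspect} ?"]

def nli_m(aspect):
    # NLI-M: a pseudo-sentence containing only the aspect
    return [aspect]

def qa_b(aspect):
    # QA-B: one full sentence per (aspect, polarity); the label is yes/no
    return [f"the polarity of the aspect {aspect} is {p}" for p in POLARITIES]

def nli_b(aspect):
    # NLI-B: a pseudo-sentence per (aspect, polarity); the label is yes/no
    return [f"{aspect}-{p}" for p in POLARITIES]
```

Note that QA-M and NLI-M produce one sentence pair per aspect, while QA-B and NLI-B triple the number of pairs, one per candidate polarity.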

Fine-Tuning Pre-Trained mBERT
Since mBERT is already pre-trained on a large corpus, we can now fine-tune it for the ABSA task. Next, we discuss the input data representation and the fine-tuning process.

Input Representation
We feed the original review and the constructed auxiliary sentence to the WordPiece embedder that converts the two sentences into a sequence of tokens. A unique classification token ([CLS]) is present at the first position of each sequence, which is used for the classification task. Two separating tokens ([SEP]) are also added: one after the tokens corresponding to the original review, and another after those corresponding to the auxiliary sentence. The first ([SEP]) token acts as the separator for the two sentences, and the second ([SEP]) token signifies the end of the token sequence.
In mBERT, two sentences are fed into the model at a time, so segment embeddings are needed to tell the model how to distinguish between the two inputs in a given pair. Suppose the two sentences are "My dog is cute" and "He likes playing". This layer has only two vector representations: all tokens that belong to the first sentence are assigned the first vector, while all tokens that belong to the second sentence are assigned the second vector.
mBERT is based on Transformers, which do not encode the sequential information of the input [9]. For the input text "I do, what I think", both occurrences of 'I' should have different vector representations; that is why positional embeddings are required. mBERT learns a vector representation for each position during training. Every word present at the same position has the same position embedding. Hence, for input texts like "Good job" and "Well done", both 'Good' and 'Well' have the same position embedding; similarly, 'job' and 'done' have the same representation.
The relevant token, segment, and position embeddings are added together to form the input representation for a given token.
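The packing of a sentence pair described above can be sketched as follows. Real WordPiece tokenization splits rare words into subword pieces; whitespace splitting is used here only to keep the sketch self-contained, and the function name is illustrative.

```python
# Illustrative sketch of how a sentence pair is packed for mBERT:
# [CLS] review tokens [SEP] auxiliary tokens [SEP], with segment id 0
# for the first sentence, segment id 1 for the second, and position
# ids 0..n-1 over the whole sequence.

def pack_sentence_pair(review, auxiliary):
    tok_a = review.split()      # stand-in for WordPiece tokenization
    tok_b = auxiliary.split()
    tokens = ["[CLS]"] + tok_a + ["[SEP]"] + tok_b + ["[SEP]"]
    segment_ids = [0] * (len(tok_a) + 2) + [1] * (len(tok_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, seg, pos = pack_sentence_pair("My dog is cute", "He likes playing")
```

Each position in the packed sequence then receives the sum of its token, segment, and position embeddings as input to the model.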

Fine-Tuning Procedure
Fine-tuning mBERT is pretty straightforward. The final hidden state corresponding to the [CLS] token is considered the fixed-dimensional pooled representation of the input sequence. We denote this vector as C ∈ R^H, where H is the size of the hidden state. It is fed to a classification layer with parameter matrix W ∈ R^(K×H), where K denotes the number of labels. Finally, the softmax function P = softmax(CW^T) is used to calculate the probability of each label.
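The classification head can be written out directly; this is a minimal numpy sketch with randomly initialized weights standing in for the trained parameters.

```python
# The pooled [CLS] vector C (size H) is projected by W (K x H)
# and softmax-normalized to give a probability per label.
import numpy as np

def classify(C, W):
    logits = C @ W.T                    # shape (K,)
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

rng = np.random.default_rng(0)
H, K = 768, 4          # hidden size and number of labels, as in the paper
P = classify(rng.standard_normal(H), rng.standard_normal((K, H)))
```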

Ensembling mBERT Models
After fine-tuning, each mBERT model predicts the output label and softmax scores for every label. The final step is to aggregate the four predictions and make a final prediction. The aggregation process uses two methods: one is based on majority voting, the other is based on average scoring. The mBERT-E-MV model makes the final label prediction based on the majority of the output labels, while the mBERT-E-AS model averages the output softmax scores of the four mBERT models and outputs the label with the highest softmax score.
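The two aggregation schemes can be sketched as follows; the label set and scores are illustrative, and tie-breaking in the majority vote (possible with four voters) is left to `Counter.most_common`, which the paper does not specify.

```python
# Sketch of the two ensemble aggregation schemes.
from collections import Counter
import numpy as np

LABELS = ["positive", "negative", "neutral", "conflict"]

def majority_vote(predicted_labels):
    # mBERT-E-MV: the most frequent label among the four model predictions
    return Counter(predicted_labels).most_common(1)[0][0]

def average_score(softmax_scores):
    # mBERT-E-AS: average the four softmax vectors, then take the argmax
    mean = np.mean(softmax_scores, axis=0)
    return LABELS[int(np.argmax(mean))]

votes = ["positive", "positive", "negative", "positive"]
scores = [[0.6, 0.2, 0.1, 0.1],
          [0.5, 0.3, 0.1, 0.1],
          [0.2, 0.6, 0.1, 0.1],
          [0.7, 0.1, 0.1, 0.1]]
```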

Dataset Description
For the aspect category detection and sentiment classification task, there was no dataset available for the Indian languages; therefore, the authors in [36] introduced the IIT-Patna Hindi Reviews dataset to facilitate research in this field for Indian languages. They collected user reviews from various Hindi websites and annotated them manually using a pre-defined set of aspect categories. The reviews belong to four different domains, which are discussed below.

Electronics

Table 3 shows the distribution of instances for each aspect and polarity in the Electronics domain. Some examples of input sentences and output labels for this domain are presented in Table 4. Example: "If we talk about quality, then the camera is not very special, but given the price at which this device is available, the camera quality is good." Labels: (Price, Positive), (Hardware, Conflict)

Mobile Apps
The Mobile Apps domain contains 197 reviews for various mobile apps. The four pre-defined aspect categories for this domain are: 'Price', 'Ease of use', 'GUI' and 'Miscellaneous'. The aspect 'GUI' refers to the graphical user interface of the app. For every aspect, a polarity class among 'Positive', 'Negative', 'Neutral' and 'Conflict' is also provided. Table 5 shows the distribution of instances for each aspect and polarity. Some examples of input sentences and output labels for this domain are presented in Table 6. Example: "The layout of this application is very clean, and there is a facility to paste the download link directly into the application."

Travel
The Travel domain contains 565 reviews for different tourist places. The four predefined aspect categories for this domain are: 'Place', 'Reachability', 'Scenery' and 'Miscellaneous'. The aspect 'Reachability' signifies the convenience in reaching the destination. For every aspect, a polarity class among 'Positive', 'Negative', 'Neutral' and 'Conflict' is also provided. Table 7 shows the distribution of instances for each aspect and polarity. Some examples of input sentences and output labels for this domain are presented in Table 8.

Movies
The Movies domain contains 878 reviews for different movies. The four pre-defined aspect categories for this domain are: 'Story', 'Performance', 'Music' and 'Miscellaneous'. The aspect 'Performance' covers various facets of the movie, such as acting, direction, etc. For every aspect, a polarity class among 'Positive', 'Negative', 'Neutral' and 'Conflict' is also provided. Table 9 shows the distribution of instances for each aspect and polarity. Some examples of input sentences and output labels for this domain are presented in Table 10.

Example (Travel): "One has to go through a steep and winding route to reach this fort." Labels: (Reachability, Negative)

Example (Movies): "Even the acting of the actors of this film is such that it seems that they are more energetic than necessary while intoxicated and their enthusiasm is irritating." Labels: (Performance, Negative)

Result Analysis
For fine-tuning, we use the multilingual-cased BERT-base model, pre-trained, on the Hindi datasets. For this model, the number of transformer blocks and the number of self-attention heads are 12 each, the size of the hidden layer is 768, and the total number of parameters is 110 M. The dropout probability is set at 0.1 while fine-tuning. The optimizer used is 'Adam' and the activation function is 'gelu'. Table 11 summarizes the different hyperparameter values used in the experiments. All experiments were conducted on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20 GHz with 96 GB RAM, and an NVIDIA Quadro P5000 graphics card with 16 GB memory. The results for the various datasets are presented in the following sections.
For each domain in the IIT-Patna Hindi Reviews dataset, a separate mBERT ensemble model is trained. For the experiments, all four datasets are split into training and testing sets in the ratio of 4:1. The results obtained for each domain are presented in the following subsections. We use the results reported in [36] for comparison purposes. The authors in [36] used two techniques, (i) the binary relevance approach and (ii) the label powerset approach, to solve the multi-label aspect category detection subtask. In the binary relevance approach, n distinct models are first built, one for each of the n unique labels; the final prediction is then produced by combining the predictions of the n models. In the label powerset approach, in contrast, each label combination is treated as a unique label. The model is then trained and evaluated on these labels.
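The two multi-label reductions can be contrasted with a toy sketch; the aspect set and instance below are illustrative, not drawn from the dataset, and the function names are ours.

```python
# Toy contrast of the two multi-label reductions used in [36].

ASPECTS = ["Price", "Hardware", "Software", "Miscellaneous"]

def binary_relevance_targets(instance_labels):
    # Binary relevance: one binary target per aspect, so n independent
    # classifiers are trained and their predictions are combined.
    return {a: (a in instance_labels) for a in ASPECTS}

def label_powerset_target(instance_labels):
    # Label powerset: each label combination becomes one multi-class
    # target, so a single classifier is trained over the combinations.
    return "+".join(sorted(instance_labels))

y = {"Price", "Hardware"}
```

Binary relevance scales linearly in the number of labels but ignores label correlations; label powerset captures correlations but can only predict combinations seen in training.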
For all the datasets, we have compared our ensemble models with the best-performing individual mBERT model.

Electronics
The results obtained for the Electronics domain are presented in Table 12. It was observed that both mBERT-E-MV and mBERT-E-AS models achieved much better results than the previous models on both the subtasks. The mBERT-E-MV model achieved the best F1-score on the aspect category detection task while mBERT-E-AS achieved the best accuracy on the aspect polarity classification task.

Mobile Apps
The results obtained for the Mobile Apps domain are presented in Table 13. Both mBERT-E-MV and mBERT-E-AS models achieved better accuracy than the previous models on the aspect polarity classification task. However, the mBERT-E-AS model fails to surpass the F1-score value obtained by the Naive Bayes model for the aspect category detection task. Overall, mBERT-E-MV turns out to be the best performer in both the subtasks.

Travel

The results obtained for the Travel domain are presented in Table 14. Both mBERT-E-MV and mBERT-E-AS models achieved much better results than the previous models for both subtasks. The mBERT-E-MV model is the best performer for the aspect category detection task, while mBERT-E-MV and mBERT-E-AS perform equally well on the aspect polarity classification task. Overall, for this domain, the mBERT-E-MV model performed better than the other models.

Movies
The results obtained for the Movies domain are presented in Table 15. From the table, it can be observed that the mBERT-E-MV and mBERT-E-AS models achieved better results than the previous models for the aspect category detection task. However, they failed to surpass the results obtained by the DT and SMO models in the aspect polarity classification task by a significant margin. Among the ensemble models, mBERT-E-MV performed better in the aspect category detection task, while mBERT-E-AS achieved better results for the aspect polarity classification task.

Conclusions and Future Work
This paper proposes two ensemble models based on multilingual BERT, namely, mBERT-E-MV and mBERT-E-AS. Our proposed models outperformed the existing state-of-the-art models on Hindi datasets. On the IIT-Patna Hindi Reviews dataset, mBERT-E-MV reports F1-scores of 74.26%, 59.70%, 63.74% and 79.08% on the aspect category detection task in the Electronics, Mobile Apps, Travel and Movies domains, respectively. It reports accuracies of 69.95%, 51.22%, 75.47% and 78.09% on the aspect polarity classification task for the four respective domains. Similarly, mBERT-E-AS reports F1-scores of 73.38%, 52.31%, 59.65% and 78.61% on the aspect category detection task for the respective domains. It reports accuracies of 70.49%, 48.78%, 75.47% and 79.77% on the aspect polarity classification task for the four respective domains.
Overall, BERT-based models performed much better than the other models. This is possible because of the construction of auxiliary sentences from the aspect information, which is analogous to exponentially increasing the size of the dataset: a sentence s_i in the original dataset is transformed into the pairs (s_i, a_1), (s_i, a_2), …, (s_i, a_{n_a}) in the sentence-pair classification task, where n_a is the number of aspects. The BERT model has an additional advantage in handling sentence-pair classification tasks, which is evident from its impressive improvement on the QA and NLI tasks. This improvement comes from both the unsupervised Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks, which are used to pre-train the BERT model [7]. In MLM, a word in a sentence is masked, and the model is trained to predict the masked word based on its context. In NSP, the model is trained to predict whether the two input sentences are connected logically/sequentially or are unrelated to each other.
In future work, the proposed system can be applied to other NLP problems. As is evident from the obtained results, there is scope for augmenting the Hindi datasets for further improvements in performance. There is also scope for introducing a dataset for the TABSA task in Indian languages, as there is no dataset available for the same purpose.