Incorporating Word Significance into Aspect-Level Sentiment Analysis

Aspect-level sentiment analysis has drawn growing attention in recent years, with higher performance achieved through the attention mechanism. Despite this, previous research does not consider human psychological evidence relating to language interpretation. As a result, attention is paid to less significant words, especially when the aspect word is far from the relevant context word or when an important context word is found at the end of a long sentence. We design a novel model that uses word significance to direct attention towards the most significant words, with novelty decay and incremental interpretation factors working together as an alternative to position based models. The incremental interpretation factor represents the maximization of the degree to which each newly encountered word contributes to the sentiment polarity, and a counterbalancing stretched exponential novelty decay factor represents decaying human reaction as a sentence gets longer. Our findings support the hypothesis that the attention mechanism needs to be applied to the most significant words for sentiment interpretation and that novelty decay is applicable in aspect-level sentiment analysis with a decay factor β = 0.7.


Introduction
Aspect-level sentiment analysis is a prominent task in Natural Language Processing (NLP) and has attracted growing attention in the research and business communities, as it identifies sentiment as a key driver of human behavior [1,2]. Given a sentence with context and aspect words, this task aims to infer a positive, negative or neutral sentiment polarity of a context word towards an aspect word. For example, in the sentence "Well, it happened because of a graceless manager and a rude bartender who had us waiting 20 min for drinks and then tells us to chill out", the aspect "drinks" would be assigned a neutral polarity, while the aspect words "waiting," "manager" and "bartender" would have a negative polarity. Recent advances in aspect-level sentiment analysis have gravitated towards neural network based models [3,4]. Such models learn text representations from high dimensional data without careful feature engineering and capture semantic relations between context and aspect words in a more scalable manner [5,6]. Researchers have further used the attention mechanism in conjunction with Recurrent Neural Networks (RNNs) as effective sequence modeling techniques to achieve higher model accuracy [7,8]. Continuing with the example sentence above, one can see that the context word "graceless" is more relevant to the aspect word "manager," the word "rude" is more applicable to the word "bartender", while the words "20 min" are more pertinent to the word "waiting." In this example, the attention mechanism is used to highlight the most influential context word with respect to an aspect word. In existing approaches, the magnitude of attention on a context word correlates with its relevance to the aspect word in predicting the output at each position of a sentence [9,10].
These existing approaches model the attention weight based on explicit positional encodings [11,12] or generate target specific sentence representations [5,7] but there are still some outstanding limitations in these models as they overlook the notion that humans use attention filters to direct attention towards items deemed as important.
The significance of attention filters is highlighted by psychologists, who define attention as a basic component of human cognitive biology which is limited in terms of capacity and duration [13,14]. An everyday example of this limit is that one cannot effectively listen to a telephone conversation if someone else in the room is simultaneously giving complex instructions. Consequently, humans are selective towards stimuli [15], and this is supported by theories of language reading comprehension. These theories illustrate that text elements are first processed at a minimal level and graded for significance, then extra attention is paid to elements in proportion to their significance [16,17].
Against this backdrop, the first problem with current aspect-level sentiment analysis models is that they pay attention to less significant words, as they allocate attention weights to words without grading them for significance. For example, it is common for a writer to use italics or bold to draw attention to important words that may otherwise be missed in a sentence. Taking the sentence "Great food, REASONABLE prices, makes for an evening that can't be beat" as an example, human attention is drawn to the word "REASONABLE" with respect to the aspect word "prices" because of the capitalization. Current methods bypass this grading step, as most consist only of embedding, contextual, attention and output layers [2,8,12], so attention is not directed to the most significant words.
The second problem in current models is overlooking important words that may be at the end of a sentence. Such models ignore the notion that humans make incremental interpretations of a sentence by maximizing the degree to which each newly encountered word affects sentence interpretation [18,19]. This implies that the ultimate sentiment polarity keeps being reviewed as more words are accessed. For example, in the sentence "My day off after a wedding consist of wii zelda and chinese food. I feel like im in jr high. This rocks!!", the polarity towards the aspect word "wii" is neutral if only "My day off after a wedding consist of wii zelda and chinese food" is considered, and it is still neutral if the partial sentence "My day off after a wedding consist of wii zelda and chinese food. I feel like im in jr high" is considered. The ultimate sentiment polarity is established only when the ending words "This rocks!!" are considered, at which point the polarity towards the aspect word "wii" is positive. We therefore argue that incorporating incremental interpretation into aspect-level sentiment analysis may tackle sentences where the relevant aspect and context words are far apart.
The last problem current models have is determining sentiment polarity when an aspect word and the relevant context word are far apart from each other. Current models overlook the concept of decaying human reaction to repeated stimuli [20,21], under which word importance decreases as sentence length increases until a complex stimulus is triggered by an encountered context word [22,23]. Current methods do not take this quality into consideration. Global attention based models do not take the explicit position of a word into account [5,6,24,25], while position based attention assumes that a context word has higher importance if it is closer to an aspect word [11,12]. We argue that computational methods could benefit from taking psychological evidence on language processing into consideration in aspect-level sentiment analysis.
To mitigate the first problem, we propose a unified module dubbed word significance, which acts as an attention filter to guide our model to pay more attention to the most significant words. The word significance module also acts as a propagation mechanism that allows a high significance context word that may be far from an aspect word to still have influence on the sentiment polarity. The word significance factor is made up of two components: an increasing incremental interpretation factor as a solution to the second problem, counterbalanced by a novelty decay factor as a solution to the third problem. We model incremental interpretation using a growth function that increases as more words are accessed in a sentence, as previously used in determining research paper significance growth [21,26]. We propose to represent novelty decay using the stretched exponential law, which has proven successful in representing phenomena relating to human nature dynamics with a natural time span decay. To the best of our knowledge, the stretched exponential has not previously been used in NLP, particularly in aspect-level sentiment analysis. Experimental results show that our model performs competitively with current models.
The aim of this paper is to demonstrate the importance of word significance in directing attention to the most attention worthy words in aspect-level sentiment analysis. The contributions of this paper are as follows:
• We use the word significance factor to model attention worthiness in aspect-level sentiment analysis using the Significant Attention Network (SAN) model.
• We introduce two novel factors in aspect-level sentiment analysis, incremental interpretation and novelty decay, as an alternative to position based models.
• We conduct qualitative research on three real world datasets to prove the universality of stretched exponential novelty decay in aspect-level sentiment analysis.
The rest of the paper is organized as follows. We first review related works in aspect-level sentiment analysis, novelty decay and incremental interpretation in Section 2. The proposed model is then presented in Section 3, while Section 4 gives an extensive comparison of experiments conducted to test the effectiveness of the proposed model. Finally, Section 5 summarizes this work and provides future directions.

Related Works
Aspect-level sentiment analysis is a fundamental task in NLP, aimed at determining the polarity with respect to a specific aspect term. Early studies typically depend on the quality of labor intensive handcrafted lexicons, ngrams and parse trees for feature engineering [27,28,29] to train sentiment classifiers such as Support Vector Machines (SVM) [30,31]. Since effective features rely on expensive and complicated domain expert knowledge, recent methods prefer neural networks, which offer efficient generation of low dimensional sentence representations [10,32,33]. The attention mechanism, modeled on a human capability, has been used to achieve higher performance in aspect-level sentiment analysis [5,6,25]. Since we propose that word significance can be used to guide models to apply attention to the most significant words, we discuss prevailing attention, incremental interpretation and novelty decay research in the following sections.

Attention Mechanism in Aspect-Level Sentiment Analysis
Inspired by visual attention, researchers in various domains such as machine translation [34], question answering [35,36] and machine comprehension [37] have successfully used attention to achieve performance gains. When used in conjunction with neural networks, the attention mechanism performs a soft selection over hidden representations and automatically learns the most important context words for a specific target word in a sentence [8,9]. Due to numerous influential papers in attention research [34,36,38], there is a wide array of Long Short Term Memory (LSTM) attention based architectures which use attention scores to aggregate contextual features for prediction. Tang et al. [10] model the aspect word dependent left and right context and concatenate both as the final representation before prediction. Wang et al. [7] then introduce the concept of modeling aspect words with context words by concatenating aspect word embeddings to each context word, before using an LSTM and attention to generate the final representation. However, these models only consider word level attention without taking the overall sentence meaning into consideration, making it difficult to make correct predictions when related words are far apart in long sentences.
Ma et al. [5] bridge this gap through the IAN model, which separately models aspect word attention to extract mutual features of aspect and context words at sentence level. They use two attention based LSTMs to represent aspect and context words, then use the hidden context states to calculate attentions towards aspect words using a pooling operation and vice versa, thus capturing important parts in the interaction of both context and aspect words. Huang et al. [6] modify this approach through the AOA-LSTM model, which forgoes the pooling operation because it ignores interactions between word pairs. These models generate mutual attentions to concentrate on important aspect and context words at sentence level; however, they do not explicitly take the distance between aspect and context words into consideration. Fan et al. [25] therefore enhance LSTM hidden states with position encodings before applying interactive attention, while Li et al. [11] apply position weightings to transformed LSTM outputs before applying a single convolution to extract important features. Liu et al. [24] also use a position weighted memory module in conjunction with sentence level attention to capture aspect global attention and context attention, simultaneously capturing the order and correlations of words. However, these methods assume that a context word closer to an aspect word has higher importance, which is not always the case. Our model presents novelty decay and incremental interpretation as an alternative to position weighted models, with attention being guided by word significance.

Novelty Decay
It has been observed that repeated exposure to a stimulus has diminishing effects; this phenomenon is described as novelty decay [20]. This concept has been widely accepted in the modeling of attention dynamics in research journals [26,39] and social media [15,40,41], but there is no consensus on whether the universal distribution shape of novelty decay is exponential [26,42] or logarithmic [39,43]. The main disadvantage of exponentially shaped novelty decay is that it makes wrong predictions over short decay periods [21,42,44], which has motivated stretched exponential shaped novelty decay, capable of describing data over many orders of magnitude [45,46]. The stretched exponential law provides a simple, economical mechanism for novelty decay as it has only one extra parameter β as a novelty decay factor [47,48]. Accordingly, Wu et al. [21] use a stretched exponential based model to simulate novelty decay in collective attention to online stories, while Laherrere et al. [47] find stretched exponential based models to perform better in paper citation dynamics. Asur et al. [49] also use the stretched exponential in modeling novelty decay in Twitter trends, and Feng et al. [50] go further by modeling influence maximization and novelty decay in Twitter. Our model examines the relevance of novelty decay in aspect-level sentiment analysis, to determine whether it can be characterized by the stretched exponential law.

Incremental Interpretation
One of the basic assumptions of language interpretation is that it proceeds incrementally by seeking to maximize the degree of interpretation with each new word encountered [19]. Researchers have consequently adopted a divide and conquer approach to guide the interpretation process, where the ultimate contribution of a context word to the sentiment polarity prediction is not determined by the individual word alone but is instead an accumulation of its linguistic relations with the rest of the sentence [51,52]. This idea has been applied in psycholinguistic research on sentence and dialogue comprehension [18,53], but not in neural network based models of sentiment analysis.

Proposed Model
In this section we provide a detailed description of the proposed model, with a high-level illustration of the model in Figure 1. We first provide a task definition of aspect-level sentiment analysis, then look at the word embedding and contextual (LSTM) layers. We then examine the word significance and interactive attention layers, and lastly present the output layer and the training details of our model.

Task Definition
In this aspect-level sentiment classification task, we have a sentence $s = [w_1, w_2, \dots, w_i, \dots, w_n]$ consisting of $n$ words and an aspect list $a = [w_i, w_{i+1}, \dots, w_{i+m-1}]$ consisting of $m$ words. Taking the polarity and the sentence-aspect word pair as input, the goal is to predict the polarity of sentence $s$ towards aspect $a$.

Word Embedding Layer
The word embedding layer maps each word to a fixed, low dimensional vector using pretrained GloVe embeddings [54]. Each word $w_i$ is represented by a vector $v_i \in \mathbb{R}^{d_w}$ from a lookup matrix in $\mathbb{R}^{V \times d_w}$, where $V$ is the vocabulary size and $d_w$ is the embedding dimension. After the embedding lookup we get two sequences of vectors, $[v_1; v_2; \dots; v_n] \in \mathbb{R}^{n \times d_w}$ for the context words and $[v_i; v_{i+1}; \dots; v_{i+m-1}] \in \mathbb{R}^{m \times d_w}$ for the aspect words.
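The lookup described above can be illustrated with a minimal sketch. The tiny vocabulary, the three-dimensional vectors and the helper name `embed` are all hypothetical and chosen only for illustration; the paper itself uses 300-dimensional pretrained GloVe vectors.

```python
from typing import Dict, List

def embed(tokens: List[str], vocab: Dict[str, int],
          table: List[List[float]]) -> List[List[float]]:
    """Return one d_w-dimensional row of `table` per token (index 0 = unknown word)."""
    return [table[vocab.get(tok, 0)] for tok in tokens]

# Hypothetical vocabulary (V = 4) and lookup table (V x d_w with d_w = 3).
vocab = {"<unk>": 0, "rude": 1, "bartender": 2, "drinks": 3}
table = [
    [0.0, 0.0, 0.0],    # <unk>
    [0.9, -0.4, 0.1],   # rude
    [0.2, 0.5, -0.3],   # bartender
    [0.1, 0.1, 0.7],    # drinks
]

sentence = ["rude", "bartender", "drinks"]         # n = 3 context words
aspect = sentence[1:2]                             # m = 1 aspect word
context_vectors = embed(sentence, vocab, table)    # n x d_w
aspect_vectors = embed(aspect, vocab, table)       # m x d_w
```

The aspect vectors are simply the rows of the context sequence at the aspect positions, which matches the index range $i, \dots, i+m-1$ used in the task definition.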

Contextual Layer
We feed the vectors from the previous layer into two bidirectional LSTMs (BiLSTMs) to capture temporal interactions in the context and aspect words respectively. Each BiLSTM is created using two stacked LSTMs to learn long term dependencies and avoid vanishing gradients. The forward LSTM produces hidden states $\overrightarrow{h}_s \in \mathbb{R}^{n \times d}$, where $d$ is the hidden state dimension. Given embedding $x_t$ at each time step $t$, the forward LSTM follows the standard gated update, where $\sigma$ is the sigmoid activation function used to control the outflow of irrelevant information and $i_t$, $f_t$ and $o_t$ are the input, forget and output gates respectively, which control the inflow of information.
In this update, the symbol "*" denotes element-wise multiplication and the symbol "·" denotes matrix multiplication. The backward LSTM performs a similar process to produce $\overleftarrow{h}_s \in \mathbb{R}^{n \times d}$, and the forward and backward hidden states are concatenated as the output of the BiLSTM. Given sentence $s$ and aspect $a$, we separately run two BiLSTMs to obtain the sentence output $h_s \in \mathbb{R}^{n \times 2d}$ and the aspect output $h_a \in \mathbb{R}^{m \times 2d}$.
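As a rough illustration of the contextual layer, the sketch below implements a textbook LSTM step and a single-layer bidirectional pass in plain Python. It assumes the common formulation in which each gate is computed from the concatenation $[h_{t-1}; x_t]$; the parameter names (`W_i`, `b_i`, and so on) and the helper functions are hypothetical, and the exact parameterization used by the authors may differ.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def affine(W: List[Vector], v: Vector, b: Vector) -> Vector:
    """Compute W · v + b (matrix-vector product plus bias)."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM step; every gate reads the concatenation [h_{t-1}; x_t]."""
    hx = h_prev + x
    i = [sigmoid(v) for v in affine(p["W_i"], hx, p["b_i"])]    # input gate
    f = [sigmoid(v) for v in affine(p["W_f"], hx, p["b_f"])]    # forget gate
    o = [sigmoid(v) for v in affine(p["W_o"], hx, p["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(p["W_g"], hx, p["b_g"])]  # candidate state
    # "*" in the text: element-wise products of gates and states.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def run_lstm(xs: List[Vector], p: Dict[str, list], d: int) -> List[Vector]:
    """Run an LSTM over a sequence and return all hidden states (n x d)."""
    h, c, states = [0.0] * d, [0.0] * d, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        states.append(h)
    return states

def bilstm(xs: List[Vector], fwd: Dict[str, list], bwd: Dict[str, list],
           d: int) -> List[Vector]:
    """Concatenate forward and backward hidden states per position (n x 2d)."""
    forward = run_lstm(xs, fwd, d)
    backward = run_lstm(xs[::-1], bwd, d)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Running `bilstm` once on the context vectors and once on the aspect vectors, each with its own parameters, would give outputs with the shapes of $h_s$ and $h_a$ described above.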

Word Significance Layer
We introduce the word significance layer, which guides the attention layer to the most significant words for sentiment prediction. The word significance factor is determined using a growth model in which incremental interpretation is counterbalanced by a novelty decay factor. We first represent word significance $a_t$ without novelty decay at time $t$ in a sentence as $a_t = (1 + z_t)\, a_{t-1}$, where the incremental interpretation variable $z_t \in \mathbb{R}^n$ and the initial word significance $a_0$ are learned variables. To incorporate novelty decay, this update is adjusted with a novelty decay variable $r_t \in \mathbb{R}^n$ that discounts incremental interpretation. Novelty decay is parameterized so that $r_1 = 1$ and $r_t \downarrow 0$ as $t \uparrow \infty$, following the stretched exponential law with decay factor $0 < \beta < 1$.
The same process is repeated for context words. The final significant word vector for context words is represented as $a_s = [a_1; a_2; \dots; a_n]$ and the final significant word vector for aspect words is represented as $a_a = [a_1; a_2; \dots; a_m]$. The significant context representation $\hat{h}_s$ and significant aspect representation $\hat{h}_a$ are derived using element-wise multiplication, $\hat{h}_s = a_s * h_s$ and $\hat{h}_a = a_a * h_a$, where $a_s \in \mathbb{R}^n$, $a_a \in \mathbb{R}^m$, $\hat{h}_s \in \mathbb{R}^{n \times 2d}$ and $\hat{h}_a \in \mathbb{R}^{m \times 2d}$.
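To make the growth model more concrete, the following sketch computes word significance along a sentence. It rests on two assumptions that are not stated explicitly in the text: that novelty decay enters the recurrence multiplicatively as $a_t = (1 + r_t z_t)\, a_{t-1}$, and that the stretched exponential takes the form $r_t = \exp(-(t-1)^{\beta})$, which satisfies $r_1 = 1$ and $r_t \downarrow 0$. The function names are hypothetical, and the values of $z_t$ are fixed by hand here, whereas in the model they are learned.

```python
import math
from typing import List

def novelty_decay(t: int, beta: float = 0.7) -> float:
    """Assumed stretched exponential r_t = exp(-(t - 1)**beta), so that r_1 = 1."""
    return math.exp(-((t - 1) ** beta))

def word_significance(z: List[float], a0: float = 1.0,
                      beta: float = 0.7) -> List[float]:
    """Assumed recurrence a_t = (1 + r_t * z_t) * a_{t-1} for t = 1..n."""
    a, out = a0, []
    for t, z_t in enumerate(z, start=1):
        a = (1.0 + novelty_decay(t, beta) * z_t) * a
        out.append(a)
    return out

def apply_significance(a: List[float], h: List[List[float]]) -> List[List[float]]:
    """Scale each hidden state h_t (a row of length 2d) by its significance a_t."""
    return [[a_t * v for v in row] for a_t, row in zip(a, h)]

# Hand-picked incremental interpretation values for a 4-word sentence.
z = [0.5, 0.1, 0.1, 0.8]
a_s = word_significance(z)
h_s = [[1.0, -1.0]] * 4            # toy hidden states with 2d = 2
h_s_hat = apply_significance(a_s, h_s)
```

Because every factor $(1 + r_t z_t)$ is at least 1 when $z_t \ge 0$, significance accumulates along the sentence, while the decaying $r_t$ shrinks the contribution of later words unless their $z_t$ is large.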

Interactive Attention Layer
Given representations $\hat{h}_s \in \mathbb{R}^{n \times 2d}$ and $\hat{h}_a \in \mathbb{R}^{m \times 2d}$, we calculate our attention to exploit mutual information between the two representations, following a methodology similar to the AOA-LSTM model [6] as shown in Figure 2. Using the significant aspect and context representations, we calculate the interaction matrix $K = \hat{h}_s \cdot \hat{h}_a^T \in \mathbb{R}^{n \times m}$ using a dot product, with each value representing a context-aspect pair.
Using a column-wise softmax we get the aspect to context attention $\alpha \in \mathbb{R}^{n \times m}$, while a row-wise softmax gives the context to aspect attention $\delta \in \mathbb{R}^{n \times m}$. We then use column-wise averaging of $\delta$ to get the aspect specific attention $\bar{\delta} \in \mathbb{R}^m$. The final sentence level attention $\gamma \in \mathbb{R}^n$ is the weighted sum of each aspect to context attention, $\gamma = \alpha \cdot \bar{\delta}^T$, as given by Equation (14).
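The attention computation described above can be sketched in plain Python as follows. The sketch follows the attention-over-attention pattern as we read it from the text: an interaction matrix, a softmax in each direction, an average, and a final weighted sum. The function names are hypothetical, and the toy inputs stand in for the significant context and aspect representations.

```python
import math
from typing import List

Matrix = List[List[float]]

def softmax(v: List[float]) -> List[float]:
    """Numerically stable softmax over a list."""
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M: Matrix) -> Matrix:
    return [list(col) for col in zip(*M)]

def interactive_attention(h_s: Matrix, h_a: Matrix) -> List[float]:
    """Return the sentence level attention gamma (length n)."""
    # Interaction matrix K (n x m): dot product of each context-aspect pair.
    K = [[sum(x * y for x, y in zip(s, a)) for a in h_a] for s in h_s]
    # Column-wise softmax: each aspect word distributes attention over context words.
    alpha = transpose([softmax(col) for col in transpose(K)])
    # Row-wise softmax: each context word distributes attention over aspect words.
    delta = [softmax(row) for row in K]
    # Column-wise average of delta gives one weight per aspect word (length m).
    delta_bar = [sum(col) / len(K) for col in transpose(delta)]
    # gamma = alpha · delta_bar: weighted sum of the aspect to context attentions.
    return [sum(a * d for a, d in zip(row, delta_bar)) for row in alpha]

# Toy example with n = 3 context words, m = 1 aspect word and 2d = 2.
gamma = interactive_attention([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], [[1.0, 0.0]])
```

Since each column of `alpha` sums to one and `delta_bar` sums to one, `gamma` is itself a probability distribution over the context words.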

Output Layer
The final sentence representation is a weighted sum of the sentence hidden semantic states $h_s \in \mathbb{R}^{n \times 2d}$ using the attention score $\gamma \in \mathbb{R}^n$: $m = h_s^T \cdot \gamma$ (15), where $m \in \mathbb{R}^{2d}$. This final representation is then fed into a linear layer to project $m$ into the space of $C$ classes, $q = W_q \cdot m + b_q$, where $W_q$ and $b_q$ are the weight and bias respectively, $q \in \mathbb{R}^C$, and $C$ is set to 3 as the number of sentiment polarity classes. After the linear layer we use a softmax layer to determine the polarity.
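A minimal sketch of the output layer is given below, assuming the linear projection has the usual form $q = W_q \cdot m + b_q$. The weights and inputs are made-up toy values and the function names are hypothetical.

```python
import math
from typing import List

def softmax(v: List[float]) -> List[float]:
    """Numerically stable softmax over a list."""
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_representation(h_s: List[List[float]],
                            gamma: List[float]) -> List[float]:
    """m = h_s^T · gamma: attention-weighted sum of the n hidden states (length 2d)."""
    width = len(h_s[0])
    return [sum(g * row[k] for g, row in zip(gamma, h_s)) for k in range(width)]

def classify(m: List[float], W_q: List[List[float]],
             b_q: List[float]) -> List[float]:
    """Project m to C class scores with a linear layer, then apply softmax."""
    q = [sum(w * x for w, x in zip(row, m)) + b for row, b in zip(W_q, b_q)]
    return softmax(q)

# Toy example: n = 2, 2d = 2 and C = 3 polarity classes.
h_s = [[1.0, 2.0], [3.0, 4.0]]
gamma = [0.25, 0.75]
m = sentence_representation(h_s, gamma)           # [2.5, 3.5]
W_q = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]       # hypothetical weights (C x 2d)
b_q = [0.0, 0.0, 0.0]
probs = classify(m, W_q, b_q)
predicted_class = probs.index(max(probs))
```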

Model Training
The model is trained in an end-to-end manner through back-propagation, using cross-entropy loss with $L_2$ regularization as the objective function, $\mathcal{L} = -\sum_{c=1}^{C} y^c \log \hat{y}^c + \lambda_r \lVert \Theta \rVert_2^2$, where $y$ is the target sentiment distribution, $\hat{y}$ is the predicted sentiment distribution, $\lambda_r$ is the $L_2$ regularization coefficient and $\Theta$ is the parameter set.
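The objective can be illustrated for a single training example as follows. This is a sketch under the assumption that the regularizer is the squared $L_2$ norm of the parameters scaled by $\lambda_r$; the function name `cross_entropy_l2`, the flattened parameter list `theta` and the numeric values are hypothetical.

```python
import math
from typing import List

def cross_entropy_l2(y: List[float], y_hat: List[float],
                     theta: List[float], lam: float) -> float:
    """Cross-entropy between target y and prediction y_hat, plus lam * ||theta||^2."""
    eps = 1e-12  # guards against log(0) for zero-probability predictions
    cross_entropy = -sum(t * math.log(p + eps) for t, p in zip(y, y_hat))
    l2_penalty = lam * sum(w * w for w in theta)
    return cross_entropy + l2_penalty

# One-hot target (second class) and a predicted distribution over C = 3 classes.
y = [0.0, 1.0, 0.0]
y_hat = [0.2, 0.5, 0.3]
theta = [1.0, 2.0]  # hypothetical flattened parameters
loss = cross_entropy_l2(y, y_hat, theta, lam=0.1)
```

With a one-hot target the cross-entropy term reduces to the negative log probability assigned to the correct class, so the loss falls as the model grows more confident in the right polarity.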

Datasets and Parameter Setting
We conduct experiments on three datasets: the ACL 14 Twitter dataset [4] and the SemEval 2014 Task 4 Restaurant and Laptop review datasets [55]. Experienced annotators tagged the aspect terms and their polarities as positive, negative or neutral. Table 1 illustrates the distribution of sentiment polarity. We initialize word embeddings with 300-dimensional GloVe vectors [54] with a vocabulary size of 1.9 M. The dimension of the LSTM hidden states is set to 300, the $L_2$ regularization coefficient is set to $10^{-5}$ and the dropout rate is 0.1. The Adam optimizer with a learning rate of 0.01 is used to update parameters, and we adopt Accuracy and Macro-F1 as evaluation metrics.
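For reference, the two evaluation metrics can be computed as in the sketch below. The label names and the toy predictions are invented for illustration and are not taken from the datasets.

```python
from typing import List, Sequence

def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of examples whose predicted polarity matches the gold polarity."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(labels)

labels = ["positive", "negative", "neutral"]
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, labels)
```

Macro-F1 weights every class equally, so a model that neglects a smaller class, such as the missed neutral example in the toy data above, is penalized more heavily than under Accuracy.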

Baseline Comparison
In order to adequately evaluate and analyze the performance of our model, we use the following models as baselines:
• Majority is a basic baseline method which assigns every example in the test set the most frequent sentiment polarity in the training set.
• Feature-SVM [29] uses an SVM classifier based on lexicon, parse and ngram features to achieve state-of-the-art performance.
• LSTM uses an LSTM network to learn hidden states without considering aspect words and uses the averaged vector as the sentence representation to predict polarity.
• AE-LSTM [7] models context word representations using an LSTM, then combines the hidden states with aspect embeddings to generate attention weights for classification.
• ATAE-LSTM [7] improves on the AE-LSTM by appending the aspect embedding to each word embedding in representing the context.
• IAN [5] individually models context and aspect words in attention based LSTMs using interactive attention. The two representations are then concatenated to predict polarity.
• AOA-LSTM [6] uses an interaction mechanism, based on the individual hidden states of aspect and context words, to focus on the important context words.
We also list the ISAN model variants to analyze the effects of stretched exponential attention:
• ISAN-I: models word significance based on incremental interpretation; the final prediction is based on an interactive attention module and word significance.
• ISAN-D: models word significance based on novelty decay and includes an interactive attention module.
• ISAN: the complete interactive significant attention model.
Table 2 illustrates the model comparison results of the baseline methods. We observe that the Majority method achieves the worst performance as it only uses data distribution information, while Feature-SVM achieves improved performance on all datasets because of well designed features. Both models are outperformed by our ISAN model and the other LSTM based models, which automatically learn high quality representations for prediction. The LSTM method has the lowest performance among the neural network based methods because it treats aspect and context words equally. This highlights the importance of aspect words as illustrated by Jiang et al. [56]. Consequently, the ATAE-LSTM achieves higher performance as it introduces aspect embeddings together with the attention mechanism. Our proposed model outperforms these methods as it exploits mutual information between the aspect and context representations through attention interaction. The IAN model achieves improved performance in comparison to the previously mentioned LSTM based methods as it interactively learns the aspect and context attention weights. However, our model and the AOA-LSTM outperform it because IAN uses a pooling operation to get the final representation. The AOA-LSTM model achieves higher performance as it replaces the pooling operation in the IAN with an interaction matrix, in which each value represents a context-aspect pair, to produce mutual attentions that concentrate on important aspect and context words.
Our complete ISAN model performs competitively with the state of the art as it incorporates word significance. Table 3 illustrates the performance comparison of the ISAN model versions. The efficacy of incremental interpretation in aspect-level sentiment analysis is illustrated by the ISAN-I model, showing the importance of considering relevant aspect words that may be at the end of a sentence. This model is outperformed by the novelty decay based ISAN-D model, meaning that when these two factors are considered individually, the impact of decaying novelty throughout a sentence is stronger. The influence of novelty decay implies that, irrespective of the length of a sentence, novelty is still an important factor, as the stretched exponential law does not allow novelty to become zero even for very long sentences. The performance of our model improves when novelty decay acts as a discounting factor of incremental interpretation, as shown by the complete ISAN results, leading to 3.5%, 1.8% and 1% performance increases on the Laptop, Restaurant and Twitter datasets respectively.
We also collected statistics on the aspect word position in sentences from the Restaurant, Twitter and Laptop datasets, as shown in Table 4. When each sentence is divided into three segments by length, the majority of aspect words are located in the last segment of the sentence for all datasets. This shows the importance of simultaneously considering words at the beginning and at the end of a sentence when determining sentiment polarity. The results also illustrate the effectiveness of word significance in propagating the significance grading of a context word that may be far from an aspect word. Our results support our intuition that word significance can be used as a representation of how much a word contributes to the sentence interpretation, and that it ultimately draws our model to words that are worthy of attention.
This study demonstrates the relevance of word significance in aspect-level sentiment analysis, and the approach can be extended to other machine learning tasks based on the attention mechanism. Word significance may be used to incorporate attention filters which guide the attention mechanism to items more worthy of attention, especially in other NLP tasks where there is a possibility of time based novelty decay with repeated exposure. Depending on the field, the stretched exponential decay factor ranges between 0 and 1 as shown in Figure 3 [48], putting emphasis on initial occurrences of an item without nullifying its later occurrences. Our results are also in line with previous studies which illustrate the universality of novelty decay in fields investigating social media and economics [21,26,42,57], which may be further investigated using neural network based models. Ultimately, our model strives to closely represent human attention mechanisms in machine learning tasks.

The Effects of Novelty Decay
In order to validate the effects of novelty decay, we vary the novelty decay factor β as shown in Figure 4. From the results we can see that the ISAN model reaches its peak Macro-F1 when the novelty decay factor β = 0.7, after which there is a general decline in performance. The value of 0.7 is larger than the value of 0.4 found by Wu et al. [21] for collective attention decay on websites, implying faster novelty decay in aspect-level sentiment analysis. Asur et al. [49] find the novelty decay factor in new story detection on Twitter to decrease from 0.8 to 0.4 as the number of stories increases, while Feng et al. [50] find novelty decay to range from 0.8 to 0.3 as the number of people in social networks increases. The novelty decay factor of our model is towards the upper end of the range reported in previous studies, possibly due to the lower number of words in a sentence in comparison to the number of stories or people on social media.
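The claim that a larger β implies faster decay can be checked numerically with a small sketch. It assumes a stretched exponential of the form $r_t = \exp(-(t-1)^{\beta})$, which is one form consistent with $r_1 = 1$ and $r_t \downarrow 0$; the exact expression used in the model may differ, and the function name is hypothetical.

```python
import math

def novelty_decay(t: int, beta: float) -> float:
    """Assumed stretched exponential r_t = exp(-(t - 1)**beta), so that r_1 = 1."""
    return math.exp(-((t - 1) ** beta))

# Compare the decay curves for the two exponents discussed in the text.
positions = (1, 2, 5, 10, 20)
for beta in (0.4, 0.7):
    curve = [round(novelty_decay(t, beta), 4) for t in positions]
    print(f"beta={beta}: {curve}")
```

Under this assumed form both curves start at 1 and coincide at the second word, after which the β = 0.7 curve falls below the β = 0.4 curve at every later position, while neither reaches exactly zero for finite sentence lengths.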

Case Study
In this section we perform a case study on the ISAN model using two sentences as shown in Figure 5, with a deeper color intensity representing a higher weight allocated to a word with respect to an aspect word. We first use the review sentence "as much i like the food there i cant bring myself to go back" from the Restaurant dataset, with "food" as the aspect word. Applying the ISAN model gives the correct sentiment polarity prediction, which is negative. Although the local sentiment towards food is positive, the model puts higher weight on the words "cant bring" even though they are several words away from the aspect word "food". This shows that our model is able to counterbalance novelty decay with incremental interpretation, allocating a higher attention weight to a word that is further down the sentence in order to detect negation and make the correct prediction. The attention given to common words such as "as" and "the" is also relatively low, verifying the intuition that such words do not contribute much to the sentiment polarity. The second sentence for our case study is "highly recommend this as great value for excellent sushi and service", where "service" is the aspect word as shown in Figure 6. Our model correctly predicts a positive sentiment polarity, with higher attention given to the words "great value" instead of the word "excellent", which according to our intuition is more relevant to the word "sushi".

Error Analysis
The first type of error our model encounters is related to non-compositional sentiment expression, which also appears in numerous previous works [6,10]. An example is the sentence "we requested they re-slice the sushi and it was returned to us in small cheese-like cubes", where there is no direct sentiment towards the aspect "sushi". This problem is further highlighted in the sentence "entrees include classics like lasagna, fettuccine alfredo and chicken parmigiana", where the aspect words "lasagna", "fettuccine" and "alfredo" likewise carry no direct sentiment. Another type of error our model encounters is caused by the use of complex sentences like "Anywhere else, the price would be 3x as high!", where our model fails to give the word "3x" a high attention weight.

Conclusions
In this paper, we propose the ISAN model, which uses word significance to guide the attention mechanism towards the words most worthy of attention. We first use BiLSTMs to separately model context and aspect hidden states, then apply a word significance module based on incremental interpretation and novelty decay to determine sentiment polarity. Our results highlight the importance of simultaneously considering incremental interpretation and novelty decay in determining polarity, showing that our model can be an alternative to position based models in aspect-level sentiment analysis. We perform experiments on three public datasets to demonstrate the applicability of stretched exponential novelty decay in aspect-level sentiment analysis and establish a novelty decay factor of 0.7. Although the results illustrate the effectiveness of word significance, there is still room for improvement, as illustrated by the error analysis.