LCF: A Local Context Focus Mechanism for Aspect-Based Sentiment Classification

Abstract: Aspect-based sentiment classification (ABSC) aims to predict the sentiment polarities of different aspects within sentences or documents. Many previous studies have addressed this problem, but they fail to exploit the correlation between an aspect's sentiment polarity and its local context. In this paper, a Local Context Focus (LCF) mechanism is proposed for aspect-based sentiment classification based on Multi-Head Self-Attention (MHSA). The LCF design utilizes Context-features Dynamic Mask (CDM) and Context-features Dynamic Weighted (CDW) layers to pay more attention to local context words. Moreover, a BERT-shared layer is adopted in the LCF design to capture long-term dependencies within the local context and the global context. Experiments are conducted on three common ABSC datasets: the laptop and restaurant datasets of SemEval-2014 and the ACL Twitter dataset. Experimental results demonstrate that the LCF baseline model achieves considerable performance. In addition, we conduct ablation experiments to prove the significance and effectiveness of the LCF design. In particular, by incorporating the BERT-shared layer, the LCF-BERT model refreshes state-of-the-art performance on all three benchmark datasets.


Introduction
Aspect-based sentiment classification (ABSC) [1,2] is a fine-grained Natural Language Processing (NLP) task and a significant branch of sentiment analysis [3][4][5][6]. Traditional sentiment analysis approaches [7][8][9][10] mainly focus on inferring sentence-level or document-level sentiment polarities (typically positive, negative, or neutral in triple-classification). However, the ABSC task, also known as aspect-based sentiment analysis (ABSA), differs from traditional sentiment analysis: it aims to predict independent sentiment polarities for targeted aspects within the same sentence or document. Aspect-based sentiment classification helps people make full use of the sentiment polarity, subjectivity, and further information about targeted aspects. ABSC datasets are composed of plenty of contexts and aspects. For example, reviews from customers usually comment on different aspects, and different aspects may deliver different sentiment polarities. Consider a sentence like "while the food is so good and so popular that waiting can really be a nightmare". Clearly, the customer compliments the food but criticizes the service, because the restaurant makes the customer wait for a long time. Generally, traditional sentence-level or document-level sentiment polarity mining methods cannot precisely predict polarities for specific aspects, as they do not consider the fine-grained polarities of different aspects.
Deep Neural Networks (DNNs), such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), have been employed in NLP tasks in recent years, and such DNN-based models have become the mainstream approach to the ABSC task. The main contributions of this paper are as follows.

• This paper proposes LCF design models, which utilize self-attention to capture local context features and global context features concurrently. LCF design models combine local and global context features to infer the sentiment polarity of the targeted aspect.

• We introduce SRD to evaluate the dependency between contextual words and aspects. SRD is significant for determining the local context; the features of contextual words within the SRD threshold are preserved and focused on.

• This paper implements CDM and CDW layers to enforce that LCF design models pay more attention to the local context words of a specific aspect. The CDM layer focuses on the local context by masking the output representations of less-semantic-relative contextual words, while the CDW layer weakens the features of less-semantic-relative contextual words according to their SRD.

• Experiments are conducted on ablated LCF design models to evaluate the significance and effectiveness of the LCF design architectures. Extra experiments are also carried out to evaluate the effectiveness of different SRD thresholds.

Related Works
In recent years, a variety of methods have been introduced to deal with the ABSC task. In this section, we introduce related work on aspect-level sentiment classification, covering traditional machine learning methods and deep learning methods.

Traditional Machine Learning Methods
In general, traditional machine learning approaches [16,17] for the ABSC task are primarily based on feature engineering. This means that a lot of time is spent collecting and analyzing data, designing features based on the characteristics of the dataset, and obtaining enough language resources to construct lexicons. The Support Vector Machine (SVM) [16] is a traditional machine learning method that has been applied to aspect-level sentiment classification with considerable performance. However, as with most traditional machine learning methods, designing features manually is burdensome and inefficient. In addition, when the dataset changes, the performance of the method is greatly affected. Therefore, methods based on traditional machine learning have poor generality and are difficult to apply across a variety of datasets.

Deep Learning-Based Methods
Recent works increasingly build on Neural Networks (NNs), because NN-based methods are equipped with a remarkable ability to capture raw features, mapping them into continuous, low-dimensional vectors without feature engineering.
Word embedding [18] is the basis of most DNN-based methods; it represents natural language as continuous low-dimensional vectors and learns the internal features of natural language by operating on those vectors. Word2vec [19], PV [20], and GloVe [21] are pretrained word embeddings, all trained on large text corpora (a typical source is the corpus of Wikipedia). DNN-based methods map each word into a vector and learn the vector representations of the words according to the word embeddings. Pretrained word embeddings can not only accurately reflect the relationships between words but also significantly improve the performance of DNN-based models.
The attention mechanism [22] is applied in plenty of DNN-based models and improves the performance of most of them. The attention mechanism takes advantage of the semantic correlation between context and aspect to calculate attention weights for context words, enforcing DNN-based models to obtain fine-grained aspect-level sentiment polarity. DNNs have become very popular and play a more important role in NLP tasks than traditional machine learning methods, especially RNNs and CNNs. However, neural networks usually apply backpropagation to update the weights of the hidden layers. When the network is deep, the gradient vanishing problem occurs, which causes the weights of the hidden layers close to the input layer to be updated very slowly; this is a long-standing issue in neural networks. Long Short-Term Memory (LSTM) [23] is an advanced RNN architecture that can alleviate the gradient vanishing problem. However, like most RNNs, LSTMs can hardly be trained in parallel and tend to be time-consuming, since they are time-serial neural networks. Moreover, LSTMs are not well suited to modeling the interactive correlation of context and aspect, which can cause a tremendous loss of aspect information. TD-LSTM [24] is an RNN-based architecture that can obtain context features from both the left and the right side. ATAE-LSTM [12] applies an attention mechanism and assembles the representations of aspect and context by concatenating them, enabling aspects to participate in computing attention weights.
Previous works regard targeted aspects as independent, auxiliary information. However, experimental results show that such methods bring limited effectiveness improvements for the ABSC task. IAN [13] generates representations of context and aspect and applies an attention mechanism to interactively learn the features of the context and the targeted aspect, which enhances the interactive learning of aspect and context. IAN first proposed the interactive learning of context and aspect words. RAM [25] adopts a multilayer architecture based on bidirectional LSTMs, where each layer contains attention-based aggregation of token features and Gated Recurrent Units (GRUs [26]) to learn the sentence features. For the first time, RAM noticed that different contexts contribute to learning to varying degrees. MGAN [27] introduces a novel multigrained attention network, which uses a fine-grained attention mechanism to capture the word-level interaction between aspect and context.
A notable trend is that pretrained models have gradually become a research hotspot for the ABSC task. The main characteristic of a pretrained model is that a highly universal Language Model (LM) is trained on massive corpus resources; the pretrained model can then be applied to a large number of NLP tasks and significantly improve the performance of each. ELMo [28] and GPT [29], based on the LSTM and the Transformer, respectively, are pretrained language models designed to improve the performance of many NLP tasks. In addition, BERT-PT [30] explores a novel post-training approach for the question answering (QA) task on the pretrained BERT language model, which can be adapted to the aspect-level sentiment classification task. BERT-SPC is the BERT sentence-pair classification model, adapted to the ABSC task by the authors of [31]. BERT-SPC prepares the input sequence by appending the aspect to the context, regarding context and aspect as two segments.

Methodology
In this paper, the LCF design is implemented with two alternative embedding layers. In the LCF baseline model, GloVe [21] word embedding is adopted as the embedding layer to accelerate the learning process and attain better performance; this baseline model is called LCF-GloVe. In the other architecture of the LCF design, we substitute the GloVe embedding layer and feature extractor layer with a BERT-shared layer, yielding the LCF-BERT model. An overview of the LCF design architecture is shown in Figure 3.
Apart from the embedding layer, LCF-GloVe differs slightly from LCF-BERT in that LCF-GloVe mainly relies on Multi-Head Self-Attention instead of a BERT-shared layer to learn local context features and global context features, respectively. Moreover, the input sequences for LCF-GloVe may differ slightly between the local context processor and the global context processor; LCF-GloVe requires the whole context sequence, aspect included, as input.
The LCF-BERT model substantially outperforms current state-of-the-art models. Motivated by the BERT-SPC model, the global context processor of the LCF-BERT model adopts an input sequence consistent with BERT-SPC [31]. In the local context sequence, aspects are preserved, because the LCF design can learn the interactive relation between context and aspect, which strengthens its capability to determine the deep correlation between them.

Task Definition
For aspect-based sentiment classification, the prepared input sequences for the model generally consist of a context sequence and an aspect sequence, which enables the model to learn the correlation of context and aspect. Suppose $s = \{w_0, w_1, \ldots, w_n\}$ is an input context sequence with the aspect included; the sequence contains $n$ words including the targeted aspect. $s^t = \{w^t_0, w^t_1, \ldots, w^t_m\}$ is a targeted aspect sequence. Meanwhile, $s^t$ is a subsequence of $s$, composed of $m$ ($m \ge 1$) words.

Semantic-Relative Distance
The majority of previous works divide the input sequence into an aspect sequence and a context sequence and model their interrelation. However, this paper proposes a new idea: apart from the global context, the local context of a targeted aspect contains more significant information. Therefore, one of the most important questions is how to determine whether a contextual word belongs to the local context of a specific aspect. In order to solve this problem, this paper proposes SRD, which aims at assisting models to capture local contexts.
The LCF design counts the tokens between each contextual token and a specific aspect as the SRD of the token-aspect pair (see Figures 4 and 5). For example, if the SRD threshold is set to 5, each context word whose SRD towards an aspect is no greater than the threshold is regarded as local context. Take the same review example mentioned above, "while the food is so good and so popular that waiting can really be a nightmare": for the aspect "food", its local context words are "while the [asp] is so good and so", where [asp] represents the aspect sequence, which may be composed of several words or tokens; in this local context sequence, [asp] means "food". LCF calculates the SRD as follows.
$$SRD_i = |i - P_a| - \left\lfloor \frac{m}{2} \right\rfloor$$

where $i$ and $P_a$ are the position of the contextual word and the central position of the aspect, respectively, and $m$ is the sequence length of the aspect. $SRD_i$ represents the SRD between the $i$-th contextual token and the specific aspect. LCF design models completely preserve the original features of aspects and their local context. Through the experiments and analysis, it is concluded that SRD is of great importance for LCF models.
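The SRD computation above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code: the function names and the exact rounding convention for the central position $P_a$ are assumptions.

```python
import numpy as np

def semantic_relative_distance(n, aspect_start, aspect_len):
    """SRD_i = |i - P_a| - floor(m / 2) for each token position i.

    n: context length; aspect_start: index of the first aspect token;
    aspect_len: m, the number of aspect tokens. P_a is taken as the
    central position of the aspect span (convention assumed here).
    """
    p_a = aspect_start + (aspect_len - 1) / 2.0          # central position P_a
    return np.array([abs(i - p_a) - aspect_len // 2 for i in range(n)])

def local_context_mask(srd, alpha):
    """True where a token belongs to the local context (SRD <= alpha)."""
    return srd <= alpha
```

For the example review with aspect "food" at position 2 and a threshold of 5, this marks the span "while the food is so good and so" as local context.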

Embedding Layer
The embedding layer is the basic layer of LCF design models. Each word and token will be mapped to a vector space through embedding layers. In LCF design, GloVe word embedding and the BERT-shared layer are alternatives for the embedding layer.

GloVe Word Embedding
LCF-GloVe adopts the pretrained GloVe word embedding to accelerate the learning process and attain better performance. Suppose $L \in \mathbb{R}^{d_e \times |V|}$ is the GloVe embedding matrix, where $d_e$ is the dimension of the embedding vectors and $|V|$ is the size of the vocabulary. Then each contextual word $w_i$ is embedded into a vector $v_i \in \mathbb{R}^{d_e}$.

BERT-Shared Layer
The BERT-shared layer is a pretrained language model for language understanding, and it can be regarded as an embedding layer. In order to achieve better performance, a fine-tuning learning process is necessary and indispensable. LCF-BERT adopts two independent BERT-shared layers to model local context features and global context features, respectively. For the local context and global context input representations $X^{l}$ and $X^{g}$, respectively, we have

$$O^{l}_{BERT} = BERT^{l}(X^{l}), \qquad O^{g}_{BERT} = BERT^{g}(X^{g})$$

where $O^{l}_{BERT}$ and $O^{g}_{BERT}$ are the output representations of the local and global context processors, and $BERT^{l}$ and $BERT^{g}$ are the corresponding BERT-shared layers modeling the local and global context, respectively.

Pre-Feature Extractor
The BERT-shared layer is powerful enough to capture context features. However, LCF-GloVe eschews the BERT-shared layer and adopts GloVe word embedding as its embedding layer. In order to improve the capability of the LCF design to learn semantic features, we design the Pre-Feature Extractor (PFE), which is composed of an MHSA layer and a Position-wise Convolution Transformation (PCT) [31] layer.

Multi-Head Self-Attention
Based on the self-attention mechanism, Multi-Head Self-Attention performs multiple attention functions in parallel to compute attention scores for each contextual word. Scaled Dot-Product Attention (SDA) is adopted as the identical attention function of each head, as it is faster and more efficient to compute. Suppose $X^{SDA}$ is the input representation embedded through the embedding layer. The definition of SDA is as follows:

$$SDA(X^{SDA}) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

$$Q = X^{SDA} W^{q}, \quad K = X^{SDA} W^{k}, \quad V = X^{SDA} W^{v}$$

$Q$, $K$, and $V$ are obtained by multiplying the output representation of the upper layer's hidden states by the respective weight matrices $W^{q}$, $W^{k}$, and $W^{v}$, which are trainable during the learning process. The dimensions $d_q$, $d_k$, and $d_v$ are equal to $d_h / h$, where $d_h$ is the dimension of the hidden layer and $h$ is the number of attention heads. The attention representations learned by each head are concatenated and transformed by multiplying by a matrix $W^{MH}$. In the LCF design, $h$ is set to 12. Suppose $H_i$ is the representation learned by the $i$-th attention head; then we have

$$MHSA(X) = \tanh\left([H_1; H_2; \ldots; H_h] \, W^{MH}\right)$$

where ";" denotes vector concatenation and $W^{MH} \in \mathbb{R}^{h d_v \times d_h}$. Additionally, a tanh activation function is deployed in the MHSA encoder to enhance the learning capability of the representations. Figures 4 and 5 illustrate the local context focus design during the encoding process.
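The MHSA encoder defined above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention with $h$ heads and the tanh output; the function and variable names are ours, and details such as weight initialization are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(X, Wq, Wk, Wv, Wmh, h):
    """Multi-head scaled dot-product self-attention with a tanh output.

    X: (n, d_h) token representations; per-head dim d = d_h // h.
    """
    n, d_h = X.shape
    d = d_h // h                                   # d_q = d_k = d_v = d_h / h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # (n, d_h) each
    heads = []
    for i in range(h):
        q = Q[:, i * d:(i + 1) * d]
        k = K[:, i * d:(i + 1) * d]
        v = V[:, i * d:(i + 1) * d]
        scores = softmax(q @ k.T / np.sqrt(d))     # scaled dot-product attention
        heads.append(scores @ v)                   # H_i, shape (n, d)
    H = np.concatenate(heads, axis=-1)             # [H_1; ...; H_h]
    return np.tanh(H @ Wmh)                        # tanh-activated output, (n, d_h)
```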

Position-Wise Convolution Transformation
Position-Wise Convolution Transformation (PCT) is adopted in the LCF design as a trick; according to the experimental results and analysis, it slightly improves the performance of the LCF design on the laptop dataset. The input representation of the PCT layer is the output representation of the MHSA encoder. The definition of PCT is as follows:

$$PCT(O) = \mathrm{ReLU}(O \ast W^{1} + b^{1}) \ast W^{2} + b^{2}$$

where ReLU is the ReLU activation function, $\ast$ denotes the convolution operation, $W^{1}$ and $W^{2}$ are the trainable weights of the two convolutional kernels, and $b^{1}, b^{2} \in \mathbb{R}^{d_h}$ are the bias vectors of the two convolutional kernels.
Then, the output representation of the PFE layer is generated as follows:

$$O_{PFE} = PCT(MHSA(X))$$

Feature Extractor
The LCF design deploys a Feature Extractor (FE) layer to learn features of the local context and global context. If only taking the local context into consideration, it would inevitably ignore the features of less-semantic-relative context words. In order to fully retain the features contained in the global context and learn the correlation between global context and aspect, LCF models take the global context features as a supplement to enhance LCF design.
The local context feature extractor is much different from global context feature extractor, for it contains a local context focus layer, while the global context feature extractor is only equipped with an MHSA encoder. To simplify the picture, we select two tokens to show their context-focused process with other tokens. After calculating the output of all tokens from the attention layer, the output features on each output position above the SRD threshold will be masked or weakened, while the output features of local context words will be completely retained. The features of the output position (POS) that the dotted arrow points to will be masked or weighted down and the features of the output position that the solid arrow points to will be completely preserved. The example of context word in this picture is "right". The input sequences of the LCF design are mainly based on the global context. LCF design models focus on local context by adopting local context focus layer. This paper implements two architectures to focus on local contexts CDM and CDW (see Figures 4 and 5). We apply MHSA encoders instead of CNN or RNN architecture due to the following considerations. On the one hand, MHSA is more powerful to capture context features. On the other hand, self-attention calculates correlation attention scores for every contextual word, and, according to the self-attention definition, the word itself holds the highest score on its corresponding output position generally. Table 1 is the algorithm flow of CDM and CDW mechanism. LCF design layers preserve the features of semantic-relative contextual words while masking or weighing the features of less-semantic-relative contextual words. So the less-semantic-relative context can participate in the encoding process and their passive influence are alleviated.

Dynamic Mask for Context Features
The CDM layer masks the less-semantic-relative context features learned by the PFE or BERT-shared layers. Although it would be easy to mask less-semantic-relative contextual words in the input sequence, doing so would completely discard their features. With the CDM layer deployed, only the features of a less-semantic-relative context word itself, at its corresponding output position, are masked; the correlative representations between less-semantic-relative context words and the aspect are preserved at the corresponding output positions.
All masked features are set to zero vectors, and another MHSA encoder is deployed to learn the masked context features. In this way, the LCF design alleviates the influence of less-semantic-relative contexts while preserving the correlation between each contextual word and the aspect. Suppose $O^{l}_{PFE}$ is the output representation of the local context feature extractor. CDM focuses on the local context by constructing a mask vector $V^{m}_{i}$ for each less-semantic-relative context word, from which we obtain the mask matrix $M$:

$$V^{m}_{i} = \begin{cases} E, & SRD_i \le \alpha \\ O, & SRD_i > \alpha \end{cases}$$

$$M = [V^{m}_{1}, V^{m}_{2}, \ldots, V^{m}_{n}]$$

$$O^{l}_{CDM} = O^{l}_{PFE} \cdot M$$

where $\alpha$ is the SRD threshold, $M$ is the mask matrix for the representation of the input sequence, and $n$ is the length of the input sequence including the aspect. $E \in \mathbb{R}^{d_h}$ is the ones vector and $O \in \mathbb{R}^{d_h}$ is the zeros vector. $O^{l}_{CDM}$ is the output of the CDM layer, and "$\cdot$" denotes the dot product operation of the vectors.
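The CDM operation amounts to zeroing the feature rows of positions whose SRD exceeds the threshold. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def cdm(O_local, srd, alpha):
    """Context-features Dynamic Mask: each output position is multiplied
    by the ones vector E (SRD <= alpha) or the zeros vector O (SRD > alpha).

    O_local: (n, d_h) features from the PFE/BERT-shared layer;
    srd: (n,) SRD of each token towards the aspect.
    """
    mask = (srd <= alpha).astype(O_local.dtype)[:, None]  # column of V^m_i scalars
    return O_local * mask                                  # position-wise O^l_PFE . M
```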

Dynamic Weighted for Context Features
In addition to the CDM layer, another architecture is implemented to focus on local context words: the Context-features Dynamic Weighted (CDW) layer. While the features of semantic-relative contextual words are fully preserved, less-semantic-relative context features are decayed by a weight; the features of contextual words far from the targeted aspect are reduced according to their SRD. CDW weights the features by constructing a weighting vector $V^{w}_{i}$ for each less-semantic-relative context word, yielding the weight matrix $W$ for an input sequence:

$$V^{w}_{i} = \begin{cases} E, & SRD_i \le \alpha \\ \dfrac{n - (SRD_i - \alpha)}{n} \cdot E, & SRD_i > \alpha \end{cases}$$

$$W = [V^{w}_{1}, V^{w}_{2}, \ldots, V^{w}_{n}]$$

$$O^{l}_{CDW} = O^{l}_{PFE} \cdot W$$

where $SRD_i$ is the SRD between the $i$-th contextual token and the specific aspect, $n$ is the length of the input sequence, and $\alpha$ is the SRD threshold. $O^{l}_{CDW}$ is the output of the CDW layer, and "$\cdot$" denotes the vector dot product operation.
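CDW can be sketched the same way as CDM, replacing the hard zero mask with a linear decay proportional to how far each token's SRD exceeds the threshold. Names are illustrative:

```python
import numpy as np

def cdw(O_local, srd, alpha):
    """Context-features Dynamic Weighted: features with SRD <= alpha are
    kept intact; the rest are scaled by (n - (SRD_i - alpha)) / n."""
    n = len(srd)
    w = np.where(srd <= alpha, 1.0, (n - (srd - alpha)) / n)
    return O_local * w[:, None]
```

Unlike CDM, distant tokens still contribute a (shrinking) fraction of their features rather than being removed outright.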
The output representation of the local context FE is attained based on the output of CDM or CDW:

$$O^{l} = MHSA(O^{l}_{CDM}) \quad \text{(for CDM layers)}$$

$$O^{l} = MHSA(O^{l}_{CDW}) \quad \text{(for CDW layers)}$$

Both output representations are denoted as $O^{l}$; the two layers are alternative and independent.

Global Context Features Extractor
In the global context FE, the output of the features learned by the MHSA encoder is as follows:

$$O^{g} = MHSA(O^{g}_{PFE})$$

where $O^{g}_{PFE}$ is the representation learned by the global context PFE layer.

Feature Interactive Learning Layer
The Feature Interactive Learning (FIL) layer is deployed to interactively learn the features of the local and global context. FIL first concatenates the representations $O^{l}$ and $O^{g}$, then projects them through a dense layer and applies an MHSA encoding operation:

$$O^{lg} = [O^{l}; O^{g}]$$

$$O^{lg}_{dense} = W^{lg} \cdot O^{lg} + b^{lg}$$

$$O^{lg}_{FIL} = MHSA(O^{lg}_{dense})$$

where $W^{lg} \in \mathbb{R}^{d_h \times 2 d_h}$ and $b^{lg} \in \mathbb{R}^{d_h}$ are the weights and bias vector of the dense layer, respectively; the MHSA encoder encodes $O^{lg}_{dense}$ and outputs the interactively learned features $O^{lg}_{FIL}$.
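The concatenate-and-project step of FIL is straightforward to sketch; the trailing MHSA encoder is omitted here for brevity, and the function name is our own.

```python
import numpy as np

def feature_interactive_learning(O_l, O_g, W_lg, b_lg):
    """Concatenate local and global features ([O^l ; O^g], shape (n, 2*d_h))
    and project back to d_h with a dense layer; in the full model an MHSA
    encoder then encodes the result."""
    O_lg = np.concatenate([O_l, O_g], axis=-1)   # (n, 2*d_h)
    return O_lg @ W_lg.T + b_lg                  # W^lg . O^lg + b^lg, (n, d_h)
```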

Output Layer
In the output layer, the representation learned by the feature interactive learning layer is pooled by extracting the hidden state at the position of the first token. Finally, a softmax layer is applied to predict the sentiment polarity:

$$Y = \mathrm{softmax}\left(W^{o} \, O^{lg}_{pool} + b^{o}\right)$$

where $O^{lg}_{pool}$ is the pooled first-token representation, $W^{o}$ and $b^{o}$ are the trainable weights and bias of the output dense layer, $C$ is the number of classes, and $Y$ is the sentiment polarity predicted by the LCF design model.
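First-token pooling followed by a softmax classifier can be sketched as below; the names and the explicit argmax step are our illustrative choices.

```python
import numpy as np

def predict(O_fil, W_o, b_o):
    """Pool the hidden state of the first token, apply the output dense
    layer, then softmax over the C polarity classes."""
    pooled = O_fil[0]                        # first-token hidden state
    logits = W_o @ pooled + b_o              # (C,)
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                      # softmax distribution Y
    return probs, int(np.argmax(probs))     # probabilities and predicted class
```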

Model Training
The LCF design includes LCF-GloVe and LCF-BERT; most of their architectures are identical except for the embedding layer and the PFE layer. For the LCF-GloVe model, the input sequence for both the local context processor and the global context processor is the whole review, e.g., "while the food is so good and so popular that waiting can really be a nightmare". For the LCF-BERT model, the input sequence for the local context processor is refactored to "[CLS]" + "while the food is so good and so popular that waiting can really be a nightmare." + "[SEP]", and the input sequence for the global context processor is the same as the input sequence of BERT-SPC, e.g., "[CLS]" + "while the food is so good and so popular that waiting can really be a nightmare." + "[SEP]" + [asp] + "[SEP]". Similar to LCF-GloVe, the MHSA encoders of LCF-BERT are independent. Moreover, the BERT-shared layers for the local context processor and the global context processor are independent.
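The two LCF-BERT input sequences described above can be sketched as plain string construction (a real implementation would tokenize; the helper name here is ours):

```python
def build_inputs(context, aspect):
    """Sketch of the LCF-BERT inputs: the local-context processor sees the
    plain review; the global-context processor uses the BERT-SPC
    sentence-pair form '[CLS] context [SEP] aspect [SEP]'."""
    local_seq = "[CLS] " + context + " [SEP]"
    global_seq = "[CLS] " + context + " [SEP] " + aspect + " [SEP]"
    return local_seq, global_seq
```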
The LCF design applies the cross-entropy loss function with $L_2$ regularization, and we define the loss function as follows:

$$\mathcal{L} = -\sum_{i=1}^{C} \hat{y}_i \log y_i + \lambda \sum_{\theta \in \Theta} \theta^{2}$$

where $C$ is the number of classes, $\hat{y}_i$ and $y_i$ are the ground-truth and predicted distributions, $\lambda$ is the $L_2$ regularization parameter, and $\Theta$ is the parameter set of the LCF model.
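The loss above is a standard cross-entropy plus an L2 penalty over all parameters, and can be sketched as:

```python
import numpy as np

def lcf_loss(y_true, y_pred, params, lam):
    """Cross-entropy between the one-hot label y_true and the predicted
    distribution y_pred, plus lambda * sum of squared parameters."""
    ce = -np.sum(y_true * np.log(y_pred))
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return ce + l2
```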

Datasets and Hyperparameters
Experiments are conducted on three ABSC benchmark datasets: the laptop and restaurant review datasets from SemEval-2014 [1] and the common ACL-14 Twitter social dataset introduced by [32]. According to the experimental results, the performance of the LCF design is significantly improved on all three topics (restaurant reviews, laptop reviews, and tweets), which indicates that the LCF design does not rely on specific datasets or corpora and is applicable to most topics. This follows from an original characteristic of natural language: when people comment on multiple aspects in a review, the context words used to judge an aspect are generally near that aspect rather than far from it. These datasets are adopted by most of the proposed models, are the most popular datasets of the ABSC task, and a large number of experiments have been carried out on them for comparison. The datasets provide labeled aspects along with the sentiment polarities of those aspects; all aspects are labeled with one of three sentiment polarities: positive, neutral, and negative. In the experiments, the datasets are used in their original form with no refactoring, and no conflicting labels are removed, which provides a better estimate of the real performance of the models. Table 2 shows the details of the three datasets. (Generally, both LCF-GloVe and LCF-BERT converge within three epochs during training.) To evaluate the precise difference in performance, the hyperparameters are kept consistent between the LCF-GloVe and LCF-BERT models except for the learning rate, because the BERT-shared layer requires a very small learning rate during fine-tuning [15]. For LCF-GloVe, the learning rate is set to $1 \times 10^{-3}$, and the hidden dimension and embedding dimension, $d_h$ and $d_e$, are set to 300. For LCF-BERT, the learning rate is set to $2 \times 10^{-5}$.
The hidden dimension and embedding dimension are set to 768 in LCF-BERT. For both LCF design models, the dropout rate is set to 0, the $L_2$ regularization coefficient is set to $1 \times 10^{-5}$, and the batch size is set to 16. In addition, the LCF models utilize the Adam optimizer [33].
All hyperparameters in this paper are supported by a large number of comparative experiments. Most of them follow the common hyperparameter settings for this task, such as the word embedding dimension and the learning rate of the LCF-GloVe model. We tried different SRD thresholds to find the optimal one for each dataset (see Table 5). The fine-tuning process of BERT is very sensitive to the learning rate; only a small learning rate can maximize BERT's performance, as illustrated in the original BERT paper. Through the experiments, we observed that a large batch size caused instability of regularization between layers and reduced the performance of the model, so the optimal batch size of 16 was adopted. A large dropout rate prolongs the convergence of the LCF-BERT model, and experimental results show that dropout has no obvious influence on the LCF design; therefore, after comprehensive consideration, the dropout rate is set to 0. For both LCF design models, accuracy and macro-F1 score are adopted to evaluate performance. Because experimental results tend to fluctuate, this paper reports the best experimental results for comparison.

Comparison Models
We evaluate the performance of the LCF design on three datasets and compare it with multiple baseline models. The results reveal that the LCF design can greatly improve on the state-of-the-art performance on all three datasets, especially the LCF-BERT model. The LCF design models are compared with the following models.
TD-LSTM [24] divides the input sequence into a left context and a right context with respect to a specific aspect, and two LSTM networks model the left and right context sequences together with the targeted aspect, respectively. Both target-dependent representations are processed by the corresponding LSTM networks and concatenated to predict the sentiment polarity of the targeted aspect.
ATAE-LSTM [12] implements an attention mechanism to help the model focus on the context most relevant to the targeted aspect. Meanwhile, ATAE-LSTM appends the aspect embedding to each word embedding, which strengthens the model by learning the hidden relation between context and aspect.
IAN [13] generates representations of the targeted aspect and the context with two LSTM networks, respectively, and learns these representations interactively. The interactive attention mechanism brings a considerable performance improvement.
RAM [25] improves on MemNet [34] by representing memory with BiLSTM networks. Meanwhile, a gated recurrent unit network is introduced to learn the features processed by the multiple-attention mechanism.
BERT-PT [30] BERT-PT explores a novel post-training approach on the BERT pretrained model to improve the performance of fine-tuning of BERT for RRC task. Additionally, the BERT-PT method can be adapted to ABSC task.
BERT-SPC [31] is the sentence-pair classification configuration of the pretrained BERT model. For the ABSC task, BERT-SPC constructs the input sequence as "[CLS]" + global context + "[SEP]" + [asp] + "[SEP]". Table 3 demonstrates the main experimental results. According to these results, the LCF design models perform well on all three benchmark datasets, especially the laptop and restaurant datasets. The LCF-BERT model attains an impressive improvement, outperforming the previous state-of-the-art: compared to the BERT-PT model, it improves performance by 3-4% on the laptop dataset and 2-3% on the restaurant dataset. However, compared to MGAN, the LCF-GloVe model achieves limited performance on the Twitter dataset. Further analysis found that the Twitter dataset is a social dataset containing many misspelled words and unknown tokens; moreover, plenty of tweets consist of informal as well as grammatically incorrect expressions, causing difficulty in extracting high-quality semantic representations. Generally, the CDM layer works well in the LCF-GloVe model, probably because the local context plays a more important role in feature extraction when MHSA has limited power to extract global context features. In addition, the CDW layer achieves remarkable performance in the LCF-BERT model, because the BERT-shared layer is more powerful at extracting and learning context features, including both the local and the global context. Table 3. Experimental results of performance. "Glo-CDM" and "Glo-CDW" indicate that both global context and local context participate in the learning process. We set superior SRD thresholds for the LCF designs (see Section 4.5.5). We use "-" to represent unreported experimental results. The top two scores are in bold.

[Table 3 lists, for each baseline and LCF model, Accuracy (%) and F1 (%) on the Laptop, Restaurant, and Twitter datasets; the table body is not reproduced here.]

The experiments show that the local context focus mechanism is applicable to self-attention and can achieve excellent results. However, when applied to other kinds of DNNs, the performance of the local context focus layer in RNNs (LSTM and GRU) is not optimal. At the same time, the local context focus mechanism provides a new approach to fine-grained aspect-level sentiment classification: it significantly reduces the influence of distant context within the self-attention mechanism and achieves state-of-the-art results on the three commonly used ABSC datasets. When applied to different ABSC models, the local context focus mechanism can bring significant performance improvements.

Analysis of LCF design Models
Both LCF-GloVe and LCF-BERT are equipped with the CDM and CDW local context focus layers, and the experimental results indicate that the LCF designs are very effective. Notably, CDM and CDW are feature-level operations (see Figures 4 and 5) rather than operations on the input tokens directly. Owing to the MHSA encoder, CDM and CDW can still involve less-semantic-relative tokens in the learning process while retaining the semantic features of semantic-relative tokens. Meanwhile, CDM and CDW attenuate the features of less-semantic-relative tokens at the corresponding output positions, which enables LCF design models to focus on the local context. Another significant point is that the LCF design also learns the features of the global context; it is owing to the interactive learning of local context and global context that the LCF design achieves such remarkable results (see Table 4).

Table 4. Results of LCF design ablations and variations. "Only" means that only the global context or the local context participates in the learning process. "w/o" means "without". The best scores of LCF-GloVe and LCF-BERT are at the top of the table. SRD thresholds for each variation are equal to the basic CDM/CDW design (see Table 5).
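The feature-level behavior of CDM and CDW can be sketched in NumPy as follows. This is a simplified illustration, not the paper's exact implementation: the per-token SRD values are assumed given, the function names are our own, and the CDW weight 1 − (SRD − α)/n follows the weighting idea described in the text.

```python
import numpy as np

def cdm(features, srd_vals, alpha):
    # Context-features Dynamic Mask: zero the output features of tokens
    # whose SRD exceeds the threshold alpha; local-context features
    # pass through unchanged.
    mask = (srd_vals <= alpha).astype(features.dtype)
    return features * mask[:, None]

def cdw(features, srd_vals, alpha):
    # Context-features Dynamic Weighted: instead of hard masking,
    # down-weight distant tokens in proportion to how far their SRD
    # exceeds alpha (n = context length).
    n = len(srd_vals)
    weights = np.where(srd_vals <= alpha, 1.0, 1.0 - (srd_vals - alpha) / n)
    return features * weights[:, None]

feats = np.ones((5, 4))                    # 5 tokens, hidden size 4
srd = np.array([2.0, 1.0, 0.0, 1.0, 2.0])  # assumed SRD of each token
masked = cdm(feats, srd, alpha=1)          # rows with SRD > 1 become zero
weighted = cdw(feats, srd, alpha=1)        # rows with SRD > 1 are scaled to 0.8
```

Both functions act on the MHSA encoder's output features, which is why distant tokens can still contribute to the representations of nearby tokens before being masked or down-weighted.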

Ablations and Variations of LCF Design
In order to analyze the importance of each LCF design layer, ablation and variation experiments are designed for LCF-GloVe as well as LCF-BERT. Experimental results are listed in Table 4.

Ablate Pre-Feature Extractor Layer
For LCF-GloVe, the pre-feature extractor (PFE) is deployed between the embedding layer and the feature extractor, since it enhances the performance of the GloVe embedding layer. The pre-feature extractor consists of MHSA and PCT layers. We ablate the pre-feature extractor layer to examine the performance of LCF without it.
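As a rough sketch of the PCT component of the pre-feature extractor (the MHSA part is omitted; the function name and weight shapes are illustrative, not the paper's exact parameterization), PCT applies two kernel-size-1 convolutions, i.e., per-token linear maps, with a ReLU in between:

```python
import numpy as np

def pct(x, W1, b1, W2, b2):
    # Point-wise Convolution Transformation: two kernel-size-1
    # convolutions (equivalent to per-token linear maps) with a
    # ReLU in between, applied independently at each position.
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU
    return h @ W2 + b2

seq = np.ones((4, 6))                  # 4 tokens, hidden size 6
W1 = np.full((6, 12), 0.1); b1 = np.zeros(12)
W2 = np.full((12, 6), 0.1); b2 = np.zeros(6)
out = pct(seq, W1, b1, W2, b2)         # shape (4, 6), same as the input
```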
LCF-GloVe (CDM/CDW) without PFE layers performs better than the baseline models on the twitter dataset, while its performance on the laptop dataset drops noticeably. On the restaurant dataset, LCF-GloVe (CDM/CDW) achieves roughly equal performance compared to the baseline models. The PFE layers are deployed to learn the features produced by the GloVe embedding layer: the MHSA encoder differs considerably from traditional encoders in that it relies on positional embedding, so the PFE layers are designed to preprocess the embedding representations and adapt them to the MHSA encoder. Table 4 indicates that the PFE layers are very significant for LCF-GloVe (CDM/CDW) on the laptop dataset.

Ablate CDM/CDW Layer
The CDM and CDW layers are the core architecture of the LCF design. We ablate the CDM and CDW layers and utilize only the global context features to predict sentiment polarities for the targeted aspects.
For the LCF-GloVe ablation experiment, LCF-GloVe with "only Glo" means that only global context features are captured and learned, and no CDM or CDW layer is deployed. This ablation achieves inferior performance on all three datasets, which indicates that the CDM and CDW layers are very effective for LCF design models. LCF-GloVe with "only CDM" achieves moderate performance on the three datasets, and its performance on the restaurant dataset equals the baseline model. Moreover, LCF-GloVe with "only CDW" attains worse performance on the three datasets than LCF-GloVe with "only Glo".
For the LCF-BERT ablation experiment, LCF-BERT with "only Glo" attains considerable performance compared to LCF-BERT with "only CDM" and LCF-BERT with "only CDW". However, on all three datasets there is still a performance gap between all three ablations and their baseline model.

Ablate Feature Interactive Leaning Layer
The Feature Interactive Learning (FIL) layer aims to assemble features and interactively learn the correlation between the local context and the global context. Concatenation and pooling can substitute for the FIL layer, but then there is no interactive learning process.
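A minimal sketch of the contrast between FIL and the concatenation-and-pooling substitute, assuming a parameter-free single-head attention in place of the full MHSA encoder (function names, shapes, and the projection matrix W are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

def attention(q, k, v):
    # Parameter-free scaled dot-product attention (single head),
    # standing in for the MHSA encoder inside FIL.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def feature_interactive_learning(local_feats, global_feats, W):
    # FIL sketch: concatenate local and global context features along
    # the hidden dimension, project back to the model size with a
    # linear map W, then let self-attention learn their interaction.
    fused = np.concatenate([local_feats, global_feats], axis=-1) @ W
    return attention(fused, fused, fused)

def concat_pool(local_feats, global_feats):
    # Ablation substitute: plain concatenation plus mean pooling,
    # with no interactive learning step.
    return np.concatenate([local_feats, global_feats], axis=-1).mean(axis=0)

rng = np.random.default_rng(0)
local_f = rng.normal(size=(5, 8))    # 5 tokens, hidden size 8
global_f = rng.normal(size=(5, 8))
W = rng.normal(size=(16, 8))         # projects concatenated features back to 8
out = feature_interactive_learning(local_f, global_f, W)   # shape (5, 8)
```

The key difference is that the attention step lets every position attend over the fused local/global representation, whereas the pooling substitute collapses the sequence without any cross-feature interaction.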
For LCF-GloVe-CDM without the FIL layer, performance on the restaurant and twitter datasets is close to the baseline models, and performance on the laptop dataset is slightly worse. LCF-GloVe-CDW without the FIL layer performs better on the restaurant dataset, but achieves inferior performance on the other two datasets.
LCF-BERT (CDM/CDW) without FIL attains inferior performance on all three datasets, which means the FIL layer is of great significance for the LCF design.

LCF-Ablations Analysis
According to Table 4, the performance of the LCF ablations is significantly reduced. Compared to the baseline models, the LCF-GloVe ablations of the PFE layer achieve limited performance on the three datasets, especially on the laptop dataset. Both LCF-GloVe and LCF-BERT attain inferior performance when only the global context or only the local context is considered, since LCF then loses significant features. We therefore design the FIL layer to interactively learn the features of the global context and the local context; without it, performance on the three datasets drops noticeably. For LCF-BERT, if only the local context is taken as input, the performance decreases by 2-3%. Overall, the experiments reveal that, for both LCF-GloVe and LCF-BERT, every component contributes a substantial improvement on the three datasets; each component of the LCF design is indispensable and effective.

LCF-GloVe Variations about SRD Threshold
For LCF-BERT, it is hard to run sufficient SRD variation experiments because the BERT-shared layer is not space-efficient: the number of parameters of LCF-BERT is approximately 2.2 × 10^8, whereas the number of parameters of LCF-GloVe is approximately 2.0 × 10^6. Therefore, α is set to 3 for the CDW and CDM designs on all three datasets.
In order to find the superior α for different LCF designs and datasets, a series of SRD-threshold experiments on LCF-GloVe is conducted to evaluate the best α for each situation. In these comparison experiments, the SRD threshold for the corresponding model and dataset ranges from 0 to 9; when the SRD threshold (α) is 0, the local context is the aspect itself. All parameters and hyperparameters are kept consistent with LCF-GloVe or LCF-BERT except for the SRD threshold. Due to the fluctuation of experimental results, many runs were carried out to find the best results for comparison.
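The role of the SRD threshold can be illustrated with a simplified token-level SRD (distance to the nearest token of the aspect span; the paper's exact SRD definition may differ): with α = 0 the local context collapses to the aspect itself, and larger α widens the window.

```python
def local_context(tokens, aspect_start, aspect_end, alpha):
    # Keep the tokens whose (simplified) SRD to the aspect span
    # is at most the threshold alpha.
    kept = []
    for i, tok in enumerate(tokens):
        if i < aspect_start:
            d = aspect_start - i   # distance to the aspect's first token
        elif i > aspect_end:
            d = i - aspect_end     # distance to the aspect's last token
        else:
            d = 0                  # inside the aspect span
        if d <= alpha:
            kept.append(tok)
    return kept

tokens = "while the food is so good".split()
# aspect "food" occupies index 2
print(local_context(tokens, 2, 2, 0))  # → ['food'] (alpha = 0: the aspect itself)
print(local_context(tokens, 2, 2, 2))  # → ['while', 'the', 'food', 'is', 'so']
```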
In fact, the SRD threshold is not very sensitive for the LCF-GloVe-CDW design, and its performance on the three datasets fluctuates only slightly (our extra experiments show that the CDW and CDM designs also perform well and stably on LCF-BERT) (see Figures 6-8). The LCF-GloVe CDM design is stable across the three datasets, and the superior α for each situation can easily be found (see Table 5).

Conclusion and Future Works
This paper proposes a new view: the local context words of a specific aspect are more relevant to that aspect. LCF designs focus on the local context while learning global context representations in parallel. This paper introduces SRD to assist in locating the local context of each aspect; with the local context of targeted aspects supervised, LCF models work more stably and precisely in aspect-level sentiment classification. Both GloVe word embeddings and the BERT-shared layer are utilized to improve the performance of the LCF designs. With CDM and CDW applied, LCF-BERT achieves new state-of-the-art performance on the three ABSC datasets. In future work, the SRD calculation can be further improved by considering extra auxiliary information. Besides, the transferability of the CDM and CDW designs will be evaluated to determine whether they can improve the performance of other models based on self-attention.

Case Analysis
In this section, we pick two samples, one from the restaurant dataset and one from the laptop dataset, for case analysis: "I love the black roasted codfish, it was the best dish of the evening" and "Lots of extra space but the keyboard is ridiculously small", labeled sample-1 and sample-2, respectively. Sample-1 contains only one aspect, while sample-2 contains two aspects. The α is set to 3 for both samples.

One-Aspect
The polarity of the aspect "dish" is predicted by the LCF-ablated models, and Table 6 shows the predicted results. Figures 9 and 10 visualize the CDM and CDW processes for the aspect "dish" within sample-1.

Table 6. Predictions of three aspects.

Results in Table 6 show that the local context focus mechanism performs well on one-aspect samples; almost none of the LCF-ablated models make incorrect predictions.

Multi-Aspect
The sentiment classification of multi-aspect samples is more complex than that of one-aspect samples, since each aspect may carry a different sentiment polarity; moreover, it is important to alleviate the negative influence of less-semantic-relative context words during the learning process. Figures 11-14 visualize the CDM and CDW processes for the aspects "space" and "keyboard" within sample-2. The polarities of these two aspects are predicted by the LCF-ablated models, and Table 6 shows the predicted polarities. Most of the LCF-ablated models give correct predictions for both "space" and "keyboard", and Table 6 shows that the predictions of LCF-BERT are more accurate and reasonable than those of LCF-GloVe.
Equipped with the local context focus mechanism, LCF design models can better predict the sentiment polarities of aspects through the interactive learning of local context features and global context features, instead of making predictions merely relying on local context features or global context features.
While suppressing the interference of distant context words on the prediction, the mechanism also retains the long-term dependencies between the aspect and the local context words. Experiments show that this significantly improves the prediction accuracy of the model and refreshes the best performance on the three commonly used ABSC datasets.