Deep Learning Application to Ensemble Learning—The Simple, but Effective, Approach to Sentiment Classifying

Abstract: Sentiment analysis is an active research area in natural language processing. The task aims at identifying, extracting, and classifying sentiments from user texts in blog posts, product reviews, or social networks. In this paper, an ensemble learning model for sentiment classification is presented, called CEM (classifier ensemble model). The model draws on various data feature types, including language features, sentiment shifting, and statistical techniques. A deep learning model with word embedding representation is adopted to address explicit, implicit, and abstract sentiment factors in textual data. Experiments conducted on different real datasets found that our sentiment classification system is better than traditional machine learning techniques, such as Support Vector Machines, other ensemble learning systems, and the deep learning model Long Short-Term Memory network, which has shown state-of-the-art results for sentiment analysis on almost all corpora. Our model's distinguishing strength is its effective application to different languages and different domains.


Introduction
Sentiment classification, or opinion mining, is a subfield of natural language processing, information retrieval, and text mining that is used to extract a person's impressions of or thoughts about something from unstructured text data. This research domain has drawn the interest of not only scientists but also businesses and organizations worldwide. The ability to classify sentiment has a tremendous practical impact because it helps businesses save the expense of human resources needed to determine customers' needs, while helping customers choose products and services better suited to their necessities. Li and Liu's [1] survey reported that more than 80% of internet users search at least once for reviews about a product they intend to buy before making their decision.
The problem of sentiment classification was raised by Dave et al. [2] and Nasukawa and Yi [3] in the early 2000s. Since then, many research studies have been conducted to classify and evaluate reviews about products and services in media and blog posts; these can be classified into three levels of interest: (i) document level; (ii) sentence level; and (iii) aspect level. At the first level, Tang et al.'s model [4], based on a deep learning approach, and Xia et al.'s ensemble learning model [5] should be mentioned. At the sentence level, Marcheggiani et al. [6] and Yang and Cardie [7] proposed models based on conditional random fields (CRFs). However, the aspect level has received the most consideration across various publications. Notable here are Chinsha and Joseph's [8] and Tran et al.'s [9] work, which proposed a syntactic approach using dependency grammar.
Techniques used for classifying sentiments can be put into three main groups: machine learning, comprising Pang et al. [10], Riaz et al. [11], and Wang et al. [12]; lexicon-based approaches, with Turney et al. [13], Muhammad et al. [14], and Khan et al. [15]; and hybrids of machine learning and lexicons, with Balahur et al. [16] and Keshavarz and Abadeh [17]. Machine learning techniques comprise supervised machine learning, as in Severyn et al. [18], semi-supervised machine learning, as in Hajmohammadi et al. [19], and unsupervised machine learning, as in Claypo and Jaiyen [20]. Lexicon-based techniques comprise three groups, with some notable works as follows: a dictionary approach with Saif et al. [21], a corpus approach with Vulic et al. [22], and an integration of the two with Taboada et al. [23].
In the last two decades, machine learning methods have dominated the majority of sentiment analysis tasks. Since feature representation greatly influences the performance of a machine learning algorithm [24], much research has focused on engineering effective features by hand, using domain expertise and ad hoc techniques. However, this work can instead be completed using representation learning algorithms, such as the deep learning approach, which automatically discovers and explains text representations from data. Deep learning has emerged due to its ability to represent data at various hierarchical levels. Wu et al. [25] and Zhao et al. [26] are two remarkable examples of this approach.
With regard to valence shifters in sentiment analysis, users often produce reviews about subjects with varying sentiment levels. Thus, the sentiment value of a phrase can be affected by its context, which is called polarity shifting (or valence shifting) [27]. Polarity shifting involves complex language structures, consisting of negation, contrast, intensifying, and diminishing structures [23]. Polarity shifting can make traditional approaches, such as machine learning with the Bag-of-Words (BoW) model, ineffective because these approaches only consider whether single words bear positive or negative polarity, based on a predetermined sentiment lexicon. By contrast, techniques for sentiment classification that address polarity shifting require an analysis of a phrase's structure and semantics [28,29,30,31].
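As a small illustration of why a pure BoW representation is blind to polarity shifting, the following sketch (our own toy example, not from the paper) builds word multisets for two sentences of opposite sentiment:

```python
from collections import Counter

def bow(text):
    # Lowercase, strip simple punctuation, and count words as a multiset.
    return Counter(text.lower().replace(".", "").replace(",", "").split())

a = bow("The movie was good, not boring.")
b = bow("The movie was boring, not good.")
# The two reviews carry opposite sentiment, yet their bags of words are
# identical, so any BoW-based classifier must give them the same prediction.
print(a == b)  # → True
```

This is exactly the weakness that sentence-structure analysis of polarity shifting is meant to address.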
Xia et al. [5] used four classifiers (two baseline machine learning classifiers and two statistical classifiers) with four sub-datasets comprising various valence-shifting structures. Oscar et al. [32] proposed an ensemble of sentiment classifiers in which several baseline classifiers trained with different types of features were combined. The authors adopted deep learning to automatically produce features for the classifiers.
This paper adopts ensemble learning to classify sentiment at the document level, inspired by [5,32]. We extract various features from the datasets for the base learners by identifying the structures that cause polarity shifting in text; we call these 'surface features'. We also use word embedding and deep learning to extract other features, which we call 'deep features'. The proposed system was built and experiments were conducted on datasets to check the system's performance in Vietnamese and English. A comparison with other machine learning approaches showed that the results of the proposed system were better than even state-of-the-art deep learning models and other ensemble learning systems. The experimental results also show that taking 'deep features' into consideration for the base learners improves the effectiveness of the system.
The following are the contributions of this research:

• We propose an effective ensemble learning system whose base-learner datasets comprise features obtained by exploring language characteristics and applying a deep learning model.

• We adopt word embedding and develop a deep learning model for base learners that helps improve the system's effectiveness.

• The proposed model proved appropriate for the Vietnamese language, and also yielded adequate results for the various English datasets.
Appl. Sci. 2019, 9, 2760

The remainder of the paper is organized as follows: Section 2 presents related existing work, Section 3 presents our proposed model, Section 4 describes the experiments and evaluations, and Section 5 presents the conclusion and introduces directions for future research.

Related Work
Polarity shifting occurs when a phrase's sentiment value changes according to a specific context [27]. Early machine learning methods failed to take account of the influence of negation structures and other polarity shifting structures. For instance, for early machine learning, the two sentences 'The hotel is very nice but the price is high.' and 'The hotel is very nice, the price is high.' are likely to be classified into the same class because they contain the same sentiment-bearing words, 'nice' and 'high'. To overcome this issue, recent work [33] has used sequence mining to extract polarity shifting patterns that invert, decrease, or eliminate polarity. Using a hybrid of different techniques, SO-CAL (Semantic Orientation CALculator) [23] was one of the first systems to process polarity shifting using rule models and sentiment vocabulary labeled in sentiment dictionaries, and [34,35] used dependency grammar to define syntactic rules that identify the influence of each negation structure and other polarity shifting structures.
Ensemble learning is a strong machine learning paradigm that is well suited to classification problems involving many learners; the ability of an ensemble learning model to generalize is much better than that of a single learning model [36]. Ensemble learning is applicable in various domains, including bioinformatics [37], finance [38], and healthcare [39]. The latest research indicates that ensemble learning models can be applied to the sentiment classification problem. Table 1 shows the relevant works conducted over the last ten years that have applied ensemble learning to sentiment classification. Regarding the use of deep learning for sentiment classification, this approach has recently been recognized as a strong machine learning model and has produced advanced results in the various domains to which it has been applied, from computer vision and speech processing to natural language processing [44]. The application of deep learning to sentiment classification has also become more popular. Some recent research studies on sentiment classification that employ deep learning are listed in Table 2.

Architecture
The system's input is a training dataset comprising labeled texts classified into positive and negative classes. These texts pass through a preprocessing component to be standardized (correcting spelling errors and abbreviations, discarding stop words) and split into appropriate sentences or clauses. After the preprocessing stage, each text's sentences are classified by feature-extracting components according to polarity shifters into negation, contrast, inconsistency, and no_shift sentences. These sub-datasets are processed separately and used to train base learners. In addition, another base learner is applied to the entire dataset. The datasets are also used for training a deep learning model with the word embedding representation Word2Vec [49]. Finally, the base learners' results are integrated through ensemble learning. Figure 1 describes the whole process.
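The pipeline just described can be sketched in highly simplified form as follows. The stub learners below are hypothetical stand-ins for the real base learners (Logistic Regression, SVM, LSTM), and simple probability averaging stands in for the actual ensemble combination:

```python
# Hypothetical skeleton of the ensemble: each base learner maps a text to a
# positive-class probability, and a combiner merges their outputs.
def lexicon_learner(text):
    # Toy lexicon-based stand-in for a surface-feature base learner.
    pos, neg = {"nice", "good"}, {"expensive", "bad"}
    toks = text.lower().split()
    score = sum(t in pos for t in toks) - sum(t in neg for t in toks)
    return 1.0 if score > 0 else 0.0

def deep_stub(text):
    # Constant stub standing in for the deep-feature (LSTM) learner.
    return 0.5

def ensemble_predict(text, learners):
    probs = [f(text) for f in learners]
    avg = sum(probs) / len(probs)          # combiner: simple averaging
    return "positive" if avg >= 0.5 else "negative"

print(ensemble_predict("good nice hotel", [lexicon_learner, deep_stub]))
# → positive
```

In the real system, each base learner is trained on its own polarity-shift sub-dataset, and the combination is learned rather than a fixed average.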


Building Training Datasets for Base Learners
Learning feature representations used to build datasets for the base learners is the key task in applying ensemble learning. In this paper, extraction of the following feature types was considered:

• Features of the 'surface feature' type: concerning polarity shifting, similar to Rui Xia et al.'s approach [5] (which proposed a feature-extraction technique based on rules and a statistical method focused on discovering polarity shifting cases), we define extraction rules according to language characteristics. A text's sentences and clauses are checked for polarity shifting using negation, contrast, and inconsistency identification techniques. The results are placed into the corresponding training dataset.

• Features of the 'deep feature' type: text data has a complex structure, so we need effective automatic extraction methods for the system to estimate as well as possible. In traditional machine learning, the feature-extraction task is designed and standardized by hand, which is a weakness of previous machine learning methods. Deep learning exploits a multilayer processing architecture for learning representations of data components, with each layer representing a different level of abstraction. We therefore choose a deep learning model to extract features of the 'deep feature' type.

Extracting 'Surface Feature'
Weighted log-likelihood ratio statistics for sentiment word classification: There are numerous methods of classifying sentiment words. We apply the weighted log-likelihood ratio (WLLR) statistical method proposed in Rui Xia et al. [5]. The WLLR measures a word t_i's correspondence to class c_j through Formula (1):

r(t_i) = p(t_i, c_j) × log(p(t_i, c_j) / p(t_i, c̄_j))    (1)

where p(t_i, c_j) is word t_i's probability in class c_j and p(t_i, c̄_j) is word t_i's probability in the class other than c_j.

• If r(t_i) > 0, the word is introduced into the positive sentiment word set, labeled with the measurement r(t_i), and ranked according to its measurement.

• Otherwise, the word is introduced into the negative sentiment word set, labeled with the measurement |r(t_i)|, and ranked according to its measurement.
Based on the word ranking order, we build pairs of opposite-polarity sentiment words. These pairs are applied in the negation elimination process.
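To make the WLLR-based word classification concrete, here is a hedged Python sketch; the toy corpus and add-one smoothing below are our own illustrative choices, not taken from the paper:

```python
import math
from collections import Counter

def wllr(word, pos_docs, neg_docs):
    """Weighted log-likelihood ratio of `word` toward the positive class,
    in the spirit of Formula (1); add-one smoothing avoids division by zero."""
    def prob(w, docs):
        counts = Counter(t for d in docs for t in d.lower().split())
        total = sum(counts.values())
        return (counts[w] + 1) / (total + len(counts) + 1)
    p_pos = prob(word, pos_docs)
    p_neg = prob(word, neg_docs)
    return p_pos * math.log(p_pos / p_neg)

pos = ["great hotel great view", "great staff"]
neg = ["bad room bad smell", "noisy bad"]
print(wllr("great", pos, neg) > 0)  # positive word → positive score
print(wllr("bad", pos, neg) < 0)    # negative word → negative score
```

Ranking words by this score, and pairing high-scoring positive words with high-scoring negative words, yields the polarity pairs used later for negation elimination.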
WLLR statistics are also used to identify sentiment-inconsistent sentences, as Formula (2) indicates: if h(s_i) < 0, s_i is an inconsistency sentence; otherwise, it is a no_shift sentence, where h(s_i) aggregates the WLLR scores of the words in sentence s_i with respect to the text's overall polarity.

Features forming the negation dataset: The negation structure is the most common structure in polarity shifting. Table 3 demonstrates this structure's frequent occurrence in the dataset; for example, the word "không" (not) occurs 9778 times among the 3,829,253 words of the dataset of Vietnamese hotel reviews. Negation structures are identified by checking for the occurrence of words such as "không" (not), "chẳng" (no), and "chả" (don't) in sentences. The identified sentences are put into the set D_negation, comprising negation sentences. After locating the negation word's position in a D_negation sentence, the negation word is removed. The first sentiment word following the removed negation word is replaced with a word bearing the opposite sentiment according to Formula (2). Sentiment words that follow are likewise replaced if they carry the same sentiment as the first one.
Example: "I do not like this hotel!" will be replaced with "I dislike this hotel!"

Table 3. Statistics of some negation words' occurrences in the Vietnamese hotel-review corpus.

Shifters and occurrences in the corpus: không (not), 9778; chẳng (no); chả (don't).
Features forming the contrast dataset: The contrast structure is also common in polarity shifting. Table 4 demonstrates the considerable frequency of the word "nhưng" (but), which occurs 3728 times among the 3,829,253 words of the hotel-review dataset. Contrast words are divided into two groups: the first, called fore-contrast, includes "nhưng" (but) and "tuy" (however); the second, called post-contrast, includes "mặc_dù" (although) and "dù" (though). If a fore-contrast word occurs in a sentence, the polarity shifting takes place in the phrase preceding it; in the case of post-contrast, the sentence containing the post-contrast word is itself shifted. The contrast sentences are put into the set D_contrast.
Example: "Khách sạn rất đẹp, vị trí thuận lợi tuy nhiên giá hơi đắt." (The hotel is very nice and its location is good, but the price is quite expensive). The polarity shifting occurs in the phrase "Khách sạn rất đẹp, vị trí thuận lợi" (The hotel is very nice, its location is good).

Features forming the inconsistency dataset: Sentiment-inconsistency sentences are those which do not exhibit grammatical polarity shifting but contrast with the sentiment expressed by the whole text. This inconsistency arises from human language phenomena such as implicit, ironic, and satirical sentences. Inconsistency sentences can be identified with the WLLR by evaluating every word in the text; each sentence is then evaluated for polarity shifting with Formula (2).
Relying on the evaluated value, one of the two following decisions is made:
• If h(s_i) < 0, the sentence is put into the set D_inconsistency containing inconsistency data;
• If h(s_i) ≥ 0, the sentence is put into the set D_no_shift containing unshifted data.
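A minimal sketch of this routing decision, assuming precomputed per-word WLLR scores (the scores below are invented for illustration, not computed from the paper's corpus):

```python
# `wllr_scores` stands in for word scores precomputed with Formula (1):
# positive values lean positive, negative values lean negative.
wllr_scores = {"great": 1.2, "clean": 0.8, "dirty": -1.5, "noisy": -0.9}

def h(sentence, doc_label):
    """Positive h: the sentence agrees with the document's label;
    negative h: the sentence conflicts with it (Formula (2) sketch)."""
    sign = 1 if doc_label == "positive" else -1
    return sign * sum(wllr_scores.get(w, 0.0) for w in sentence.lower().split())

def route_sentence(sentence, doc_label):
    return "D_inconsistency" if h(sentence, doc_label) < 0 else "D_no_shift"

# A negative-sounding sentence inside a positive review is flagged inconsistent.
print(route_sentence("the street outside was noisy and dirty", "positive"))
# → D_inconsistency
```

This makes explicit how a statistical score, rather than a grammatical cue, drives the split into D_inconsistency and D_no_shift.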
Table 4. Shifters and their occurrences in the corpus: mặc_dù (although); tuy (however); nhưng (but), 3728.
Features in the full dataset: In addition, we also use the entire preprocessed dataset, named D_full, for another base learner.
Following the polarity-shifting classification strategy presented above, we have two combined methods for identifying polarity shifting: (1) identifying polarity shifting by a rule-based method, building rules and a dataset of words, phrases, and representative structures that cause polarity shifting, in order to identify and remove polarity shifting in sentences; and (2) identifying polarity shifting by a statistical method, using the WLLR to predict the likelihood of polarity shifting in sentences. The latter technique is appropriate for identifying inconsistent sentences, as well as for building training models applicable to different domains and languages.
The identification of 'surface features' is described in Figure 2. First, the data are preprocessed. In the next stage, handcrafted features are created using a tokenizer and Part-of-Speech taggers, along with valence shifter indicators, as inputs for the machine learning algorithm. The goal is to produce the most accurate results with the 'surface features' that have the highest predictive power.

Extracting Features of 'Deep Feature' Type
Deep learning is a subset of machine learning that relies on learning multiple layers of data representation, each of which automatically transforms the representation at one level into a representation at a higher, more abstract level. The learned representations can naturally be used as features; we call these 'deep features'. Many deep learning models in natural language processing use word embedding (word vector) input features [50], a technique for learning dense word representations in a vector space of modest dimension. Every word is regarded as a point in this space and represented by a vector of fixed length. These vectors can capture a language's rules and characteristics. Among word embedding models trained on raw text, Word2Vec is particularly effective and widely used. We feed the Word2Vec representations of the training dataset into a Long Short-Term Memory (LSTM) network. The LSTM model was introduced by Hochreiter and Schmidhuber [51] and later improved by Gers et al. [52]. The LSTM has a structure similar to a Recurrent Neural Network (RNN) [53]; however, instead of a single neural network layer, an LSTM state has four layers. The main idea of the LSTM is that each state contains a forget gate that decides whether previously learned information may be used at the current step. Numerous deep learning models extending the LSTM have been proposed, but the classic LSTM still remains a strong baseline [54].
The samples' highest-confidence positive/negative class probabilities were chosen as features for the meta-learner within the ensemble learning classifier; these are referred to as 'deep features'. The 'deep feature' identification process is described in Figure 3. First, the text data are preprocessed. Then the data are converted into dense vectors using embedding techniques such as Word2Vec. Next, the dense vectors are fed into a deep learning model. The goal is to produce the most accurate results with the 'deep features' that have the highest predictive power.
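To make the four-layer gate structure concrete, here is a minimal NumPy sketch of a single LSTM step under random toy weights; the dimensions and weights are illustrative, not the trained model used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: the four gate layers (forget, input, candidate, output)
    decide how much previously learned information is kept."""
    z = W @ x + U @ h_prev + b        # stacked pre-activations, shape (4n,)
    n = h_prev.shape[0]
    f = sigmoid(z[0*n:1*n])           # forget gate
    i = sigmoid(z[1*n:2*n])           # input gate
    g = np.tanh(z[2*n:3*n])           # candidate cell state
    o = sigmoid(z[3*n:4*n])           # output gate
    c = f * c_prev + i * g            # new cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c

d, n = 8, 4                           # embedding dim, hidden size (toy values)
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d)):     # 5 word-embedding vectors, e.g. Word2Vec
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # → (4,)
```

The final hidden state `h` plays the role of the learned 'deep feature' representation that is passed onward to the classifier.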

Extracting Features of 'Deep Feature' Type
Deep learning is a subset of machine learning that depends on learning different multilayers representing of data, each of which automatically transforms the representation at one level into a representation at a higher and more abstract level.The learned representations can be naturally used as features; we call these 'deep features'.Many deep learning models of natural language processing have used input features of word embedding (word vector) [50]-a technique of dense information word learning in a vector space of dense dimensions.Every word is regarded as a point in this space and represented by a vector of constant length.These vectors can represent a language's rules and characteristics.Of word embedding learning models on raw text, Word2Vec is a particularly effective model and usually called for.We take the training dataset of Word2Vec type for inputs into the Long Short-Term Memory (LSTM) network.The LSTM model was introduced by Hochreiter and Schimidhuber [51] and then improved by Gers et al. [52].LSTM has a similar structure to a Recurrent Neural Network (RNN) [53].However, instead of only being a one-layer neural network, a state in an LSTM has four layers.The main idea of LSTM is that in each layer there will be a forget gate to decide whether to allow the previously learned information to be used for the current layer or not.Numerous deep learning models that extend the LSTM have been proposed.but the classic LSTM still remains a strong baseline [54].


Base learners and Meta-Learner
The classic machine learning techniques Logistic Regression [55] and Support Vector Machines [56] are used to train the datasets composed of 'surface feature' types, namely D_negation, D_contrast, D_inconsistency, D_no_shift, and D_full. They are highly regarded techniques for text classification in general and sentiment classification in particular. Along with them, a deep learning model is chosen to train the full dataset (D_full) in order to identify 'deep feature' features for ensemble learning.
Each base learner outputs every sample's probability of belonging to the negative and positive classes. These probabilities are used as training data for the ensemble stage. Regarding ensemble learning, two models are used to blend the base classifiers' results [57]: (1) the Fixed Rule model, which uses fixed rules to choose inputs for ensemble learning and bases the ensemble's result on a majority of the classifiers' outputs; and (2) the Meta-Classifier model, in which the classifiers' results are taken as features for the ensemble learning model. In this paper, we used the Meta-Classifier model with the Logistic Regression technique. Figure 4 describes the architecture of ensemble learning using the Meta-Classifier model, whose inputs are each sample's probability of belonging to the positive and negative classes as output by the base learners. The whole process is described in Algorithm 1.

Algorithm 1. Algorithm of classifier ensemble model (CEM)
Input: sub-datasets D_negation, D_contrast, D_inconsistency, D_no_shift, and D_full
# train a deep learning base learner bl6 on dataset D_full with Word2Vec representation.
Output:
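The stacking scheme of Algorithm 1—base learners emitting class probabilities that a Logistic Regression meta-learner then blends—can be sketched as follows. This is a minimal pure-numpy illustration with simulated base-learner outputs; the toy data, the gradient-descent logistic regression, and all function names are our assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Minimal logistic regression via gradient descent (bias folded in)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return sigmoid(Xb @ w)

# Stand-in for the six base learners (negation, contrast, inconsistency,
# no_shift, full, and the LSTM 'deep feature' learner), each emitting
# P(positive) per sample. Here: noisy views of a toy ground-truth label.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200).astype(float)
base_probs = np.column_stack(
    [np.clip(y + rng.normal(0, 0.4, size=200), 0, 1) for _ in range(6)]
)

# Meta-learner: logistic regression stacked on the probability vectors.
w = train_logreg(base_probs, y)
meta_pred = (predict_proba(w, base_probs) > 0.5).astype(float)
accuracy = (meta_pred == y).mean()
```

The meta-learner effectively learns how much to trust each base learner, which is the point of the Meta-Classifier model over fixed majority voting.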

Vietnamese Language
Dataset: We experimented on two datasets containing students' reviews about their university (UIT-VSFC) [58] and reviews about hotels in Vietnam (HOTEL-Reviews). The hotel reviews were posted by users on mytour.vn from 2 August 2010 to 29 June 2017. Review data were preprocessed to remove abbreviations, social network slang, signs, logos, etc. Details about the two datasets are given in Table 5. We used a 50-50% train-test split.

Models contributing to experiment process:
We compared our model with other strategies, including the traditional SVM classification method and the deep learning model LSTM, as follows: • SVM: sentiment classification using the classic machine learning method Support Vector Machines with the bag-of-words model and unigram features. With "contrast", "inconsistency", "negation", "no-shift", and "full" as the names of the sub-datasets described in Section 3.2.1, the Logistic Regression technique is adopted for all base learners.
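The bag-of-words unigram representation used by the SVM baseline amounts to mapping each unigram to a column and counting occurrences per document. The following is a minimal pure-Python sketch under that assumption (the toy hotel-review strings are illustrative, not from the corpus):

```python
from collections import Counter

def build_vocab(docs):
    """Map each unigram in the corpus to a column index (insertion order)."""
    vocab = {}
    for doc in docs:
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def bow_vector(doc, vocab):
    """Unigram count vector for one document."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

docs = ["phòng sạch và đẹp", "phòng không sạch"]  # toy hotel reviews
vocab = build_vocab(docs)
X = [bow_vector(d, vocab) for d in docs]
```

The resulting count vectors are what a linear SVM would consume as its feature matrix.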

Experiment results:
Experiment results calculated according to accuracy ratio are shown in Table 6 and Figures 5 and 6.


English Language
Dataset: We experimented on the corpus proposed by Blitzer et al. [59], covering four domains (Electronics, DVD, Books, and Kitchen), each containing 1000 reviews labeled positive and 1000 labeled negative. These datasets were used for fair comparison with the two other approaches listed below. We used a 90-10% train-test split.
Models contributing to the experiment process: As for the Vietnamese language, the following models contributed: • SVM: sentiment classification using the classic machine learning method Support Vector Machines with the bag-of-words model and unigram features.

Experiment results:
Experiment results, calculated according to accuracy ratio, are shown in Table 7 and Figures 7-10.


Evaluation
Based on the experiment results for the Vietnamese language, we have come to the following conclusions:

• Models that exploit 'deep features', such as CEM(6C-LR), achieve better results than the other models on both experimental datasets, especially compared with the four-base-learner and five-base-learner strategies and the baseline machine learning method SVM.

• Ensemble learning over training sets containing 'deep features' together with polarity-shifting features of the 'surface feature' type yields better classification results than the state-of-the-art deep learning model for sentiment classification (LSTM). The 'deep features' were the most effective features for classifying sentiment in the ensemble system.

• Dataset size also affects every method's effectiveness. With limited data (HOTEL-Reviews), SVM remains an effective text classification method compared with LSTM or our proposed model, CEM(6C-LR).

Our dataset size for the English language is limited; therefore, we chose a multilayer perceptron (MLP) model instead of LSTM, since the original LSTM performs poorly on small datasets. The experiment results show that our proposed model, CEM(6C-LR), attains higher classification effectiveness than the other methods, particularly the PSDEE method of Xia et al. [5] and the LSS method of Shoushan et al. [43]. Figures 11 and 12 show the average accuracy of each base learner in the method of Xia et al. [5] and in the proposed model.

The method of Xia et al. used a statistical approach to predict test labels in the inconsistency learner and the no-shift learner; the average accuracy of these two learners was only 70%, with a final accuracy of 82.43%. The proposed model used the baseline method for the inconsistency learner and the no-shift learner, which provided an average accuracy of 76%. By adding a deep learning learner, the proposed system achieved a final accuracy of 84.62%. The learners that contributed most to the system's accuracy were the full learner, the no-shift learner, and the deep learning learner, corresponding to the full feature set, no-shift features that did not cause sentiment shifts, and 'deep features'.

The WLLR statistical model did not perform well with the Vietnamese language. Its classification of polarized words was insufficient, and data alone cannot increase the accuracy of polarized words. This approach revealed its drawbacks when dealing with grammar that has a complex semantic structure, such as Vietnamese. A well-known example error was classifying the word pair "không_thích" (don't like) and "ghét" (hate) into the same rank, whereas in English "don't like" can replace "dislike." This can be addressed by replacing the WLLR method with the sentiment dictionary proposed by Tran et al. [31] to rank pairs of sentiment-bearing words.

The reasons our proposed method achieves good results can be summarized as follows:

• Our model captures various cases of sentiment shifts and introduces an appropriate treatment for each. Improving on the approach of Rui Xia et al., we built additional base learners for various sub-datasets, which improved system performance.

• Our model uses deep learning to automatically learn implicit features as input to the meta-learner.

• Our model demonstrates the power of ensemble learning by combining datasets whose features have different characteristics.
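For reference, the weighted log-likelihood ratio (WLLR) discussed in the evaluation is commonly computed as r(w, c) = P(w|c) · log(P(w|c) / P(w|¬c)). The sketch below is our own minimal version of that formula; the add-one smoothing and the toy documents are assumptions for illustration, not the paper's exact implementation.

```python
import math
from collections import Counter

def wllr_scores(pos_docs, neg_docs):
    """WLLR of each word toward the positive class:
    r(w) = P(w|pos) * log(P(w|pos) / P(w|neg)), with add-one smoothing."""
    pos = Counter(w for d in pos_docs for w in d.split())
    neg = Counter(w for d in neg_docs for w in d.split())
    vocab = set(pos) | set(neg)
    n_pos = sum(pos.values()) + len(vocab)
    n_neg = sum(neg.values()) + len(vocab)
    scores = {}
    for w in vocab:
        p = (pos[w] + 1) / n_pos   # smoothed P(w | positive)
        q = (neg[w] + 1) / n_neg   # smoothed P(w | negative)
        scores[w] = p * math.log(p / q)
    return scores

scores = wllr_scores(["good good nice"], ["bad awful good"])
```

Because the score is purely count-based, two words with similar class distributions (e.g., "không_thích" and "ghét") receive similar ranks regardless of their different intensities, which is the weakness the evaluation points out for Vietnamese.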

Conclusions and Future Research Plan
In this paper, we introduced a novel model that integrates the advantages of deep learning, machine learning, statistical, and rule-based techniques. Although the computational cost of the proposed system is higher than that of the compared algorithms, the system has several distinctive characteristics. We combined different methods, identified polarity shifting based on language structures and statistical techniques, and used a word embedding model with deep learning. This approach captures both 'surface features' and 'deep features' in text and allows our system to achieve better results than other models. In addition, the experiments demonstrated that the proposed model works effectively in other languages, such as English.
Our future work will focus on further analysis and experiments on deep learning and on polarity-shifting structures in text, with the objective of uncovering the limits (if any) of existing models when applied to different datasets and domains—in particular, the sentiment classification of reviews on social networks (especially Twitter and Facebook), which currently attract the largest user populations. Texts of this kind are often short and written with complex structures that alter sentence meanings and make identification more difficult. One solution could involve broadening the rules and the vocabulary corpus to handle these specific sentences. We also intend to experiment with other deep learning models to strengthen the system and enhance its accuracy. As mentioned in the evaluation, replacing the WLLR statistical method with the VNSD dictionary [31] is being considered for sentiment score ranking.

Figure 1 .
Figure 1.Architecture of sentiment classification system based on ensemble learning model.


Figure 5 .
Figure 5. Results experimented on corpus of HOTEL-Review.


Figure 9 .
Figure 9. Results experimented on corpus Books.


Figure 10 .
Figure 10. Results experimented on corpus Kitchen.



Table 1 .
Notable research on the application of ensemble learning to sentiment classification.

Table 2 .
Examples of recent research on sentiment classification using deep learning.

Table 3 .
Statistics realized depending on some negation words' occurrences in Vietnamese language corpus of hotel reviews.

Table 4 .
Statistics based on some contrast structure words occurring in Vietnamese language corpus of hotel reviews.


Table 5 .
Information details about two datasets used in experiment.
• LSTM: sentiment classification using Long Short-Term Memory with 2 × 64 hidden-layer units and Word2Vec feature representation. The original dimension of our one-hot vector is 74,268, reduced to 300 after word embedding. Dropout and recurrent dropout are 0.5; the activation function is sigmoid.
• Classifier ensemble model CEM(4C-LR): meta-learner using Logistic Regression and four base learners: contrast learner, inconsistency learner, negation learner, and no_shift learner.
• CEM(5C-LR): meta-learner using Logistic Regression and five base learners: contrast learner, inconsistency learner, negation learner, no_shift learner, and full learner.
• CEM(6C-LR), the proposed model: meta-learner using Logistic Regression and six base learners: contrast learner, inconsistency learner, negation learner, no_shift learner, full learner, and LSTM learner.
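The embedding step mentioned above (one-hot vectors of dimension 74,268 reduced to 300) is, in effect, a row lookup in an embedding matrix. The following numpy illustration shrinks the vocabulary size for brevity; the random matrix stands in for trained Word2Vec weights.

```python
import numpy as np

vocab_size, embed_dim = 1000, 300  # the paper uses 74,268 -> 300
rng = np.random.default_rng(42)
E = rng.normal(0, 0.01, (vocab_size, embed_dim))  # embedding matrix (e.g., Word2Vec)

word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by E is the same as indexing row `word_id`,
# which is why embedding layers avoid materializing the one-hot vector.
dense = one_hot @ E
```

This equivalence is why frameworks implement the embedding layer as a lookup table rather than a matrix multiplication.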

Table 6 .
Results experimented on two corpuses of reviews about hotels in Vietnam (HOTEL-Reviews) and UIT-VSFC.

• MLP (Multilayer Perceptron): sentiment classification using a neural network with 160 inputs, 2 × 50 hidden-layer neurons, and 2 outputs; the activation function is softmax.
• PSDEE: the method proposed by Rui Xia et al. [5] using four sub-datasets (contrast, inconsistency, negation, and no_shift) with unigram features; four base learners are used for training and combining.

Table 7 .
Results experimented on four corpuses of Electronics, DVD, Books, and Kitchen.
