A Feature-Based Approach for Sentiment Quantification Using Machine Learning

Abstract: Sentiment analysis has been one of the most active research areas in the past decade due to its vast applications. Sentiment quantification, a new research problem in this field, extends sentiment analysis from individual documents to an aggregated collection of documents. Sentiment analysis has been widely researched, but sentiment quantification has drawn less attention despite offering a greater potential to enhance current business intelligence systems. In this research, to perform sentiment quantification, a framework based on feature engineering is proposed to exploit diverse feature sets such as sentiment, content, and part of speech, as well as deep features including word2vec and GloVe. Different machine learning algorithms, including conventional, ensemble learners, and deep learning approaches, have been investigated on the standard datasets of SemEval2016, SemEval2017, STS-Gold, and Sanders. The empirical results reveal the effectiveness of the proposed feature sets in the process of sentiment quantification when applied to machine learning algorithms. The results also reveal that the ensemble-based algorithm AdaBoost outperforms other conventional machine learning algorithms using a combination of proposed feature sets. The deep learning algorithm RNN, on the other hand, shows optimal results using word embedding-based features: RNN with GloVe performed best for SemEval2016 and SemEval2017, and RNN with word2vec performed best for STS-Gold and Sanders. This research has the potential to help diverse applications of sentiment quantification, including polling, trend analysis, automatic summarization, and rumor or fake news detection.


Introduction
The social web has changed the way people communicate. The emergence of social media channels has resulted in the rapid creation of textual content. People create and post content using social interaction platforms such as the web, discussion forums, Facebook, Twitter, etc. This rapidly growing content carries sentiment information, which offers researchers the potential to obtain people's opinions through social media about entities in business, academia, products, marketing, etc. To extract meaningful information from such raw data, the field of sentiment analysis has gained prominence [1,2]. Sentiment analysis is an active research area that classifies opinions in text as negative, positive, or neutral. It also determines the grade of polarity (high, moderate, or mild). Sentiment analysis is carried out at three levels: document level, sentence level, and phrase level. Document-level sentiment analysis is the most popular and is followed by numerous opinion mining techniques. It groups documents and classifies the target documents into the required set of classes. For binary classification, the target documents are classified as positive or negative, while for ternary classification the required classes are positive, negative, and neutral. Document-level sentiment analysis does not consider diverse factors for analysis. Sentence-level analysis, by contrast, considers each sentence and extracts its single opinion; it is based on the subjectivity of sentences. Neither document-level nor sentence-level sentiment analysis gives a clear understanding of the polarity of the text. Sentiment analysis has various research areas, including subjectivity analysis, sentiment polarity detection [3], sentiment quantification, etc. [4].
Sentiment quantification deals with estimating the distribution of class labels across a collection of content, rather than labeling individual documents. For sentiment quantification, various methods that include Classify and Count, Adjusted Count, and Instance-based Quantification Trees [5] are commonly used in different studies. However, an analysis of previous classification algorithms shows that these standard algorithms are not an optimal solution for quantification. In this regard, research suggests that quantification should be considered a different problem from classification and should be addressed using diverse approaches [6]. Hence, it opens new research opportunities to explore different approaches and develop new methods in this domain.
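As a minimal illustration of the Classify and Count family, CC simply counts the labels predicted by some (here hypothetical) pre-trained classifier, while Adjusted Count corrects the CC estimate using the classifier's true- and false-positive rates measured on held-out data. This is a sketch, not the paper's exact implementation:

```python
from collections import Counter

def classify_and_count(predicted_labels, classes):
    """CC: estimate class prevalence as the fraction of predicted labels."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return {c: counts.get(c, 0) / n for c in classes}

def adjusted_count(cc_estimate, tpr, fpr):
    """ACC (binary case): correct the CC estimate for the positive class
    using the classifier's true/false positive rates; clip to [0, 1]."""
    est = (cc_estimate - fpr) / (tpr - fpr)
    return min(max(est, 0.0), 1.0)

# Toy predictions from an assumed tweet classifier
preds = ["pos", "neg", "pos", "neu", "pos", "neg"]
prevalence = classify_and_count(preds, ["pos", "neg", "neu"])
# prevalence is roughly {'pos': 0.5, 'neg': 0.333, 'neu': 0.167}
adjusted = adjusted_count(prevalence["pos"], tpr=0.9, fpr=0.2)
```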
Consequently, it raises the need to devise sentiment quantification-based methods that deliver high accuracy. To address the issue of accuracy, this research contributes to the field of sentiment quantification as follows: • Novel feature sets are proposed such as pos, tweet, specific, content, and sentiment, with the ranking of features carried out using feature selection approaches. • Deep features including word2vec and GloVe are used for sentiment analysis, and these features are considered for sentiment quantification. The remainder of this paper is organized as follows: Section 2 provides a review of existing research studies in the relevant literature. Section 3 provides details of the proposed research methodology. Section 4 provides a comprehensive discussion of the empirical-based results. Section 5 concludes the paper.

Related Work
Here, we discuss the existing research on quantification based on sentiment analysis, which is divided into three main classes: aggregated methods, non-aggregated methods, and ensemble-based methods.

Sentiment Quantification
Sentiment quantification is the process of estimating how frequently each class occurs in data; it is also known as prevalence estimation [7]. Quantification is used in different fields to deal with aggregated data. Sentiment quantification has various research applications, some of which are discussed here. Sentiment quantification methods have been used to detect communities [8], for cross-lingual quantification [9], for public health monitoring [10], and for tweet classification [11].

Aggregated Methods
The quantification approach is preferred to predict the class prior probabilities. Classify and Count (CC) is a famous technique used for quantification. However, CC is lacking in the estimation of the class distribution. Newer approaches are presented to overcome the processing limitation of CC based on Sample Means Matching (SMM). SMM is very effective in quantifying a large amount of data per second. Twenty-five datasets are taken to perform the experiments. The proposed technique has outperformed the existing methods of quantification [12]. Further, a model titled Ordinal Quantification based on trees is proposed. The purpose of the proposed technique is to accurately count the frequency of each class of unlabeled items in text. In ordinal quantification, the order of the class is defined. The same approach is utilized to find the highest stars in products' reviews by analyzing their class prevalence over time. The proposed approach is evaluated on the SemEval2016 dataset and outperforms state-of-the-art methods [13].
Classifying data in deep layers is a complex task. Various techniques such as Neural and Statistical Machine Translation are used for this purpose, but are lacking in encoding and decoding while learning data from deep layers. To address these issues, Expectation Maximization (EM) is a quantification technique used for the automatic detection of errors in Arabic text to overcome the shortcomings of neural and statistical translation methods. EM dynamically combines information around layers. Moreover, during training, Kullback-Leibler divergence (KLD) is used to improve the model's performance. The proposed approach is evaluated on two standard datasets, namely QALB-2014 and QALB-2015. The experiments showed that the proposed approach outperformed the previous techniques in terms of F1 score [14].
EM is also applied in the field of rumor detection for Arabic tweets. The proposed method is based on a semi-supervised EM method to extract the user- and content-based features from tweets. Both feature sets are tested to check their significance. The proposed feature sets are trained through a semi-supervised EM method with a small base of labelled data. The proposed method was compared with the Gaussian method and outperformed the baseline with 78.6% accuracy [15].
An estimation of class proportions based on counting the classification errors is also used for classification purposes. A new method is proposed to adjust the classification errors by building confidence intervals. The model is introduced for the quantification of social media. The proposed approach is better than previous approaches that used the accurate estimation intervals [16].
The CC method has given rise to many other derived methods. "QuaNet", another derived method, is introduced using Recurrent Neural Network (RNN) to learn "quantification embeddings". These embeddings are firstly learned by a model then elaborated by CC. This approach is tested on Kindle, IMDb, and HP datasets. The results have shown the effectiveness of this model over existing quantification techniques [17].

Non-Aggregated Methods
González-Castro et al. [18] developed a model to quantify data based on a divergence measure. Hellinger distance is used for data distribution and validation, and to find the mismatch between test and validation sets. Prior probability estimation is used to minimize divergence. HDx and HDy are two variants of Hellinger distance, where HDy requires output from the classifier and HDx does not. Hopkins et al. introduced a non-parametric technique to quantify data [19]. The proposed approach quantifies data without any need for classification. American presidential blogs were selected as a dataset, with the proposed method reducing estimation bias. A software application was developed to quantify thousands of opinions about the US presidency.

Ensemble-Based Methods
Ensemble learners combine several weak learners. Some aggregated methods are combined to address the data distribution issues in sentiment analysis. Adjusted Classify and Count and HDy are combined to make an ensemble model. CC, AC, PCC, PAC, and HDy are applied for learning the proposed ensemble model. Two schemes are presented to learn and predict. All learners are used to give predictions, then four sets of measures are applied to select the best model [20].
Ensemble methods give optimal results by building various training sets. Each model is then trained using data distribution techniques for quantification. The proposed methods categorize the errors of data distribution to enhance the performance of ensemble learners. The model explicitly addresses the binary quantification problem by focusing on the change in the expected distribution for each class. The results have shown that ensemble-based methods have outperformed prior techniques [21].
The ensemble method is also explored in the field of soundscape ecology. A new approach is introduced which combines quantification and classification to train the CNN to classify classes of birds. The experiments show that the quantification performed better than the classification for the classification of bird species [22].
To obtain optimal accuracy for sentiment quantification, machine learning techniques have reported promising results. However, due to the sensitive nature of sentiments in the opinion-seeking process, there is a need to achieve more accurate results. Non-lexicon-based approaches are not widely applied to sentiment quantification. The role of diverse features, along with non-lexicon approaches, can be exploited to determine their contribution to improving the classification accuracy of sentiment quantification. Some of these studies are summarized in Table 1.

Problem Statement and Formulation
Accuracy is an important parameter in the field of sentiment analysis. In the literature, various feature sets have been exploited using machine learning techniques to improve results. However, there is still a need to investigate those feature sets for the emerging domain of sentiment quantification and to further improve accuracy, given the sensitive nature of sentiment in the opinion-seeking process. There is a need to contribute to the field of sentiment quantification by investigating the impact of feature sets on it. In addition, as existing research studies focus only on machine learning, deep learning approaches also need to be explored.
Formally, the research problem is to estimate the distribution of a set D = {d_1, d_2, ..., d_q} of unlabeled documents across a set C = {c_1, c_2, ..., c_p} of classes. In line with the relevant literature, our research deals with |C| = 3; the three classes are positive, neutral, and negative. As our focus is the single-label multi-class (SLMC) quantification task, we consider the measures that have been proposed for its evaluation. The notation includes the quantification loss, denoted by ∇(p̂, p, D, C), which estimates the error ∇ of the predicted distribution p̂ with respect to the true distribution p over the set D and classes C.

Proposed Research Methodology
This section describes the approach used to quantify tweets based on sentiment analysis. A framework is proposed to give insights into the steps followed for sentiment quantification. A detailed discussion follows on the feature engineering, algorithms applied, datasets considered, and performance evaluation measures used in this research.

Framework for Sentiment Quantification
The proposed model demonstrates the procedure carried out for sentiment quantification, as shown in Figure 1. In the first step, cleansing of the standard datasets (SemEval2016, SemEval2017, STS-Gold, and Sanders) is performed using data preprocessing techniques. Data preprocessing includes spaces removal, tokenization achievement, stop-words removal, case conversion, removal of words of less than three letters, and lemmatization for content feature extraction. In the second step, features based on content, POS (part of speech), tweet specific, and sentiment features are extracted using libraries of Python. Parameter settings for optimizing all classifiers are shown in Table 7. Further, the traditional machine learning approaches and deep learning approaches such as NB, AdaBoost, DT, RF, SVM and RNN, CNN_LSTM, and DBN are applied for sentiment quantification. Afterward, to count and classify the instances of data, the Classify and Count (CC) method is applied. Next, to evaluate the performance of machine learning classifiers, performance evaluation measures are applied for sentiment quantification.

Feature Engineering
Feature engineering consists of feature extraction and selection to achieve optimal accuracy. The selection of features has a major impact in achieving the desired results. Here, the discussion is divided into subparts such as proposed features, baseline features, and deep features to elaborate on the features' impact on quantification accuracy. It also presents the selection and ranking of the features.

Proposed Feature Sets
To perform sentiment quantification, sentiment-based features are extracted through the sentiment lexicon VADER. VADER is well known for the computation of sentiment features and has been used in various research studies [23,24].
POS tagging is applied to understand the nature of content-based features. The verb feature is used to obtain the action of an entity. The adjective count is considered because adjectives show the negative and positive characteristics of a tweet. The content-based feature exploits the diverse characteristics of text in tweets. Question marks, exclamation marks, and special characters are counted to check whether a person is asking a question or trying to attract attention. The retweet feature is also important to check if a tweet contains facts. If a tweet is frequently retweeted, it potentially relates to a sensitive topic containing more sentiments. The mention feature is considered to check if another person is added to the discussion. Moreover, the URL feature is used to obtain the number of URLs shared by users to support their point of view. The hashtag contains the content topic; therefore, this feature is also considered. A list of proposed features is shown in Table 2. Features such as n-gram, word2vec and Bag-of-Words (BoW) are also used for sentiment quantification. The n-gram is a technique of word embedding, while BoW is used in natural language processing (NLP). BoW takes frequencies of each word to train a classifier. Term frequency-inverse document frequency (TF-IDF) finds the frequency of a particular word in text. Word2vec and GloVe are used to represent words into vectors. The feature word2vec also finds the syntactic and semantics' similarity between words, while GloVe divides words into clusters to find similar and dissimilar words.
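The tweet-specific counts described above can be sketched with simple pattern matching (a minimal illustration; the feature names and regular expressions here are assumptions, not the paper's exact extraction code):

```python
import re

def tweet_specific_features(tweet: str) -> dict:
    """Count surface cues commonly used as tweet-specific features."""
    return {
        "hashtags": len(re.findall(r"#\w+", tweet)),       # topic markers
        "mentions": len(re.findall(r"@\w+", tweet)),       # people in discussion
        "urls": len(re.findall(r"https?://\S+", tweet)),   # supporting links
        "questions": tweet.count("?"),                     # asking a question
        "exclamations": tweet.count("!"),                  # attracting attention
        "is_retweet": int(tweet.startswith("RT ")),        # shared content
    }

feats = tweet_specific_features("RT @user Check this out! https://t.co/x #news ?")
# every counter above fires exactly once on this toy tweet
```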

Feature Selection and Ranking
Optimal features increase the performance of classifiers. To find the optimal feature sets, three widely used feature selection methods are applied: Information Gain (IG), Gain Ratio (GR), and Relief-F. IG measures the mutual information between a feature and the class, but is biased toward features with many values. GR compensates for this bias by normalizing IG across attribute values. Relief-F scores each attribute by comparing the closest neighboring instances of the same and different classes.
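An illustrative information-gain computation for a single discrete feature, built from class entropy (a sketch of the idea, not the exact tooling used in this research):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature) for one discrete feature."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy example: a binary feature that perfectly separates the two classes
labels = ["pos", "pos", "neg", "neg"]
feature = [1, 1, 0, 0]
print(information_gain(feature, labels))  # 1.0 bit
```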
Optimal features are selected through feature selection techniques, with features that have low importance or a negative influence on the target class omitted. Features ranked by the feature selection algorithms according to their importance are shown in Table 3. Table 3 shows that the sentiment features have a greater impact than other features. Negative sentiments and negative emoticons have a greater impact and are ranked higher than positive sentiments, consistent with existing research studies [25]. The adjective and adverb features of POS have high scores and describe the attributes of an object. Verbs have a greater impact than nouns and capture the actions of an entity. POS features have a greater impact for predicting the target class.
Content features such as WH, quoted, and repetitive content have high scores due to their subjective and opinionated nature, respectively [26]. Special characters, followed by exclamation marks, are ranked next, signaling discussion within the content. Another content-based feature, URL, has a high score as it supports the opinion on the subject. Hashtags are ranked low, as they relate to topics that appear in both objective and opinionated content. Retweets rank lowest, since they repeat a topic and may contain no sentiment of their own.
Among the baseline features, TF-IDF and n-gram have higher scores than BoW. Among deep features, GloVe has a higher score than word2vec due to its faster training. In addition, GloVe combines the benefits of the word2vec-based skip-gram model in word analogy tasks such as sentiment analysis and stance classification.

Classification Algorithms Applied
This subsection discusses the machine learning technique applied for sentiment quantification. Machine learning techniques are divided into three categories: traditional algorithms, ensemble learners, and deep learners.

Machine Learning Techniques
Some machine learning approaches applied on tweets for sentiment quantification are discussed.

Support Vector Machine (SVM)
SVM is a supervised linear classifier. It finds a maximum-margin hyperplane in high-dimensional data to separate negative and positive instances. SVM optimization is calculated through the formulas shown in Equations (1) and (2).
where n is the number of training instances, i indexes the linear combination of training inputs, q is the training output, m is the cost function, and x and y measure similarity via the dot product c. SVM is not suitable for noisy and large datasets because its training process requires more execution time.

Naïve Bayes (NB)
NB is a conditional probability-based classifier. The NB probability estimation formula is given in Equation (3), where X is an event, Z is the evidence, P(X) is the probability of an event before the evidence is seen, and P(X|Z) is the probability of an event after the evidence is seen.
NB gives better performance for categorical data than numerical data.
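As a toy sketch of the conditional-probability estimate in Equation (3), Bayes' rule can be computed directly (the numbers below are invented purely for illustration):

```python
def posterior(prior_x, likelihood_z_given_x, evidence_z):
    """Bayes' rule: P(X|Z) = P(Z|X) * P(X) / P(Z)."""
    return likelihood_z_given_x * prior_x / evidence_z

# Assumed toy numbers: 60% of tweets are positive (P(X)); the word "great"
# appears in 30% of positive tweets (P(Z|X)) and 20% of all tweets (P(Z)).
p = posterior(prior_x=0.6, likelihood_z_given_x=0.3, evidence_z=0.2)
print(p)  # 0.9 -> P(positive | "great")
```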

Decision Tree (DT)
DT is based on rules derived from the data and works on the principle of entropy and Information Gain (IG). DT helps to reduce the preprocessing execution time for missing attributes. The entropy is calculated using the formula in Equation (4).
where p_h is the probability that an instance belongs to class h. To express the information in bits, the log2 function is used. En(X) is the required entropy for the class label, also simply known as the entropy.

Ensemble Learning Techniques

AdaBoost
AdaBoost is an ensemble method that aggregates the strong and weak learners. This technique helps to give more accurate decisions for predicting the target class. This method is also favorable for attributes that are misclassified during prediction. While training, each element is assigned a weight. The weight assignment calculation is shown in Equation (5).
where e is the number of elements to be trained. Misclassified instances are computed as shown in Equation (6).
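The weight mechanics behind Equations (5) and (6) can be sketched as follows: each of the e training elements starts with an equal weight 1/e, and the weights of misclassified instances are then increased so the next weak learner focuses on them. The exponential update shown here is the standard AdaBoost form, assumed rather than taken from the paper:

```python
import math

def initial_weights(e):
    """Equation (5): every training element starts with equal weight 1/e."""
    return [1.0 / e] * e

def reweight(weights, misclassified, error):
    """Boost misclassified instances (standard AdaBoost update, assumed).
    alpha grows as the weak learner's error shrinks."""
    alpha = 0.5 * math.log((1.0 - error) / error)
    new = [w * math.exp(alpha if miss else -alpha)
           for w, miss in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]  # renormalize so weights sum to 1

w = initial_weights(4)  # [0.25, 0.25, 0.25, 0.25]
w = reweight(w, [True, False, False, False], error=0.25)
# the misclassified first instance now carries more weight than the others
```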
Random Forest (RF)

RF is based on an ensemble of regression trees. RF is suitable for high-variance data and averages the outputs of its trees. RF works on a strategy of votes to aggregate responses over data attributes. This approach applies the bagging method k times. For q = 1, ..., m, RF trains its regression tree by the formula given in Equation (7).

Deep Learning Techniques

Deep Belief Networks (DBN)
DBN is a deep learning technique that follows the methods of probability and statistics. The DBN architecture contains hidden layers and blocks, with layers interconnected but blocks separated from each other. For sentiment analysis, DBN is increasingly used due to its efficiency in prediction [27].

CNN-LSTM
CNN (Convolutional Neural Network) is a deep learner but is not capable of calculating long-distance dependencies in data. LSTM (Long Short-Term Memory) can work well with long-distance dependencies and is combined with CNN to achieve the desired result for any biased datasets. CNN, along with LSTM, is applied for sentiment analysis and sequence-based text processing [28].

Recurrent Neural Network (RNN)
RNN is also a deep learner and is preferable for text processing and language translation. It works on the rule of memory. The previous output is saved and fed as input for the next phase. This strategy helps its sequential processing. RNN is applied for sentiment analysis [29].

Datasets
This subsection discusses the details of datasets selected for experimentation.

SemEval2017
SemEval2017 is a well-known multilingual dataset. This dataset consists of tweets in two languages: Arabic and English. English tweets are higher in number than Arabic tweets, which make up only 19% of the dataset. The dataset contains 6100 testing and 3355 training tweets in Arabic, and 12,284 testing and 50,333 training tweets in English. The details of the dataset, which has been used in earlier studies [32,33], are shown in Table 4.

STS-Gold
STS-Gold may present different sentiment labels because tweets and targets (entities) are annotated individually [34]. This dataset contains 1.6 million classified tweets. There were 1.28 million tweets used for training and 0.32 million tweets used for testing purposes.

Sanders
The Sanders dataset [35,36] is manually labelled by one annotator and consists of 5512 tweets. We have used 4410 tweets for training and 1102 tweets for testing purposes.

Performance Evaluation Measures
This subsection describes the performance evaluation measures used for sentiment quantification.

Absolute Error (AE)
This measure corresponds to the average absolute difference between the predicted class prevalence and the true class prevalence, using Equation (8).

Relative Absolute Error (RAE)
Relative absolute error (RAE) addresses the problem that occurs in normalized absolute error by scaling the difference p̂(c_j) − p(c_j) in Equation (9) with the true class prevalence.

Kullback-Leibler Divergence (KLD)
Another measure that has become the standard metric of quantification is normalized cross-entropy, better known as Kullback-Leibler divergence (KLD), which is used as a quantification measure and is defined in Equation (10).
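The three measures can be sketched directly from their definitions. The common ε-smoothing of the RAE and KLD denominators (applied when a true prevalence is zero) is left out for clarity:

```python
import math

def absolute_error(p_hat, p):
    """AE: mean absolute difference between predicted and true prevalence."""
    return sum(abs(p_hat[c] - p[c]) for c in p) / len(p)

def relative_absolute_error(p_hat, p):
    """RAE: AE with each class's difference scaled by its true prevalence."""
    return sum(abs(p_hat[c] - p[c]) / p[c] for c in p) / len(p)

def kld(p_hat, p):
    """KLD: normalized cross-entropy between true and predicted prevalence."""
    return sum(p[c] * math.log2(p[c] / p_hat[c]) for c in p)

true = {"pos": 0.5, "neg": 0.3, "neu": 0.2}
pred = {"pos": 0.4, "neg": 0.4, "neu": 0.2}
print(round(absolute_error(pred, true), 4))  # 0.0667
print(round(kld(pred, true), 4))
```

A perfect prediction gives 0 under all three measures, which is why lower values are reported as better throughout the results.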

Results and Discussion
According to the literature, there is room to improve the accuracy of sentiment-based quantification. Sentiment quantification has not been addressed with feature-based approaches to achieve the desired accuracy. To address this problem, we have proposed various feature sets to reach optimal accuracy for the quantification of tweets based on sentiment analysis. To evaluate our feature-based framework, machine learning approaches, subdivided into three categories (conventional algorithms, ensemble learners, and deep learning approaches), are applied on the SemEval2016, SemEval2017, STS-Gold, and Sanders datasets. To evaluate the performance of the classifiers, performance evaluation metrics are applied.

Single Feature Sets
Detailed experiments are performed to evaluate the effectiveness of our proposed features for the sentiment quantification task. To achieve this aim, each proposed feature set is tested on all datasets to obtain a detailed analysis. To evaluate their impact, the conventional algorithms NB, SVM, and DT and the ensemble learners AdaBoost and RF are applied on each set, including POS, content, sentiment, and tweet specific. The results suggest that POS features are the most effective when applied with AdaBoost, as evaluated through the performance evaluation metrics. AdaBoost dominated the other classifiers in terms of a lower error rate (KLD = 0.0213) for SemEval2016, SemEval2017 (KLD = 0.0214), STS-Gold (KLD = 0.0129), and Sanders (KLD = 0.0169), as shown in Table 5.

Combination of Feature Sets
To take the experiments to the next step, the proposed feature sets are combined with each other to determine the optimal pair of feature sets. The proposed features are combined in groups such as "sentiment + content" (SC), "sentiment + tweet specific" (ST), "sentiment + POS" (SP), "POS + tweet specific" (PT), "POS + content" (PC), "content + tweet specific" (CT), "sentiment + POS + content" (SPC), "sentiment + content + tweet specific" (SCT), "sentiment + POS + tweet specific" (SPT), "POS + content + tweet specific" (PCT), and all feature sets. The results have shown that when all proposed features are combined, they outperform all single feature sets. SVM outperformed other classifiers when applied with all feature sets "sentiment + POS + content + tweet specific" (SPCT). SVM has more promising results with a lower error rate for all four datasets, with KLD = 0.014 for SemEval2016, 0.013 for SemEval2017, 0.0051 for STS-Gold, and 0.0092 for Sanders, as shown in Tables 6-9, respectively.

Optimal Feature Sets
The results analysis is also presented in Figures 2 and 3. The results show the impact of POS as a single feature set. POS captures the action of an object and carries important information. Thus, when applied with machine learning algorithms, it shows promising results and a lower error rate for sentiment quantification, as shown in Figure 2 for all four datasets. When the feature sets are combined, their effectiveness increases, which shows the usefulness of these features; they contain meaningful information and outperformed other approaches when applied with SVM for all four datasets, as shown in Figure 3.

Results of Deep Features
Some of the deep features are also exploited to find out their impact on sentiment quantification. Deep features, including GloVe, BoW, word2vec, and n-gram, are extracted from all four datasets, SemEval2016, SemEval2017, STS-Gold, and Sanders. The deep features are tested with deep learning approaches such as DBN, RNN, and CNN-LSTM. The deep learning approaches are chosen due to their scalability and efficiency. Deep learning approaches do not require feature engineering and are suitable to achieve desired results. The results suggest that RNN is the best approach when applied with GloVe, which had a lower error rate (KLD = 0.009) for SemEval2016 and SemEval2017 (KLD = 0.011), and lower error rate (KLD = 0.004) for STS-Gold and Sanders (KLD = 0.008) when applied with word2vec among other deep learning approaches, as shown in Table 10. Deep learning approaches outperformed the conventional and ensemble-based machine learning approaches due to their high efficacy.

Conclusions
This study contributes to the field of quantification based on sentiment analysis. The study exploits the diverse feature sets and explores the performance of machine learning approaches for the quantification of tweets. The proposed feature sets, such as POS, tweet specific, and sentiment-and content-based, increase the performance of classifiers. When the proposed feature sets are combined, they demonstrate efficient results in terms of quantification accuracy.
Three conventional machine learning approaches, namely Naïve Bayes (NB), Decision Tree (DT), and Support Vector Machine (SVM), are used in the proposed framework. AdaBoost and Random Forest are used in the case of ensemble-based approaches. Recurrent Neural Network (RNN), Deep Belief Network (DBN), and a combined Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) are exploited in the deep learning category of approaches. The ensemble approach AdaBoost dominated the other classifiers when applied using a single feature set, in terms of a lower error rate (KLD = 0.0213) for SemEval2016, SemEval2017 (KLD = 0.0214), STS-Gold (KLD = 0.0129), and Sanders (KLD = 0.0169). When the feature sets are combined, SVM has more promising results with a lower error rate for all four datasets, with KLD = 0.014 for SemEval2016, 0.013 for SemEval2017, 0.0051 for STS-Gold, and 0.0092 for Sanders. The computed results show that RNN with GloVe performed best for SemEval2016 and SemEval2017, and RNN with word2vec performed best for STS-Gold and Sanders.
Future work directions are as follows:
• As social web channels allow users to add multilingual content, diverse research issues arise for natural language processing and context understanding. In the case of multilingual content, especially where the diversity of different languages' structures presents issues such as sentence structure, stemming, parsing, and tagging, more research is needed.
• Each language has its own syntax and vocabulary, and the text-based features of each language present different research challenges. Therefore, applying the proposed features and algorithms to languages such as Arabic, Persian, and Urdu will be interesting research work, as these languages are written from right to left.
• The analysis and learning carried out using one language can be applied to another language using cross-lingual analysis. Thus, the cross-lingual sentiment quantification task can also be a potential research area, especially for languages that lack annotated datasets.