GBSVM: Sentiment Classification from Unstructured Reviews Using Ensemble Classifier

Abstract: User reviews on social networking platforms such as Twitter, Facebook, and Google+ have been gaining growing interest on account of their wide usage in sentiment analysis, which serves as feedback to public and private companies as well as governments. The analysis of such reviews not only plays a noteworthy role in improving the quality of services and products but also helps to devise marketing and financial strategies that increase companies' profit and customer satisfaction. Although many analysis models have been proposed, there is still room for improving the processing, classification


Introduction
With the wide proliferation of smartphones, recent years have seen the inception and expansion of social platforms such as Twitter, Facebook, and Google+, which are collectively referred to as 'social media'. With its growth, social media has become one of the interactive technologies that encourage users to create ideas and share information and opinions in the form of expressions or reviews. Such expressions are full of information that serves as user feedback and can be utilized to revise and devise policies to improve the quality of both products and services. However, the extraction of users' opinions from the reviews is not a trivial task. A specialized field called 'sentiment analysis' [1] offers a variety of techniques and tools for identifying and extracting subjective information that represents users' opinions. These techniques and tools fall under natural language processing [2]. The mining of users' opinions from user reviews is called opinion mining and is practiced within text mining. Opinion mining aims to build a system that extracts and classifies reviews and identifies the opinions within the text.
Traditionally, sentiment analysis classifies the polarity of opinions into positive, negative, and neutral classes [3]. The polarity is based on the emotions or specialized words that are present within users' reviews. These reviews are significant for public as well as private companies because of the information they contain. The reviews express likes and dislikes about a particular product or service, so they hold information that companies can use to improve the quality of their products. This information can further help to devise or revise policies about particular products. The textual comments are much more significant than the numeric score because they represent exactly what people say about a particular product [4].
Even though this content holds meaningful information for governments, businesses, and individuals, the bulk of user-generated content has to be processed using text mining techniques and sentiment analysis. This process is not trivial, and sentiment analysis faces several challenges that hinder the accurate interpretation of sentiments and the detection of the correct sentiment polarity [5]. The first challenge is the nature of the review text, which can be either semi-structured or unstructured. Since most users who write reviews are novice, non-expert, and non-professional writers, they do not follow any set rules to express their views, which results in semi-structured and unstructured data [6]. Similarly, domain dependence, review structure, and language semantics make the analysis more challenging. For a more detailed list of sentiment analysis challenges, readers are referred to [5].
Supervised as well as unsupervised machine learning techniques have been applied for sentiment analysis, which extracts meaningful information from structured and unstructured text data to aid decision-makers. Supervised techniques have proven to be more effective in determining the polarity of sentiments; however, they require large amounts of labeled data, which is not easy to obtain [7]. Unsupervised techniques, though not superior, are still advantageous as they can work without labeled data. Support Vector Machine (SVM), Naive Bayes (NB), and tree-based approaches are reported to show good performance in sentiment analysis [8]. The selection of an appropriate feature set from the data is as important as the choice of the classification model. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), word2vec, and parts of speech are among the most widely applied features in sentiment analysis. A specific feature shows different results with different classification models, so an appropriate strategy is to investigate the use of different features with a variety of classifiers and analyze their performance. Research works [3,7] report superior performance of uni-gram and bi-gram features in sentiment classification. Even so, a single dedicated classifier is not suitable for tweets, and a combination of multiple classifiers, also called a voting or ensemble classifier, often shows excellent performance for sentiment analysis [9]. This paper first investigates the performance of various supervised machine learning models and then contrives a voting classifier to perform sentiment analysis on a Twitter dataset. The major contributions of this study are summarized as follows:
• Performance analysis of SVM, Gradient Boosting Machine (GBM), Logistic Regression (LR), and Random Forest (RF) is carried out for sentiment analysis. The polarity of the Google apps dataset is divided into positive, negative, and neutral classes for this purpose.
• A voting classifier, called Gradient Boosted Support Vector Machine (GBSVM), is contrived to perform sentiment analysis that is based on Gradient Boosting (GB) and SVM. The performance is compared with four state-of-the-art ensemble methods.
• The use of TF and TF-IDF is investigated, whereby uni-gram, bi-gram, and tri-gram features are used with the selected classifiers, as well as GBSVM, to analyze the impact on the sentiment classification accuracy.
The rest of the paper is organized in the following manner. Section 2 discusses important research works related to the current study. Section 3 outlines the details of the proposed voting classifier and describes the dataset used for the experiment. Section 4 is about the results of the experiment and discussions. Performance analysis of the proposed GBSVM is also made with other state-of-the-art models in the same section. In the end, conclusion and future work are given in Section 5.

Literature Review
Data on the internet is growing constantly and so is people's inclination to express their opinions. With the expansion of social platforms, the tools for obtaining people's opinions have changed, and fields like opinion mining and sentiment analysis are in increased demand. Owing to the potential impact that users' opinions can make on businesses, the importance of online reviews should not be underestimated. Consequently, a large number of research professionals are building systems that can extract the information from such reviews to benefit marketing analysis, drive public opinion, and increase customer satisfaction. As a result, sentiment analysis has been adopted and deployed in a large number of research areas and businesses.

Review Classification Using Sentiment Analysis
Authors in [10] perform sentiment analysis on data containing reviews about different electronic products such as mobiles, laptops, and monitors. User reviews are collected from Twitter for this analysis. A feature vector is created to handle emotional and misspelled keywords from the text. Feature extraction is done in two steps: first, all the Twitter-specific features, including hash-tags and emoticons, are extracted and added to the feature vector; after that, these features are removed from the tweets and uni-gram features are extracted. For classification, SVM, Naïve Bayes (NB), and maximum entropy models are used, which show very similar accuracy for tweet classification into positive, negative, and neutral classes. Similarly, authors in [11] perform sentiment analysis on Facebook posts. The research aims at classifying the sentiments of students by analyzing the text that they post on Facebook. Sentiment polarity is classified as positive, negative, or neutral. The authors develop an application called SentBuk that retrieves users' posted messages, including textual messages, comments, and likes, on Facebook. Sentiment classification is based on a hybrid technique that combines lexical and machine learning approaches, whereby the lexicon-based approach is used for pre-processing while the machine learning approach is applied at the classification level. The proposed approach achieves an accuracy of 83% and helps to inform teachers about the students' current state of mind and to support adaptive e-learning.
Opinion mining is very important for analyzing public opinions about a specific product and predicting its future sales trends, so many research contributions are dedicated to mining user reviews about products. For example, authors in [3] propose a method of extracting the polarity of users' sentiments about products. They develop a method that can automatically discover positive, negative, and neutral reviews about any given product. The approach uses self-tagged reviews as the training data. The data chosen for testing is extracted from Amazon and C|net and provides quantitative and binary ratings. The authors suggest that metadata substitutions and variable-length features can increase the accuracy of machine learning classifiers. In the same way, authors in [12] perform sentiment analysis on tweets to check the effect of the average person's tweets on the fluctuation of stock prices of a multinational firm, Samsung Electronics Ltd. The research presents an algorithm to give fast and accurate results for user sentiments based on the tweets. The proposed algorithm calculates a sentiment score based on the keywords found in the tweets and assigns it to each tweet, whereby the score can be 1, 0, or −1 for positive, neutral, and negative, respectively. Opinion mining is not limited to analyzing product reviews; it can also be applied to predict users' sentiments in reaction to various events. For example, research [13] uses sentiment analysis on Twitter reviews to predict public reaction to various events that occurred in the FIFA World Cup 2014. Manually labeled data is used for training, which is then used to find the correlation between the Twitter sentiment and a particular event. The model used is a Bayesian logistic regression that uses lexicons, uni-gram, and bi-gram features to detect subjective or objective tweets. The tweets are processed to represent the public sentiments towards unexpected events.
Another research [14] proposes a novel approach based on SentiWordNet to carry out opinion mining using web data. The count of the score falls under seven categories: strong-positive, positive, weak-positive, neutral, weak-negative, negative, and strong-negative to test the efficacy of NB, SVM, and multi-layer perceptron. The proposed approach is evaluated on movie and product web domains and results are compared against the performance of the selected classifiers. Results demonstrate that the proposed approach outperforms the selected machine learning classifiers.

Sentiment Analysis for Different Languages
Sentiment analysis faces numerous challenges on account of the various languages used in tweets and reviews, and hence many research works focus on analysis within a particular language. For example, research [15] performs sentiment analysis on Arabic text. An Arabic dataset is used that has been labeled through crowdsourcing. Ten-fold cross-validation is used for splitting the data into training and test sets. Machine learning techniques including K Nearest Neighbor (KNN), NB, and SVM are used for detecting the polarity of given reviews. Results show that SVM gives the best precision, which is 75%, and KNN gives the best recall, which is 69%. Similarly, authors in [16] perform sentiment analysis on Vietnamese text. The research proposes an approach for extracting and classifying Vietnamese text into various sentiments. The semi-supervised learning approach General Knowledge-Latent Dirichlet Allocation (GK-LDA) performs better than the traditional topic modeling approach LDA. The superior performance is on account of a dictionary-based approach that extracts noun-phrases rather than merely extracting word seeds.
In the same manner, research [17] conducts a sentiment analysis of the Thai language. Because Thai is a non-segmented language in which text is written as a long sequence of characters without word boundaries, traditional Bag-of-Words (BoW) approaches are not suitable for Thai text. The proposed approach is machine learning-based and uses ORCHID, a Thai word-segmented text corpus. Thai stopwords are then removed before emotion word tagging is done. A sentiment level is then assigned to reviews based on the detected emotional words. The proposed approach helps companies know the weaknesses of their products based on the analysis of user reviews. In another research [18], sentiment analysis is performed on reviews in different languages such as English, Dutch, and French. Reviews are collected from the World Wide Web and classification is performed using SVM, multinomial NB, and Maximum Entropy Classifier (MEC). For feature selection, uni-grams and BSubjectivity, bi-grams, and adjectives are used. Uni-gram and BSubjectivity give the most accurate results of 86% with SVM and 87% with MEC.

Research Works on the Pre-Processing in Sentiment Analysis
The tweets and reviews used for sentiment analysis often contain noise in the form of words that do not contribute towards the classification, and hence pre-processing is an important task in sentiment analysis. For example, research [19] employs sentiment analysis on Arabic tweets. The impact of stemming, feature correlation, and the n-gram model is thoroughly investigated for Arabic text. Then, three classifiers, SVM, NB, and KNN, are evaluated for their effectiveness in performing sentiment analysis on Arabic text. Results show that the selection of an appropriate pre-processing strategy is critical to achieving higher performance. Likewise, authors in [20] suggest that pre-processing is an important task, as it can substantially improve the classification and accuracy. Besides, research [21] shows that an appropriate pre-processing phase can elevate the performance of machine learning classifiers. Feature selection is another important task in sentiment analysis which can dramatically improve classification accuracy. Authors in [22] propose a novel filter-based probabilistic feature selection method called the Distinguishable Feature Selection (DFS) method that is used for text classification. Results show that DFS gives results competitive with other feature selection techniques.
Despite the various methods proposed and used for sentiment analysis, there is still room for improvement, as no method is suitable for all kinds of data. Specifically, machine learning models show varying performance when used with a specific pre-processing strategy and a particular feature selection method. Voting classifiers have shown superior performance to single classification models, as in [9], where a voting classifier based on LR and Stochastic Gradient Descent Classifier (SGDC) is used for tweet classification. Consequently, this research aims at devising a Voting Classifier (VC) that can perform better than already proposed models in predicting sentiment polarity from unstructured text.

Materials and Methods
This section provides the details of the proposed approach, the dataset used for the experiment, feature selection, selected machine learning classifiers and the accuracy metrics used to evaluate the performance. The proposed approach is based on GBM and SVM, so it is highly desirable to describe how these models work before the narration of the proposed approach. The following section discusses the machine learning classifiers selected for the experiments as well as GBM and SVM used for the proposed voting classifier.

Classifiers Selected for the Experiment
Machine learning models have shown good performance for sentiment classification. A variety of machine learning models are available that can be used for sentiment prediction, and many of these models have been implemented in SciKit learn [23]. For the current study, we used the SciKit learn stable version 0.22.0. This study considers the use of ensemble models for sentiment classification on account of their superior performance. Studies [24,25] show that ensemble methods can substantially elevate the performance of individual base learners for sentiment classification and hence are favorable for sentiment analysis. Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking) [26]. Supervised machine learning classification approaches involve building classifiers from labeled instances of texts or sentences.

Random Forest
RF is an ensemble model used for both classification and regression [27]. It trains multiple trees, where each tree is built using a random subset of the feature vector. The decisions of the trees are combined using a voting algorithm that gives the result. The sequence of features and their values generates the path to a leaf that represents the decision. While training, the values of the intermediate nodes are updated to minimize a cost function that evaluates the performance of the trees. RF reduces variance in two ways: first, by training on different samples of the data and, second, by using a random subset of the features [28]. This study uses the RF implementation provided by SciKit learn.

Logistic Regression
LR performs predictive analysis and is used to describe data and explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. LR is a statistical machine learning algorithm that classifies the data by considering the output variables on extreme ends and fits a logistic (S-shaped) curve that distinguishes between them [29]. This model is used not only for regression but also for classification tasks. It is one of the machine learning algorithms that provide low variance and great efficiency. A logistic model can be updated easily with new data using stochastic gradient descent.

Gradient Boosting Machine
GBM is a machine learning technique used for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees [30]. It builds the model in a stage-wise fashion as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function. In supervised learning, boosting is a method of converting weak learners into a strong learner. In boosting, each new tree is fit on a modified version of the original dataset. It has shown considerable success in practical machine learning applications [30,31]. A loss function is minimized in GBM to improve its performance, and gradient descent is the most widely used procedure for this minimization. In the gradient descent update, $\alpha$ represents the learning rate, which can be varied between 0 and 1 to tune the performance of GBM, and $\sum (y_i - y_i^p)$ is the sum of the residuals, i.e., the differences between the actual values $y_i$ and the predicted values $y_i^p$. The loss function is a measure that indicates how well the coefficients of a model fit the underlying data. A logical understanding of the loss function depends upon what we are trying to optimize [32]. The architecture of a GBM is given in Figure 1. GBM is not only useful in practical applications, but it also has significant usage in data mining challenges [33,34]. The functionality of GBM is described in Algorithm 1.
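As a hedged illustration of such a gradient descent update, the derivation below assumes a squared-error loss (an assumption made here for concreteness, not necessarily the loss used in the experiments), with $y_i$ the target, $y_i^p$ the current prediction, and $\alpha$ the learning rate:

```latex
\begin{align}
L &= \sum_i \left(y_i - y_i^p\right)^2 \\
\frac{\partial L}{\partial y_i^p} &= -2\left(y_i - y_i^p\right) \\
y_i^p &\leftarrow y_i^p - \alpha \frac{\partial L}{\partial y_i^p}
       = y_i^p + 2\alpha\left(y_i - y_i^p\right)
\end{align}
```

Under this assumption each step shrinks the residual $y_i - y_i^p$ by a factor of $(1 - 2\alpha)$, which is why a small learning rate yields gradual, stage-wise improvement.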

Support Vector Machine
SVM was originally developed by Cortes and Vapnik for binary classification [35]. It is a non-probabilistic binary linear classifier that constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, and other similar tasks [36,37]. An SVM uses a device called kernel mapping to map the data in input space to a high-dimensional feature space in which the problem becomes linearly separable [38]. The basic idea underlying the SVM for sentiment classification is to find a hyperplane that divides the documents, or in our case Google app reviews, according to sentiment, with the margin between the classes (positive, negative, and neutral) being as large as possible. The points that lie on the separation boundaries are called support vectors, as shown in Figure 2. The class separability depends upon the distance between the samples of the classes. In other words, the larger the distance between the support vectors (margins) is, the more distinguishable the classes are. The hyperplanes are obtained by solving a quadratic programming optimization problem [39]. The decision function of an SVM is related not only to the number of support vectors (SVs) and their weights but also to the a priori chosen kernel, which is called the support vector kernel [40,41]. For this purpose, various kernels such as radial, polynomial, and neural kernels are used with SVM [42]. The working principle of SVM is given in Algorithm 2 [43,44].

Algorithm 1 Friedman's Gradient Boosting Algorithm [31].
1: initialize $f_0$ with a constant
2: for t = 1 to M do
3: compute the negative gradient $g_t(x)$
4: fit a new base-learner function $h(x, \theta_t)$
5: find the best gradient descent step-size $\rho_t$
6: update the function estimate

Proposed Approach
In this section, a novel approach called GBSVM is introduced for sentiment classification and prediction of review sentiments of Google Play store apps. The purpose of the proposed approach is to perform multi-class sentiment classification on unstructured data. Figure 3 shows the diagram of the proposed methodology.

Algorithm 2 Support vector machine.
Input: input data $(x, y)_{i=1}^{N}$
1: for i = 1 to M do
2: closest_pair = min
3: if closest_pair == minimum then
4: for j = 1 to n do
5: decision_function_to_draw_hyperplane $\mathrm{sign}\left(\sum_i a_i z_i(x)\right)$
6: optimization of hyperplane to classify the data according to classes
7: set the optimal margin $\omega(a, a^{*}) = \sum_i (a_i - a_i^{*}) z_i$

Data Pre-Processing
Before describing the proposed approach, we briefly explain the pre-processing followed in the experiments conducted to evaluate the performance of both the proposed approach and the selected machine learning classifiers. A dataset often contains noise in the form of unnecessary data that does not contribute towards the classification and needs to be cleaned. Data pre-processing is the process used for removing noisy and incomplete data, and it plays a pivotal role in improving classification accuracy [21]. The dataset used in this study contains a large amount of unnecessary data that does not play any role in the prediction. Since training and testing time increases with the size of the dataset, removing unnecessary data can speed up the training process as well. Pre-processing involves the steps carried out to clean the data so that the learning efficiency of the models can be enhanced. For this purpose, the Natural Language Toolkit (NLTK) of Python has been utilized. It is a suite of text processing libraries that can be used for a variety of processing tasks, and we used NLTK 3.5b1 with Python 3 [45]. Figure 4 shows the pre-processing steps that are followed in the current study.

Figure 4. Steps followed in the pre-processing of the data. The recoverable steps include tokenization (divide the user reviews into tokens), punctuation removal, and stop-word removal (remove words like "an", "in", "is").

As a first step, all the reviews with missing values are identified and removed, as missing data can degrade the performance of the classifiers. Next, numerical values are removed from the text as they do not contribute towards the learning of the classifiers; removing them decreases the complexity of training the classifiers. Occasionally, reviews contain special symbols like a heart sign, thumb sign, etc. that need to be removed to reduce the feature dimension and improve performance. After that, the following punctuation []() /|, ; . ' is removed from the reviews because it does not contribute to the text analysis and, if left in place, impairs the model's ability to discriminate between punctuation and other characters.
As a next step, words are converted to lowercase because the text analysis is case sensitive. If this step is not carried out, the machine learning models will count for example 'Excellent' and 'excellent' as two different words which will ultimately affect the classifier's performance [46]. In the end, stemming is performed. It is a very important pre-processing step that removes the affixes from the words. It transforms the extended words into their base forms. For example, 'loves', 'loved', and 'loving' are the modified forms of 'love'. Stemming changes these words into the original/root form and helps to increase the performance of a classifier [47].
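The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: it relies only on the standard library instead of NLTK, the stop-word list `STOP_WORDS` is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as NLTK's Porter stemmer.

```python
import re
import string

# Tiny illustrative stop-word list (assumption; a real run would use a full list).
STOP_WORDS = {"a", "an", "in", "is", "the", "it", "this", "and", "to", "of"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(review):
    """Return a list of cleaned tokens, or None when the review is missing."""
    if review is None or not review.strip():
        return None                                   # drop missing values
    text = re.sub(r"\d+", " ", review)                # remove numerical values
    text = text.encode("ascii", "ignore").decode("ascii")  # drop special symbols
    # Replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table).lower()              # lowercase after cleaning
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("Loved it!! 5 stars, this is an Excellent app"))
```

With this crude stemmer, 'loves', 'loved', and 'loving' all collapse to the same token, which is the effect stemming is meant to achieve; a proper stemmer would produce more linguistically sensible roots.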

Proposed GBSVM
The proposed classifier is an ensemble model, also called a meta-classifier, that performs classification by combining the probability scores from different base models. The final predicted class is based on the aggregate results from the base models. The VC (GBSVM) presented in this study is the ensemble of two classifiers, GBM and SVM, which are used for the final prediction of the target class. GBM is an efficient algorithm that uses boosting to convert weak learners into a strong learner and trains many models in a gradual and sequential manner. GBM extends the decision tree algorithm. An ensemble that builds random, independent trees, where one tree has no correlation with another, does not work well; the trees in GBM, in contrast, are built to eliminate the shortcomings (residual errors) of the previous weak tree.
The SVM, on the other hand, is a linear model used to solve classification and regression problems. It deals with both linear and non-linear problems and works well for different applications [48]. SVM separates data into different classes by using a line or hyperplane. SVM can perform complex types of classification based on the labels provided to the algorithm. A kernel is used in SVM to map a low-dimensional space into a high-dimensional space, thereby converting problems that are not separable into separable ones.
The final result from the VC can be based on either hard voting or soft voting. Hard voting is a case of majority voting where the class prediction from each base model is considered. Here, the class label Y is predicted by the majority vote of the classifiers C.
For example, if the predictions from c 1 , c 2 , and c 3 are 'positive', 'negative', and 'positive', respectively, then the final prediction will be 'positive' by the majority vote.
The soft voting, on the other hand, considers the probability score from each classifier of a specific class that the current sample belongs to. At that point, soft voting criteria determine the class with the highest probability which it gets by averaging the individual values of the classifiers [49].
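The two voting rules can be written compactly as follows. The notation is an assumption introduced here for illustration: $C_j(x)$ is the label predicted by the $j$-th of $m$ base classifiers for a sample $x$, and $p_j(k \mid x)$ is the probability that classifier assigns to class $k$.

```latex
\begin{align}
\text{hard voting:}\quad \hat{Y} &= \operatorname{mode}\{C_1(x), C_2(x), \ldots, C_m(x)\} \\
\text{soft voting:}\quad \hat{Y} &= \arg\max_{k} \; \frac{1}{m} \sum_{j=1}^{m} p_j(k \mid x)
\end{align}
```

For the three-classifier example above, $\operatorname{mode}\{\text{positive}, \text{negative}, \text{positive}\} = \text{positive}$.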
The proposed GBSVM takes advantage of the strengths of both GBM and SVM and combines their predicted probabilities for a particular class to make the final decision. $M_{GBM}$ and $M_{SVM}$ are trained on the training dataset and then used to predict the probability for the positive, neutral, and negative classes separately. Using the predicted probabilities from the two classifiers, an average probability for each class is computed for a given review. The decision function is then used to decide the final prediction/label of the review, which is based on the highest average probability for a class. The working mechanism of the GBSVM is given in Algorithm 3. In this study, the soft voting technique is used. The VC in the current study is expressed as:

$$\hat{p} = \arg\max \left\{ \sum_{i}^{n} GBM_i, \; \sum_{i}^{n} SVM_i \right\}$$

Algorithm 3 Gradient Boosted Support Vector Machine (GBSVM).

Here $\sum_{i}^{n} GBM_i$ and $\sum_{i}^{n} SVM_i$ both give the prediction probabilities for each test example.
After that, the probabilities for each test example from both GBM and SVM pass through the soft voting criteria as shown in Figure 5. When a given sample passes through the SVM and GBM, they give a probability score for each class (positive, negative, neutral). Let GBM's probability score be 0.
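The soft-voting step for two base models can be sketched as below. This is an illustrative sketch, not the authors' implementation: the function name `soft_vote`, the class order in `CLASSES`, and the probability scores are all hypothetical and are not taken from the paper.

```python
CLASSES = ("positive", "negative", "neutral")

def soft_vote(prob_gbm, prob_svm, classes=CLASSES):
    """Average two per-class probability vectors; return (label, averaged)."""
    if not (len(prob_gbm) == len(prob_svm) == len(classes)):
        raise ValueError("probability vectors must have one entry per class")
    averaged = [(g + s) / 2.0 for g, s in zip(prob_gbm, prob_svm)]
    # Pick the class whose averaged probability is highest.
    best = max(range(len(classes)), key=lambda k: averaged[k])
    return classes[best], averaged

# Hypothetical scores for one review (illustrative values only).
gbm_scores = [0.50, 0.30, 0.20]
svm_scores = [0.20, 0.60, 0.20]
label, averaged = soft_vote(gbm_scores, svm_scores)
print(label, averaged)
```

In this made-up case GBM alone would favor 'positive' while SVM favors 'negative', and the averaged scores settle on 'negative'. In scikit-learn, comparable behavior is available through `VotingClassifier` with `voting="soft"`, which requires base estimators that expose class probabilities (for `SVC`, this means `probability=True`).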

Dataset
The dataset plays a very important role in sentiment analysis. This study utilizes a dataset that contains mobile application reviews for Google apps. The dataset is freely available and has been downloaded from [50]. It contains user reviews about Google apps in the English language and comprises 64,295 records with attributes including 'App', 'Translated_Reviews', and 'Sentiments'. The description of the dataset attributes is given in Table 1.

Table 1. Description of the dataset attributes.

| Attribute | Description |
| --- | --- |
| App | Represents the actual name of the app on the Google Play store. |
| Translated_Reviews | Consists of the reviews given by individual users. |
| Sentiments | Contains positive, negative, and neutral sentiments. |

The dataset contains the names of different apps, and the 'Translated_Reviews' attribute shows users' reviews for each app. There are three classes in the 'Sentiments' attribute, namely positive, negative, and neutral. Figure 6 shows the distribution of positive, negative, and neutral reviews. There are 23,998 positive, 8271 negative, and 5158 neutral reviews in the dataset.

Feature Selection
Feature selection is a method of selecting the subset of relevant features for model construction, thus reducing the training time of models, simplifying the models, improving the probability of generalization, and avoiding overfitting. In feature extraction, raw input data are transformed into features. Supervised machine learning algorithms require text data in the form of vectors for model training, so text data must be extracted and converted into features without losing any important information. There is a large variety of features that have shown promising results in classification, e.g., BoW (Bag of Words), word2vec, and TF. BoW is the representation of a given text as a vector of fixed length. It does so by counting the occurrence of each word in a given text, i.e., it assigns a count to each word [51]. BoW, however, is not as efficient as TF or TF-IDF on account of its inability to capture the semantics of documents [52]. Word2vec feature extraction is based on a two-layer neural network that is used to extract feature vectors from a text corpus. It uses either the CBoW (Continuous BoW) or the SG (Skip-Gram) model to do that [53]. This study considers TF and TF-IDF with its variants [54]. TF counts the frequency of words in the document, i.e., it converts a collection of text documents into a matrix of occurrences of each word in the documents.
Other than TF, this study considers TF-IDF with bi-gram and tri-gram features [54] for feature selection. Even though humans can understand a language easily, it is often very complicated to train a model on text. For this purpose, various patterns or features are used to train a model. Words convey a different meaning when considered alone than when they are joined with other words. A word considered alone and independent of other words is called a uni-gram; for example, 'one great app' has three uni-grams: 'one', 'great', and 'app'. N-gram refers to a sequence of N items from a given sample, where N can be 1, 2, 3, etc. For the above-mentioned sentence, the bi-grams would be 'one great' and 'great app'. N-gram features are reported to show better performance for review classification [55,56]. TF-IDF is a well-known algorithm that is used to transform text into a feature vector. It works by assigning lower weights to the most common words occurring in all text documents while giving importance to words that appear only in a subset of documents. This technique is used for feature extraction in various NLP applications. TF indicates how important a term is within a given document, while IDF reflects how many documents contain that term.
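To make the n-gram and TF-IDF descriptions concrete, the following sketch computes n-grams, raw term frequencies, and TF-IDF weights with the standard library only. It is an illustration under stated assumptions: the function names are hypothetical, the example reviews are made up, and the weighting uses the textbook form idf = log(N / df), whereas libraries such as scikit-learn apply smoothed variants by default.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency(tokens, n=1):
    """Raw count of each n-gram in one document (the TF feature)."""
    return Counter(ngrams(tokens, n))

def tf_idf(documents, n=1):
    """Per-document TF-IDF weights with unsmoothed idf = log(N / df)."""
    tfs = [term_frequency(doc.split(), n) for doc in documents]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tf in tfs for term in tf)
    total = len(documents)
    return [{t: c * math.log(total / df[t]) for t, c in tf.items()} for tf in tfs]

docs = ["one great app", "great app", "bad app"]  # hypothetical reviews
print(ngrams(docs[0].split(), 2))
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus 'app' occurs in every review, so its weight is zero, while 'one' occurs in a single review and receives the largest weight in the first document. This mirrors the down-weighting of common terms described above.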

Performance Evaluation Parameters
A large variety of metrics is available for evaluating the performance of classifiers [57]; however, four of them are among the most widely used: accuracy, precision, recall, and F1 score. Four basic terms need to be understood to grasp these metrics: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) [58]. Accuracy measures the performance of a classifier as the percentage of reviews that are predicted correctly. Using the above-mentioned terms, accuracy can be calculated using:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Recall is often referred to as the completeness of a classifier. It gives the proportion of actual positives that are identified correctly. It is also called sensitivity and can be calculated using the following formula:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision shows the exactness of a classifier. It gives the percentage of samples labeled positive that are actually positive. It is calculated with the following equation:

$$\text{Precision} = \frac{TP}{TP + FP}$$

The F score is a statistical measure that takes both precision and recall into account and yields a score between 0 and 1. The closer the value is to 1, the better the performance of the classifier. F1 is calculated as:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
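The four metrics can be computed directly from the TP, FP, TN, and FN counts, as in the following minimal sketch. The function name `classification_metrics` and the counts are hypothetical and are not results from this study. For a three-class problem such as the one considered here, the counts would be taken per class by treating that class as positive and the remaining classes as negative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=80, fp=10, tn=95, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The guards against zero denominators are a design choice of this sketch; they return 0.0 when a metric is undefined, for example when no sample is predicted positive.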

Experiment Settings
The selected machine learning classifiers, as well as the proposed VC, are first executed to analyze the initial results, and the parameters are then fine-tuned to increase performance. Empirical setting of these parameters resulted in enhanced performance from the selected classifiers. The results discussed in Section 4 are the best results obtained from these classifiers and are based on the parameters shown in Table 2.
RF is executed with two control parameters: max_depth and random_state. The former sets the maximum depth of the trees that will be created, i.e., the longest path from the root node to a leaf. Learning an optimal decision tree (DT) is known to be NP-complete under several aspects of optimality. Practical DT algorithms are therefore heuristic: a locally optimal decision is taken at each node, so a globally optimal decision tree is not guaranteed. For this reason, multiple trees are trained in an ensemble classifier and features are sampled randomly. The latter parameter of RF, random_state, controls the random choices made during such training. For LR, the parameter C defines how much we want to avoid misclassifying each training example; for smaller values of C, the number of misclassified examples is higher, and vice versa. The maximum iterations parameter defines the maximum number of iterations allowed for the optimization process. The parameter 'tol' is the tolerance for the stopping criterion. The penalty parameter specifies the regularization technique used for the model. L2 is the penalty used in 'Ridge Regression', which adds the 'squared magnitude' of the coefficients as a penalty to the loss function of the model. The parameter 'fit_intercept' is set to 'True', which adds an intercept term to the regression model. The 'solver' parameter defines the algorithm used for the optimization problem; it is set to 'lbfgs', which supports both the 'L2' penalty and the multinomial loss.
For GBM, the maximum depth is set to 10 and the learning rate to 0.4. The parameter 'n_estimators', which defines the number of boosting stages, is set to 100; a larger number of estimators usually gives better performance. For SVM, 'cache_size' defines the size of the kernel cache (in MB). The 'decision_function_shape' parameter defines whether to return a 'ovr' (one-vs-rest) or 'ovo' (one-vs-one) decision function; we set it to 'ovr'. For 'degree', we used the default value, i.e., 3, and for 'gamma' the default, which uses 1/(n_features * X.var()) as its value. Kernel methods map non-linear observations into a higher-dimensional space to make them separable. Various kernels are used in machine learning models, including linear, Gaussian, and neural kernels. For SVM, we used the 'linear' kernel, which is suitable when the data is linearly separable. The maximum number of iterations is set to −1, which means that there is no limit on iterations, and the 'shrinking' heuristic is set to 'True'.
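The GBM and SVM settings stated above can be expressed with scikit-learn as in the following sketch. Several choices here are assumptions rather than details confirmed by the text: the use of scikit-learn's VotingClassifier, soft voting (which is why `probability=True` is set on the SVM), the `random_state` values, the TF-IDF uni-gram features, and the toy reviews, which are hypothetical and not drawn from the study's dataset. Parameters not stated in the text, such as C and cache_size, are left at the library defaults.

```python
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy reviews, repeated so that every class has several samples.
reviews = [
    "great app", "nice and good", "love it",
    "bad app", "terrible crashes", "hate it",
    "it is okay", "average app", "works sometimes",
] * 5
labels = (["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3) * 5

vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # uni-gram TF-IDF features
X = vectorizer.fit_transform(reviews)

# GBM with the values stated in the text; random_state is an assumption for reproducibility.
gbm = GradientBoostingClassifier(
    max_depth=10, learning_rate=0.4, n_estimators=100, random_state=0
)
# SVM with the values stated in the text; probability=True is only needed for soft voting.
svm = SVC(
    kernel="linear", degree=3, gamma="scale", decision_function_shape="ovr",
    max_iter=-1, shrinking=True, probability=True, random_state=0,
)

# Assumption: soft voting averages the class probabilities of the two base models.
ensemble = VotingClassifier(estimators=[("gbm", gbm), ("svm", svm)], voting="soft")
ensemble.fit(X, labels)

prediction = ensemble.predict(vectorizer.transform(["great app"]))
print(prediction[0])
```

With only two base models, hard (majority) voting would have to break every disagreement by a fixed tie rule, which is one reason this sketch uses soft voting; the voting scheme actually used for GBSVM should be taken from the paper's model description.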

Results and Discussions
This section presents the experimental results obtained in this research along with their discussion. Experiments are conducted using SVM, GBM, RF, and LR classifiers to classify sentiments into positive, negative, and neutral classes. The data is split 70:30 for training and testing, respectively. Although the proposed GBSVM is composed of GBM and SVM, these techniques are also tested individually to analyze their performance. Consequently, four techniques, namely GBM, SVM, RF, and LR, are analyzed for their efficacy against the proposed approach. We did not use cross-validation for the experiments. Although k-fold cross-validation is more precise from a theoretical perspective, it is also more computationally expensive. In light of the research [59], which states that cross-validation is common practice but that it is sometimes better not to use it, we did not use it.

TF and the uni-, bi-, and tri-gram variants of TF-IDF are used as features for classification. Results obtained using TF features are shown in Table 3. The results demonstrate that the proposed GBSVM performs well when used with TF features. The underlying reason is the combination of GBM and SVM, where GBM combines weak learners into a strong predictor. The learning rate applied is 0.1, which gives fairly accurate results, and SVM is used with a linear kernel and so performs faster and more accurately on the categorical data. As the dataset contains a higher number of positive class instances, the precision for the positive class is quite good.
Similarly, GBSVM outperforms the other machine learning classifiers when TF-IDF features are used for sentiment classification and gives an accuracy of 92%. As before, precision for the positive class is comparatively higher than for the negative and neutral classes, primarily on account of the larger number of training samples for the positive class. The F1 score, which considers both precision and recall, is also higher for GBSVM than for the other classifiers. The results shown in Table 4 are for uni-gram TF-IDF; bi-gram TF-IDF has also been used, and its results are shown in Table 5. Results with bi-gram TF-IDF features reveal that the performance of all classifiers degrades considerably. Theoretically, a higher-order n-gram model contains more information on a word's context, which can lead to model overfitting. This happens when the data is sparse, i.e., when there is a relatively large number of tokens but the frequency of each token is low. In such scenarios, a low-order n-gram model can perform better than a high-order n-gram model, which is the case with the current dataset. This is further corroborated by the results obtained when TF-IDF tri-gram features are used for sentiment classification, shown in Table 6. The accuracy values obtained from the different classifiers decrease further with tri-gram features. The most probable reason is the nature of the data used during the training phase. In many cases, bi- and tri-grams perform worse than uni-grams because the extra features they add may lead to overfitting. Another reason is the small sample of training data: classifiers are likely to encounter unseen tri-grams in the test data, which reduces performance. Often the data contains only single words, which leads to better performance of uni-gram than of bi- and tri-gram models. The selected data mostly consists of single words like 'great', 'nice', and 'good', so training on these results in higher accuracy when uni-grams are used, and the results become gradually poorer as we move from bi-grams to tri-grams. That is the reason the performance of the selected classifiers has degraded; even so, GBSVM performs better than the other classifiers.
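The sparsity argument above can be illustrated with a small plain-Python sketch that measures what fraction of the n-grams in a test set never occurred in the training set. The helper names `ngram_counts` and `unseen_fraction` and the short reviews are hypothetical; they only mimic the short, often single-word reviews described in the text and are not drawn from the study's dataset.

```python
from collections import Counter


def ngram_counts(documents, n):
    """Count every n-gram (n consecutive whitespace tokens) over all documents."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def unseen_fraction(train_docs, test_docs, n):
    """Fraction of test n-gram occurrences that never appear in the training data."""
    seen = set(ngram_counts(train_docs, n))
    test_counts = ngram_counts(test_docs, n)
    total = sum(test_counts.values())
    if total == 0:
        return 0.0
    unseen = sum(count for gram, count in test_counts.items() if gram not in seen)
    return unseen / total


# Hypothetical short reviews; single-word reviews yield no bi- or tri-grams at all.
train = ["great", "nice app", "good", "one great app", "not a good app"]
test = ["great app", "nice", "a good game app"]

fractions = {n: unseen_fraction(train, test, n) for n in (1, 2, 3)}
for n, fraction in fractions.items():
    print(f"{n}-gram unseen fraction: {fraction:.2f}")
```

On this toy data the unseen fraction rises with the n-gram order, which is the effect the paragraph attributes to a small training sample: higher-order features seen at test time have rarely or never been observed during training.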

Performance Analysis of the Proposed GBSVM
The proposed GBSVM is also compared against other similar models. The authors in [9] perform tweet classification based on user sentiments with a VC based on LR and SGDC. Three features have been investigated as well, namely TF, TF-IDF, and word2vec. The VC in [9] performs better with TF and TF-IDF, which yield an accuracy of 78.9% and 79.1%, respectively. When the approach in [9] is tested on the selected dataset with TF-IDF uni-gram features, it achieves an accuracy of 88.01%. In contrast, the accuracy of the proposed GBSVM is 93.0%, which is much better than that of [9].
Similarly, another work [60] proposes the combination of NB, Rocchio, and KNN for text classification. This voting classifier predicts 89.23% of the samples accurately. However, when applied to the selected dataset, its accuracy drops to 73.0%, while GBSVM gives an accuracy of 93.0%. The results in Table 7 show that the proposed GBSVM performs better than both [9] and [60].

Analyzing the Performance of GBSVM on Additional Dataset
Since the proposed GBSVM has so far been tested on only one dataset, the results cannot be generalized. To show its capability, it is tested on another dataset, the '20 newsgroup dataset', which consists of a total of 18,000 records and comprises 20 classes [61]. This dataset has been used in two research works [62,63]. Research [62] utilizes a Graph Convolutional Network (GCN) and a Simple Graph Convolutional Network (SGCN) for classification.
Similarly, another work [63] proposes a Neural Attentive Bag-of-Entities (NABoE) model for the classification of the twenty classes in the dataset. NABoE is a neural network model that performs text classification using the entities in a knowledge base. Results from [62,63] are compared against the proposed GBSVM, and the performance is shown in Table 8. The results show that the proposed GBSVM outperforms these models and achieves higher accuracy.

Conclusions
The rise and widespread use of social media has opened new ways of expressing opinions and sentiments on social platforms like Twitter, Facebook, etc. It has fueled interest in sentiment analysis, as finding the correct sentiments in text has become an important tool for individuals and companies to devise and revise products and services for increased customer satisfaction. In this paper, a sentiment analysis approach is devised which performs voting over two base models, GBM and SVM. Its performance is tested against four machine learning models: GBM, SVM, LR, and RF. Experimental results on the Google app dataset show that the proposed GBSVM outperforms these machine learning classifiers. Additionally, TF and three variants of TF-IDF (uni-, bi-, and tri-gram) are investigated for their suitability as classification features, which reveals that uni-gram TF-IDF performs better than TF and bi- and tri-gram TF-IDF. However, these results are not conclusive, as a larger dataset may affect them and bi-gram and tri-gram features may perform better with a larger dataset; investigating this is intended as future work. Further improvement in accuracy is possible with more balanced data, where the numbers of training samples for the positive, negative, and neutral classes are approximately similar. A performance comparison of GBSVM with four similar models shows that it performs better and achieves higher accuracy.