Semantic Features for Optimizing Supervised Approach of Sentiment Analysis on Product Reviews

The growth of ecommerce has triggered online reviews as a rich source of product information. Revealing consumer sentiment from the reviews through Sentiment Analysis (SA) is an important task of online product review analysis. Two popular approaches of SA are the supervised approach and the lexicon-based approach. In supervised approach, the employed machine learning (ML) algorithm is not the only one to influence the results of SA. The utilized text features also handle an important role in determining the performance of SA tasks. In this regard, we proposed a method to extract text features that takes into account semantic of words. We argue that this semantic feature is capable of augmenting the results of supervised SA tasks compared to commonly utilized features, i.e., bag-of-words (BoW). To extract the features, we assigned the correct sense of the word in reviewing the sentence by adopting a Word Sense Disambiguation (WSD) technique. Several WordNet similarity algorithms were involved, and correct sentiment values were assigned to words. Accordingly, we generated text features for product review documents. To evaluate the performance of our text features in the supervised approach, we conducted experiments using several ML algorithms and feature selection methods. The results of the experiments using 10-fold cross-validation indicated that our proposed semantic features favorably increased the performance of SA by 10.9%, 9.2%, and 10.6% of precision, recall, and F-Measure, respectively, compared with baseline methods.


Introduction
With the rapid growth of ecommerce platforms for online shopping, more and more customers share their opinions about products on the internet. This fact has generated a huge amount of opinion data within the platform [1]. The opinion data has then emerged as a valuable and objective source of product information for both customers and companies. For customers, it helps them by recommending that they buy a certain product [2]. For companies, it can help them in evaluating the design of a product [3] based on the analysis of user generated content (UGC), i.e., product reviews describing the user's experience [4]. With the quantity of the data, manual processing is not an efficient task. Alternatively, a big data analytics technique is necessary [5]. Sentiment Analysis (SA) has arisen in response to the necessity of processing the huge data in speed [6]. SA is a computational technique to automate the extraction of subjective information, i.e., opinion of customers with respect to a product [7]. For that reason, this study is important.
One of the most popular methods employed for SA tasks is the machine learning (ML)-based method [8], i.e., a supervised approach employing an ML algorithm. Although the role of the ML algorithm is important, it is not yet the only factor that determines the performance of SA. As a text classification task, another important factor influencing the SA result is the employed text features [9].

1.
Augmenting the results of supervised SA tasks for online product reviews by proposing a method for extracting sentiment features that takes into account semantics of words. The proposed method is the extension of [10].

2.
Evaluating the semantic features in various ML algorithms and feature selection methods.

3.
Finding the best set of the features using several feature selection methods.
In assigning the correct sense of the words in a review sentence, an adapted WSD method was used. In this method, the sense is picked from the WordNet lexical database. To calculate the numeric value of the feature, one of three sentiment values of the sense, i.e., positivity, negativity, and objectivity, is then picked from the SentiWordNet database [12]. Employing several WordNet similarity algorithms, we present a method to generate a semantic feature set of words. To evaluate the performance of our proposed semantic features, several ML algorithms and feature selection methods were employed. The employed ML algorithms and feature selection methods are implemented in WEKA, an open source tool containing algorithms for data mining applications [15]. The results of these experiments using 10-fold cross-validation indicated that our proposed semantic features favorably enhanced the performance of SA in terms of precision, recall, and F-Measure. The rest of this manuscript is organized in the following sections. Section 2 explores the most recent related study that has previously been done. Section 3 describes the method for extracting semantic features of words, including the formulas that are introduced. We explain the scenario of the experiments and the results in Section 4. The results of the experiments are discussed in Section 4. Finally, we highlight the effectiveness of our proposed method in Section 5.

Related Work
The expansion of online shopping has triggered consumer to express their opinion about a product they have purchased on ecommerce platform. SA is an efficient text mining technique to extract the opinion from online product reviews [16]. Two types of approaches that is commonly employed to perform SA task, i.e., supervised approach and lexicon based approach [17]. Supervised approaches make use of labeled training data to learn the model. Using an ML algorithm, the model is then performed to test the dataset. Using Hybrid ML approach, Al Amrani [18] performed SA on an Amazon review dataset. A number of decision trees at randomly selected features were firstly picked to forecast the class of test dataset. Support Vector Machine (SVM) was then used to maximize the margin that separates two classes. Since RF was not sensitive to input, the default parameters were used for each classifier. The method was applied to the Amazon review dataset.
To optimize the SA task, Singh [19] employed four ML classifiers, i.e., Naïve Bayes, J48, BFTree, and OneR. NLTK and bs4 libraries were used for preprocessing of raw text. Using three manually annotated datasets, i.e., Woodlan's wallet, digital camera reviews, and movie reviews from IMDB, the robustness of the classifiers was compared. WEKA 3.8 was used for implementing the classifiers. The results of the experiment confirmed that OneR was the most prominent in accuracy.
Conditional Random Field (CRF) and SVM was employed for sentiment classification of online reviews [16]. CRF was used to extract emotional fragment of the review from a Unigram features of the text. SVM transformed data that is not linearly separable into linearly separable dataset in the feature space through nonlinear mapping. The proposed method was evaluated using Chinese online reviews from Autohome and English online reviews from Amazon. To segment the review, the jieba library of Python was employed. The results of the experiment indicated that average accuracy achieved was 90%.
A study formed a classifier ensemble consisting of Multinomial Naïve Bayes, SVM, Random Forest, and Logistic Regression to improve the accuracy of SA. The utilized feature was BoW represented by a table in which the column represents the term of the document, and the values represent their frequencies. The sentiment orientation was determined using majority voting and the average class probability of each classifier. Using four benchmarks of the Twitter dataset, i.e., Sanders, Stanford, Obama-McCain Debate, and HCR, the experiment revealed that the method can improve the accuracy of SA tasks on Twitter.
Mukherjee [20] optimized the SA approach for product reviews by developing a system to extract both potential product features and the associated opinion words. In terms of SA tasks, product features extracted in this work are actually aspect. The pair of product features and their associated opinion words was extracted by making use of grammatical relation provided by the Stanford Dependency Parser. The evaluation was conducted using two datasets from [21,22]. The system outperformed several baseline systems. Using two scenarios of the experiment, i.e., rule-based classification and supervised classification using SVM, the results of the study confirmed that the supervised classification significantly outperformed the Naïve rule-based classification.
Meanwhile, a lexicon-based SA approach relies on a pre-built Sentiment Lexicon, i.e., a pre-built list of sentiment terms with their associated sentiment value that is publicly available, e.g., Opinion Lexicon, General Inquirer, and SentiWordNet [12]. The sentiment of the term's overall document is then aggregated to determine its sentiment orientation. The main challenge of the lexicon-based approach is to improve a term's sentiment value with respect to a specific context [23] since the same term can have different sentiment values when it appears in different contexts [24].
In that regard, Saif [11] has proposed a method to learn sentiment orientation of words from their contextual semantics. Based on a hypothesis that words appearing in similar contexts tend to share similar meanings, the work proposed a method called SentiCircle. The method was tested on several benchmarks of the Twitter dataset. The results of the experiment indicated that SentiCircle outperformed several baseline methods.
A study has improved the implementation of SentiCircle in the online review dataset by considering semantics of words [10]. The study argued that the result of SentiCircle is valid only if the prior sentiment value of words is also valid so that θ of SentiCircle will adjust sentiment value of words in the right direction. Semantics of words is extracted using a WSD technique. The sentiment value of words was picked from SentiWordNet. The method has been proven more robust against several baseline methods.
From the previously described recent related works, several insights can be highlighted as follows: 1.
BoW is a common text features utilized for supervised SA tasks. To show the robustness of our proposed features against BoW, we will compare the performance of our proposed features with two baselines of BoW, i.e., Unigram (Baseline 1) and Bigram (Baseline 2).

2.
Since semantics has the potential to enhance the performance of SA, we extent a lexicon-based method [10] to extract text features that capture semantics of words based on its local context, i.e., sentence to provide correct sentiment value of words. We evaluate the performance in a supervised environment. We train the model for several ML algorithms, i.e., Naïve Bayes, Naïve Bayes Multinomial, Logistic, Simple Logistic, Decision Tree, and Random Forest.

3.
We also apply the feature selection method to find the best set of our features.

Proposed Method
All steps carried out in this study are described in Figure 1.
Step 1 was adopted from [13], which is an extension of [14]. This technique is called WSD. The aim of Step 1 was to assign a correct sentiment value to the words according to their contextual sense related to different neighboring words since the same words can appear in different parts of a text and may reveal different meanings, depending on the neighboring words. This problem is called polysemy. As shown in Figure 2, polysemy is a word with the same word form (WF) but a different meaning (WS). Some examples of polysemy can be seen in Table 1.

167
As shown in Table 1, the same sentiment word that appears in different review sentences  As shown in Table 1, the same sentiment word that appears in different review sentences potentially has a different sentiment orientation or value. This issue may interfere with the result of the SA. To address this problem, we employed a general purpose sentiment lexicon, namely SentiWordNet [4]. SentiWordNet specifies three sentiment values, namely positivity, negativity, and objectivity, to each synset of WordNet. More about SentiWordNet can be found at https://github.com/aesuli/sentiwordnet.

167
As shown in Table 1, the same sentiment word that appears in different review sentences 168 potentially has a different sentiment orientation or value. This issue may interfere with the result of 169 the SA. To address this problem, we employed a general purpose sentiment lexicon, namely 170 SentiWordNet [4]. SentiWordNet specifies three sentiment values, namely positivity, negativity, and 171 objectivity, to each synset of WordNet. More about SentiWordNet can be found at

175
In the pre-processing step, stopword removal, stemming, part of speech (POS) tagging, and 176 filtering were conducted. In the implementation, the Stanford POS tagger library was used. POS 177 tagging is necessary to pick the correct sense of the words from the WordNet collection. In the 178 filtering step, we left in only verbs, adjectives, and nouns. The local neighborhood used as the 179 context for the words was a review sentence. Therefore, the processing step of WSD was done on a 180 sentence basis.

Sense Sentiment Orientation
The girl runs to the school. move fast on foot Neutral He runs the multinational company direct or control Positive The land around here is flat.
having no variation in height Neutral The party is a bit flat.
uninteresting, boring Negative I enjoy the resolution of the screen.
get pleasure from Positive The company enjoys big profit.
have benefit from Neutral In the pre-processing step, stopword removal, stemming, part of speech (POS) tagging, and filtering were conducted. In the implementation, the Stanford POS tagger library was used. POS tagging is necessary to pick the correct sense of the words from the WordNet collection. In the filtering step, we left in only verbs, adjectives, and nouns. The local neighborhood used as the context for the words was a review sentence. Therefore, the processing step of WSD was done on a sentence basis.
After pre-processing (stop word removal, stemming, POS tagging, and filtering), the sense of the words from the WordNet collection was picked. As an example of the calculation of the proposed feature, consider the review sentence "I love the screen of the camera". The result of the pre-processing step can be seen in Table 2. Table 2. Pre-processing step.

Step Result
Picking Review Sentence I love the screen of the camera. After filtering, the left term is assigned to w i . In the case of the example, we have w 1 , w 2 , and w 3 . We then pick the senses of w i from the WordNet database, indicated as w j i , as shown in Table 3. In terms of the weighted graph, the senses of the words serve as the vertices of the graph. Adopting [25], the edges are then generated by calculating the similarity between the vertices using WordNet similarity measures from Wu and Palmer (WUP) [26], Leacock and Chodorow (LCH) [27], Resnik (RES) [28], Jiang and Conrath (JCN) [29], and Lin (LIN) [30]. Adapted Lesk [31] was incorporated to improve the result.
Suppose e cd ab is the similarity between w b a and w d c . All possible similarities between the word senses of the different words are calculated as shown in Table 4. For the implementation of WordNet similarity, the ws4j algorithm from Hideki Shima was adapted. The tool can be downloaded at https://ws4jdemo.appspot.com/. To select the contextual sense, the indegree score of each sense was computed using the indegree algorithm. For example, the indegree score of w 3 1 in Table 4, i.e., In w 3 1 , is calculated as follows: In w 2 2 = e 12 12 + e 22 12 + e 32 12 + e 21 23 . In general, the indegree score of ws b a is computed using Equation (1). The notation used in Equation (1) is described in Table 5. In Table 4, the similarity highlights with red are the similarities between word senses on the left side of ws 2 2 , and the similarity highlighted with green is the similarity of the word senses on the right side of ws 2 2 . The indegree scores of all ws j i are calculated, and the selected (contextual) sense is the sense of the word with the highest indegree score. The contextual sentiment value of a word, i.e., cs i , is determined using Equation (2), where f (m i ) assigns three sentiment values from SentiWordNet, called cspos i , csneg i , and csneu i . In Equation (3), u is the number of words in the processed review sentence.   Table 5. Notations used in Equation (1).

Notations Definitions
In where: At the sentence level, three average sentiment values are calculated using Equations (4)-(6), respectively.
Regarding the utilized WordNet similarity algorithm, the average value of every contextual sense in the review document was then calculated and assigned as a feature of the review document. Three average sentiment scores from SentiWordNet were assigned to each document review. Hence, a set of 15 contextual features was provided as the basis for the classification task. A detailed description of the features is presented in Table 6. Table 6. Details of the proposed features.

F1
Average positive score generated using WSD-WUP F2 Average negative score generated using WSD-WUP F3 Average neutral score generated using WSD-WUP F4 Average positive score generated using WSD-LCH F5 Average negative score generated using WSD-LCH F6 Average neutral score generated using WSD-LCH F7 Average positive score generated using WSD-RES F8 Average negative score generated using WSD-RES F9 Average neutral score generated using WSD-RES F10 Average positive score generated using WSD-LIN F11 Average negative score generated using WSD-LIN F12 Average neutral score generated using WSD-LIN F13 Average positive score generated using WSD-ADT F14 Average negative score generated using WSD-ADT F15 Average neutral score generated using WSD-ADT For assessing the proposed features, an experiment was performed using five ML algorithms, namely Naïve Bayes, Logistic, Simple Logistic, Decision Tree, and Random Forest. We also applied five different feature selection methods, i.e., correlation-based feature selection (CFS), correlation attribute evaluator, information gain attribute evaluator, one rule attribute evaluator and principle component analysis, in order to test the performance of the proposed feature set in various different optimized combinations. For the implementation of the ML algorithms and the attribute selection methods, we used WEKA from the University of Waikato. The toolkit can be found at https: //www.cs.waikato.ac.nz/ml/weka/. The employed attribute selection methods are briefly overviewed in the following paragraphs.

Correlation Feature Selection (CFS)
CFS argues that representative features are features that are highly correlated with the class, yet uncorrelated to each other. Therefore, it measures the individual predictive ability of each subset along with the degree of redundancy between them. CFS presents a measure called the 'merit' of the feature subsets. It predicts the correlation between a composite test consisting of the summed components and the outside variable using the standardized Pearson's correlation coefficient [32], as shown in Equation (7).
In Equation (7), r zc is defined as the correlation between the summed components and the outside variable, k is the number of components, r zi is the average of the correlation between the components and the outside variable, and r ii is the average inter-correlation between the components.

Correlation Attribute Evaluator
This method calculates the weighted average of the overall Pearson correlation coefficient between an attribute and its class as an indicator for ranking a set of attributes. For x ∈ X and c ∈ C, where X is the feature subset and C is the class, and Pearson's correlation coefficient is given by Equation (8).

Information Gain Attribute Evaluator
Information gain, also called mutual information, can also reveal dependency between features by calculating the level of impurity in a group of samples. This technique provides a ranking of attributes by evaluating the information gain of an attribute with respect to the class. The information gain of X for C is the class, and X is the attribute subset given by Equation (9).

OneR Attribute Evaluator
The one rule attribute evaluator rates the values of the attributes based on a simple yet accurate classifier called the one rule classifier [33], which classifies a dataset based on a single attribute, i.e., a one-level decision tree. OneR chooses the rule with the smallest error rate from previously built rules for every attribute in the training data.

Principle Components
The algorithm attempts to find the axis of greatest variance of the data. In the implementation, this attribute evaluator calculates the eigenvector of the covariance matrix of the data and filters out the attributes with the worst eigenvectors.

Result and Discussion
In the experiment, we used three product review datasets from Amazon Review Data provided by Julian McAuley [23], i.e., Beauty, Books, and Movies. The dataset can be downloaded from http://jmcauley.ucsd.edu/data/amazon/. For building the ground truth, we assigned a label of three sentiment categories, i.e., positive, negative, and neutral, for every product review document by taking the 'overall' score from the metadata of the dataset (see the sample review in Figure 3.). The datasets with an 'overall' score of 1-2 were labeled as negative reviews. Meanwhile, the datasets with an 'overall' score of 4-5 were labeled as positive. The rest was labeled as neutral.
by Julian McAuley [23], i.e., Beauty, Books, and Movies. The dataset can be downloaded from http://jmcauley.ucsd.edu/data/amazon/. For building the ground truth, we assigned a label of three   We present the results of the experiment using the three Jean McAuley Amazon datasets in Tables 7-9. Our proposed features were evaluated using Naïve Bayes, Naïve Bayes, Logistic, Simple Logistic, J48, and Random Forest. The combination of features was optimized using CFS, correlation attribute evaluator, information gain attribute evaluator, one-rule attribute evaluator, and principle component analysis, and then we compared the results with the full feature set. The performance metrics of the selected features were evaluated using 10-fold cross-validation.   To investigate the role of semantic features in enhancing the result of the supervised SA method, the performance of our proposed features was compared with a baseline feature. For the baseline, we applied Unigram features, i.e., common features used for text classification, extracted using an unsupervised filter from StringtoWordVector in WEKA. In the experiment, we also applied the same feature selection method for the baseline. For the performance metrics, precision, recall, and F-Measure were employed. The whole experiment was conducted using 10-fold cross-validation with WEKA. The results of the experiment are presented in Tables 7-9. We compared the result of experiment with two baseline methods, i.e., Unigram (Baseline 1) and Bigram (Baseline 2), of BoW that are commonly adopted for supervised SA tasks.
In Tables 7-9, we provide the experiment results for the three product review datasets, i.e., Beauty, Books, and Movies dataset. In each table, we compare the performance of the proposed features with two baselines, namely Unigram and Bigram, that are commonly employed for text classification tasks. The results are grouped based on the employed feature selection method. For every feature selection method, we applied all used ML algorithms. To indicate the best performance achieved for precision, recall, and F-Measure, we used the asterisk symbol as presented in the Tables. For the Beauty dataset, the best performance of the proposed features is 0.897, 1.000, and 0.945 for precision, recall, and F-Measure, respectively. For the Books dataset, Principal Component Analysis (PCA) achieved the best performance by 0.916, 0.882, and 0.895 for precision, recall, and F-Measure, respectively. Meanwhile, for the Movies dataset, features selected using CFS reached the best performance by 0.943, 1.000, and 0.970 for precision, recall, and F-Measure, respectively.
In Figure 4, we calculated the average performance of our proposed features in terms of precision, recall, and F-Measure. Compared with the baseline feature that is commonly employed in the supervised SA task, i.e., BOW, the extracted semantics features have favorably increased all performance metrics of the supervised SA task, i.e., precision, recall, and F-Measure, as indicated in Figure 4. In the figure, we present average evaluation metrics for the three datasets. The figure indicates that the proposed semantic features outperform both Unigram and Bigram in all performance metrics. On average, the proposed features increased precision, recall, and F-Measure by 10.9%, 9.2%, and 10.6% of those compared to baseline methods.

312
To find the best set of the features, we calculated the average performance of our semantic 313 features for every feature selection method, as presented in Figure 5. The results of the experiment 314 presented in Figure 5 confirmed that the best semantic feature set is one that is selected using CFS,

315
i.e., F1-6. Features selected using PCA are in second place. We also highlight that the employed  To find the best set of the features, we calculated the average performance of our semantic features for every feature selection method, as presented in Figure 5. The results of the experiment presented in Figure 5 confirmed that the best semantic feature set is one that is selected using CFS, i.e., F1-6. Features selected using PCA are in second place. We also highlight that the employed WordNet similarity algorithms have a dominant role in determining the correct sentiment value of a term. Assigning incorrect sentiment values results in extracting contextual features that potentially lead to misclassification.

322
The limitation of this study is that the performance of the adopted similarity algorithm is not 323 what was expected. As an example, we calculate semantic similarity of love#v#1, hate#v#1, and 324 like#v#1 using , , and ℎ. Naturally, we expected that love#v#1 and like#v#1 should have 325 greater semantic similarity than love#v#1 and hate#v#1. Sense of those words can be seen in Table   326 10. Yet, the result was not what we wished, as indicated in Table 11. In the future, we plan to 327 propose a robust similarity algorithm to augment the result of a supervised SA task. Prefer or wish to do something

331
The implications comprise practical implication for both online marketers and customers, as 332 well as academic implications for the researcher in the field of text processing. The results of study 333 affirm that our proposed SA technique can be employed to generate quantitative ratings from 334 unstructured text data within the product review [34]. The online marketers could, therefore, apply 335 the technique to foresee consumer satisfaction toward a certain product [35]. Meanwhile, for 336 potential customers, a big data recommender system could possibly be built accordingly to provide 337 recommendation about the intended product they want to purchase [2]. The finding of this work 338 could also be beneficial for researchers in the field of text processing to further explore more 339 Figure 5. Average performance of semantic features for every feature selection method overall datasets.
The limitation of this study is that the performance of the adopted similarity algorithm is not what was expected. As an example, we calculate semantic similarity of love#v#1, hate#v#1, and like#v#1 using wup, jcn, and lch. Naturally, we expected that love#v#1 and like#v#1 should have greater semantic similarity than love#v#1 and hate#v#1. Sense of those words can be seen in Table 10. Yet, the result was not what we wished, as indicated in Table 11. In the future, we plan to propose a robust similarity algorithm to augment the result of a supervised SA task. Prefer or wish to do something

Implications
The implications comprise practical implication for both online marketers and customers, as well as academic implications for the researcher in the field of text processing. The results of study affirm that our proposed SA technique can be employed to generate quantitative ratings from unstructured text data within the product review [34]. The online marketers could, therefore, apply the technique to foresee consumer satisfaction toward a certain product [35]. Meanwhile, for potential customers, a big data recommender system could possibly be built accordingly to provide recommendation about the intended product they want to purchase [2]. The finding of this work could also be beneficial for researchers in the field of text processing to further explore more sophisticated semantic features of words. Previous work has also confirmed the robustness of semantic features [10].

Conclusions
This paper proposed a set of contextual features for SA of product reviews, generated using an extended WSD method. Several ML algorithms were employed to evaluate the performance of the proposed features in a supervised SA task. To find the subset that provides the best performance metrics, several feature selection methods were applied, i.e., CFS, correlation attribute evaluator, information gain attribute evaluator, one-rule attribute evaluator, and principle component analysis.
This study contributes to improving the performance of supervised SA tasks by proposing a method to extract semantic features of the product review dataset. The results of the cross-validated experiment in a supervised environment using several ML algorithms and feature selection methods has confirmed that our proposed semantic features favorably augment the performance of SA in terms of precision, recall, and F-Measure. Another finding of this study summarizes the robustness of semantic feature set selected using CFS and PCA.