Analysis of Machine Learning Algorithms for Opinion Mining in Different Domains

Sentiment classification (SC) is the central task of sentiment analysis (SA), a subfield of natural language processing (NLP) that is used to decide whether textual content expresses a positive or negative review. This research focuses on the machine learning (ML) algorithms used to analyze sentiment and mine reviews in different datasets. Overall, an SC task consists of two phases. The first phase deals with feature extraction (FE). Two FE algorithms, n-gram and Term Frequency-Inverse Document Frequency (TF-IDF), are applied in this research. The second phase covers the classification of the reviews using various ML algorithms. These are Naïve Bayes (NB), Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Passive Aggressive (PA), Maximum Entropy (ME), Adaptive Boosting (AdaBoost), Multinomial NB (MNB), Bernoulli NB (BNB), Ridge Regression (RR) and Logistic Regression (LR). PA with unigram features performs best among the algorithms on all datasets used (IMDB, Cornell Movies, Amazon and Twitter), with values that range from 87% to 99.96% for all evaluation metrics.


Introduction
Sentiment analysis, also known as opinion mining (OM), is defined as determining the attitude of individuals toward particular topics and news [1]. It is also the computational treatment of opinions, feelings, and subjectivity in textual content [2]. Nowadays, before purchasing a product, individuals inspect feedback and online posts about it, which express the opinions, emotions, feelings, attitudes, considerations, beliefs or conduct of clients. Sentiment analysis is the task of automatically classifying the sentiment orientation (positive or negative) of textual content. It can be used to classify online products as recommended or not recommended (satisfied or not satisfied) [3].
There are two fundamental OM approaches for deciding whether review sentences are positive or negative [4]. The first is the machine learning (ML) approach, which is used very frequently in sentiment analysis (SA) and comprises supervised, unsupervised, and semi-supervised learning. In supervised learning, the dataset is labeled so that a reasonable and sensible output can be learned. Unlike supervised learning, unsupervised learning does not need labeled data; to process unlabeled data, clustering algorithms are used [5]. The second approach makes use of an existing lexicon or dictionary of words, expressions, or phrases, each labeled as either positive or negative [6,7]. These approaches were developed in response to real-life needs. This study presents the impact of supervised learning algorithms on different labeled datasets.
Binary sentiment classification (SC) and multi-class SC are the most frequently used forms of SC [8]. In binary SC, each document or review in the dataset is classified into one of two classes, positive or negative sentiment [9]. In multi-class SC, each document can be classified into more than two classes that express the degree of the sentiment: strongly positive, positive, neutral, negative or strongly negative [10].
SA is categorized into three main levels: the aspect or feature level (AL), the sentence level (SL) and the document level (DL) [11]. The AL classifies the sentiments expressed on various features or aspects of an entity. At the SL, the fundamental concern is to decide whether each sentence conveys a positive, negative or neutral opinion. At the DL, the basic concern is to classify whether the opinion of a whole document conveys a positive or negative sentiment. SL and DL analyses are insufficient to monitor precisely what people accept and reject. This research focuses on the document level of sentiment analysis.
The rest of this paper is organized as follows: Section 2 presents a literature review of SA. Section 3 describes the characteristics of the different datasets. In Section 4, the proposed methodology of SA is explained. Section 5 discusses the results of the proposed methodology on different datasets. Section 6 concludes the research and introduces the scope and extension for future work.

Literature Review
This section presents related work, which has investigated SA in various domains such as movie reviews, product reviews, and Twitter.
Tripathy et al. [5] classified movie reviews using different ML algorithms: naïve Bayes (NB), maximum entropy (ME), stochastic gradient descent (SGD), and support vector machine (SVM). These algorithms were applied to the IMDB dataset using different combinations of n-grams, for example, unigram and bigram, bigram and trigram, and unigram, bigram and trigram. Their results show that SVM achieved the highest accuracy, 88.9%, when unigram, bigram and trigram were combined for feature extraction.
Deng et al. [12] applied SVM combined with Importance of a Term in a Document (ITD) to extract features on various datasets: the Cornell movie reviews, the Amazon product reviews and the Stanford movie reviews. Their approach clearly beats Best Matching (BM25) on two of the three datasets, while the difference is not significant on the small Cornell movie review dataset, where the accuracy is 88.50%. The accuracy of the combined algorithm is 87.44% on the Stanford movie review dataset, which is greater than that of BM25 (87.10%). Furthermore, the accuracy of the combined algorithm is 88.70% on the Amazon product reviews.
Suresh and Raj [13] presented a novel model for Twitter SA of a specific product and analyzed four commonly used supervised ML algorithms (NB, Decision Tree J48, SVM and ME). Among the applied algorithms, J48 proved to be the most efficient, with an overall accuracy of 92%.
Joshi and Vekariya [14] presented a practical approach to Twitter SA. The SVM algorithm with Part of Speech (POS) features performs very well in SA, with an accuracy of 92%.
Wawre and Deshmukh [15] compared two supervised ML algorithms (NB and SVM) for SA on the IMDB movie reviews dataset. NB achieved an accuracy of 65.57%, compared with 45.71% for SVM.
Liu et al. [16] introduced a basic framework for SA on the Cornell movie review dataset using NB with the Hadoop framework. The results show that the NB classifier scales up adequately. The accuracy achieved is below 82%.
In addition to machine-learning-based sentiment analysis, there are other approaches to SA that are based on a lexicon. Many researchers have applied sentiment analysis using a dictionary or corpus [6,7].

Datasets
Four different datasets were examined: the IMDB dataset [17], the Cornell dataset [18], the Amazon products dataset [19], and the Twitter dataset [20,21]. All of these English review datasets are balanced and contain only two sentiment classes, positive and negative.
The first dataset, IMDB movie review, was initially used by Reference [22] and has since been widely used and expanded as a benchmark dataset. The polarity dataset contains 12,500 positive and 12,500 negative processed movie reviews, which were extracted from the Internet Movie Database, with an average of 30 sentences in each document.
The second dataset is also a movie review dataset, collected from Cornell movie reviews [23]. It includes 1000 positive and 1000 negative movie reviews. The number of positive words is 372,016, of which 29,693 are distinct. The number of negative words is 330,463, of which 27,749 are distinct.
The third dataset is collected from Amazon product reviews. The dataset has 1000 reviews, divided into 500 positive and 500 negative. It contains 981 distinct positive words and 1125 distinct negative words out of a total of 5180 words.
Finally, the last dataset used is a Twitter dataset. It contains more than 1,500,000 processed tweets. However, in this research only 150,000 tweets are used for training and testing. The characteristics of the datasets used are summarized in Table 1.

The Proposed Methodology
In this research, experiments are carried out for binary SC, since this classification is important for many users who want to decide whether or not to buy a product, watch a movie, and so on. The proposed methodology for SA is a five-step process. In the first step, a dataset is selected from among the four different datasets. Pre-processing is performed in the second step, while the third step applies the FE algorithms to the selected dataset. Next, all ML algorithms are trained. Finally, the different ML algorithms are evaluated using 10-fold cross-validation. The proposed methodology is illustrated in Figure 1.
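The final evaluation step relies on splitting each dataset into ten folds. The following is a minimal sketch of how such folds could be formed, not the authors' implementation (the experiments use Scikit-learn); the function name `k_fold_indices`, the fixed seed, and the toy size of 25 samples are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# Hypothetical toy size; each review is tested exactly once across the folds.
splits = list(k_fold_indices(25, k=10))
```

Each classifier would be trained on `train_idx` and scored on `test_idx` for every split, and the ten scores averaged.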

Datasets Pre-Processing
The review datasets contain a large amount of feedback and opinions, which clients express in various ways. The four datasets used in this work are already labeled with a negative or positive polarity. Raw labeled data is highly susceptible to inconsistency, irregularity and redundancy, and the structure of the data influences the results. To improve the quality and performance of the classification process, the raw data needs to be pre-processed [24]. Pre-processing prepares the data by removing repetitive words, non-English characters and punctuation, which improves the quality of the data. It includes removing non-English letters, tokenizing, removing stop words, removing repeated characters, removing URLs and user mentions, removing hashtags and retweets for the Twitter dataset, and handling emoticons [25,26].
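The pre-processing steps listed above can be sketched with regular expressions. This is an illustrative sketch under stated assumptions rather than the pipeline used in the experiments: the stop-word list, the emoticon mapping, the choice to drop a hashtag together with its word, and the rule that shortens runs of three or more identical characters to two are all hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "to", "and", "of", "i"}

# Hypothetical emoticon mapping: emoticons are replaced by sentiment-bearing words.
EMOTICONS = {":)": " happy ", ":-)": " happy ", ":(": " sad ", ":-(": " sad "}


def preprocess(text, stop_words=STOP_WORDS):
    """Clean one review or tweet and return a list of tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    for emoticon, word in EMOTICONS.items():            # handle emoticons
        text = text.replace(emoticon, word)
    text = re.sub(r"\brt\b", " ", text)                  # remove the retweet marker
    text = re.sub(r"[@#]\w+", " ", text)                 # remove user mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                # keep English letters only
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # shorten runs, e.g. "soooo" -> "soo"
    tokens = text.split()                                # whitespace tokenization
    return [t for t in tokens if t not in stop_words]    # remove stop words


tweet_tokens = preprocess("RT @bob: I looooved this movie!!! http://t.co/x #great")
review_tokens = preprocess("Not good :(")
```

The order of the steps matters in this sketch: URLs and emoticons are handled before punctuation is stripped, otherwise their characters would be lost first.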

Feature Extraction
In order to apply ML algorithms to the SA datasets, it is essential to extract informative features from the textual content that lead to correct classification. The original textual content is typically represented as a Feature Set (FS), FS = (feature 1, feature 2, ..., feature n). In this research, two FE algorithms are applied: Term Frequency-Inverse Document Frequency (TF-IDF) and N-gram.

Term Frequency-Inverse Document Frequency Algorithm
The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is a standard weighting measure used in text classification. TF-IDF combines two factors: term frequency and inverse document frequency. Term frequency is obtained by counting how often a given term occurs in a given document. The inverse document frequency is calculated by dividing the total number of documents by the number of documents in which the term occurs; in practice the logarithm of this ratio is usually taken. When the two factors are multiplied together, the resulting score is highest for terms that occur frequently in a document but in few documents overall, and lowest for terms that occur in nearly every document. This allows us to find the terms that are significant in a document [27].
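A small worked example may make the weighting concrete. The sketch below assumes one common variant, relative term frequency multiplied by the natural logarithm of the document ratio; library implementations such as Scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers differ. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # relative term frequency * log(total documents / document frequency)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: "movie" occurs in every document.
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["movie", "plot"],
]
scores = tf_idf(docs)
```

In this toy corpus "movie" receives a weight of zero because it occurs in every document, while "great", which is frequent in the first document and absent elsewhere, receives the highest weight in that document.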

N-Gram Algorithm
N-grams capture textual context to some extent and are widely used in NLP tasks. Whether applying higher-order n-grams is beneficial is a matter of debate. Many researchers report that unigrams perform better than bigrams in classifying movie reviews by sentiment polarity; however, other researchers have found that on other review datasets bigrams and trigrams outperform unigrams [28].
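As an illustration of what unigram, bigram and trigram features look like, the following minimal sketch extracts word n-grams from a tokenized review. The helper name `ngrams` and the example sentence are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["not", "a", "good", "movie"]
unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triples of adjacent words
```

The bigram ("not", "a") and the trigram ("not", "a", "good") keep the negation attached to its context, which a unigram representation loses; on the other hand, higher-order n-grams are much sparser, which is one plausible reason their benefit varies between datasets.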

Machine Learning Classification
In this research, different ML algorithms are applied to four different datasets to determine their efficiency and applicability for text classification and to compare their evaluation accuracies [29,30,31]. The ML algorithms are NB, SGD, SVM, Passive Aggressive (PA), Maximum Entropy (ME), Adaptive Boosting (AdaBoost), Multinomial Naïve Bayes (MNB), Bernoulli Naïve Bayes (BNB), Ridge Regression (RR) and Logistic Regression (LR). These algorithms were chosen for their ability to handle text categorization problems, in which the number of features is very large. The experiments were conducted with Python 3.6 using two libraries, Scikit-learn [32] and the Natural Language Toolkit (NLTK) [33]. These libraries are free and give researchers ready access to ML algorithms.
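To give a sense of how one of these classifiers works, the sketch below implements the Passive Aggressive update in its PA-I form for binary labels. It is a plain-Python illustration and not the Scikit-learn implementation used in the experiments; the function names, the aggressiveness parameter `C=1.0`, the number of epochs, and the toy bag-of-words samples are all assumptions.

```python
from collections import defaultdict


def train_passive_aggressive(samples, epochs=5, C=1.0):
    """Train a binary PA-I classifier.

    samples: list of (features, label) pairs, where features is a
    {name: value} dict and label is +1 (positive) or -1 (negative).
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in samples:
            score = sum(w[f] * v for f, v in x.items())
            loss = max(0.0, 1.0 - y * score)  # hinge loss
            sq_norm = sum(v * v for v in x.values())
            if loss > 0.0 and sq_norm > 0.0:
                # Passive when the margin is satisfied; otherwise take the
                # smallest step that fixes the margin, capped at C.
                tau = min(C, loss / sq_norm)
                for f, v in x.items():
                    w[f] += tau * y * v
    return w


def predict(w, x):
    """Return +1 or -1 for a feature dict x."""
    return 1 if sum(w.get(f, 0.0) * v for f, v in x.items()) >= 0 else -1


# Hypothetical toy training set of unigram counts.
train = [
    ({"great": 1, "movie": 1}, 1),
    ({"boring": 1, "movie": 1}, -1),
    ({"great": 1, "plot": 1}, 1),
    ({"boring": 1, "plot": 1}, -1),
]
w = train_passive_aggressive(train)
```

On this toy data the weights for the neutral words "movie" and "plot" are driven back to zero, while "great" and "boring" end up with positive and negative weights respectively.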

Evaluation
K-fold cross-validation is used in the experiments with the various ML algorithms. In this research, 10-fold cross-validation is used to evaluate the performance of the ML algorithms [34]. The confusion matrix of the evaluation metrics is shown in Figure 2, where tPR stands for true positive reviews, tNR for true negative reviews, fPR for false positive reviews, and fNR for false negative reviews. The Precision (Prc.), Recall (Rcl.), Accuracy (Acc.) and F-score (F-s) are calculated to measure the performance of the applied algorithms. The equations of Prc., Rcl., Acc. and F-s are defined in Equations (1)-(4).
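The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of precision, recall, accuracy and F-score, which the equations referenced above are assumed to follow; the function name and the example counts are hypothetical, and the function assumes that no denominator is zero.

```python
def classification_metrics(t_pr, t_nr, f_pr, f_nr):
    """Compute Prc., Rcl., Acc. and F-s from confusion-matrix counts."""
    precision = t_pr / (t_pr + f_pr)  # correct share of predicted positives
    recall = t_pr / (t_pr + f_nr)     # share of actual positives that were found
    accuracy = (t_pr + t_nr) / (t_pr + t_nr + f_pr + f_nr)
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical counts for a balanced test fold of 200 reviews.
metrics = classification_metrics(t_pr=90, t_nr=80, f_pr=20, f_nr=10)
```

With 10-fold cross-validation, these values would be computed on each held-out fold and then averaged.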

Results and Discussion
The comparative analysis of the results obtained by the proposed methodology on the four datasets, using n-gram and TF-IDF as FE algorithms, is presented in Table 2.
From the experiments reported in Table 2, we observed that the FE method has an impact on the performance of every classifier. PA and RR with unigram, bigram or TF-IDF outperformed the other algorithms on different datasets, with an accuracy above 96%.
The top five accuracies of the results obtained by the proposed methodology on the four datasets, using n-gram and TF-IDF as FE algorithms, are shown in Figure 3.
As seen in Figure 3, the top accuracies for the IMDB dataset were achieved by PA, RR, LR, NB and MNB. The bigram performed best with all ML algorithms, although the unigram also achieved high accuracies. The trigram did not work well with the IMDB dataset. On the Amazon products dataset, PA, RR, SVM, AdaBoost and LR achieved the highest accuracies. Using unigram or TF-IDF gave an accuracy above 80%, while trigram was the worst. The highest accuracies on the Cornell movies dataset were achieved by PA, RR, LR, SVM, and AdaBoost. Unigram and TF-IDF outperformed the other FE methods, and trigram gave the lowest accuracies with most of the ML algorithms. On the Twitter dataset, PA, RR, BNB, LR, and MNB achieved the highest accuracies. The trigram gave the lowest accuracies across the different ML algorithms.
The top five precision values of the results obtained by the proposed methodology on the four datasets, using n-gram and TF-IDF as FE algorithms, are shown in Figure 4.
As seen in Figure 4, the top precision values for the IMDB dataset were achieved by PA, RR, LR, MNB, and ME. The bigram performed best with all ML algorithms, although the unigram also had a high precision. On the Amazon products dataset, PA, RR, AdaBoost, SVM, and LR achieved the highest precision when using unigram or TF-IDF as FE methods. The highest precision values on the Cornell movies dataset were achieved by PA, RR, AdaBoost, LR, and SVM. Unigram and TF-IDF outperformed the other FE methods. On the Twitter dataset, PA, RR, BNB, MNB, and LR achieved the highest precision. Trigram gave the lowest precision across the different ML algorithms.
The top five recall values of the results obtained by the proposed methodology on the four datasets, using n-gram and TF-IDF as FE algorithms, are shown in Figure 5.
As seen in Figure 5, the top recall values for the IMDB dataset were achieved by PA, RR, NB, LR and MNB. N-gram outperformed TF-IDF with all ML algorithms. On the Amazon products dataset, PA, RR, AdaBoost, LR, and SVM achieved the highest recall values when using unigram or TF-IDF as FE methods. The highest recall values on the Cornell movies dataset were achieved by PA, RR, LR, SVM, and SGD. Unigram and TF-IDF outperformed the other FE methods. On the Twitter dataset, PA, RR, BNB, MNB, and LR achieved the highest recall. Trigram gave the lowest recall across the different ML algorithms.
The top five F-score values of the results obtained by the proposed methodology on the four datasets, using n-gram and TF-IDF as FE algorithms, are shown in Figure 6.
As seen in Figure 6, the top F-score values for the IMDB dataset were achieved by PA, RR, LR, MNB and SGD. Unigram, bigram and TF-IDF performed better than trigram. On the Amazon products dataset, PA, RR, AdaBoost, LR, and SVM achieved the highest F-score when using unigram or TF-IDF as FE methods. The highest F-score values on the Cornell movies dataset were achieved by PA, RR, LR, SVM, and SGD. Unigram and TF-IDF outperformed the other FE methods. On the Twitter dataset, PA, RR, BNB, MNB, and LR achieved the highest F-score. Trigram gave the lowest F-score across the different ML algorithms.

Conclusions and Future Work
In this research, ten different ML algorithms (NB, SGD, SVM, PA, ME, AdaBoost, MNB, BNB, RR and LR) with two FE algorithms (n-gram and TF-IDF) have been implemented on four SA datasets. The four review datasets are IMDB, Cornell movies, Amazon and Twitter, each of which has a different number of reviews. PA and RR with different FE algorithms perform best among the algorithms on the various datasets, giving the highest accuracies in a range between 87% and 99.96%. Of the other tested algorithms, LR and SVM also show acceptable performance. Some of the open difficulties of sentiment analysis could serve as further extensions of this research. Future research will focus on the detection of sarcasm and on applying sentiment analysis to more domains, such as YouTube, Yelp and Facebook, as well as to cross-domain settings.

Table 1. Characteristics of different datasets.

Table 2. The comparison of various machine learning (ML) classifiers' performance using different features.