This section presents an in-depth analysis of each primary study to answer the above research question. Different spam review detection methods used by previous works are listed and the pros and cons of each are discussed.
The classification of spam review detection methods under this taxonomy is as follows.
3.2.1. Machine Learning Approaches
Machine learning is one of the most important and prominent approaches for spam review detection and is generally categorized into supervised and unsupervised learning [40]. Below, we discuss the different machine learning methods that have been proposed for spam review detection.
1. Supervised Learning
Supervised learning approaches used for spam review detection are commonly based on classification methods [15]. In this learning technique, two datasets are required: training data and test data. The training data are used to train the classifier, and the test data are then used to evaluate its performance [28,46]. Methods such as Support Vector Machine (SVM) and Naïve Bayes (NB) have already shown great success in opinion mining.
Researchers usually start by gathering and crawling the dataset. The next step is to prepare and pre-process the dataset according to the domain. Once the dataset is prepared, features are extracted from it using a feature engineering approach. The classifier is then trained on the training data, and finally its performance is validated on the test data [13].
Table 12 shows the comparison of different supervised learning techniques used in the spam review detection works.
i. Decision Tree Classifier
Decision tree (DT) classifiers give a hierarchical decomposition of the training data space and are used to learn rules that identify the authenticity of a review [47,48]. A tree is formed using different features and their values. Information gain is calculated for each feature in the feature list, and the feature with the maximum information gain becomes the root node of the decision tree. The interior nodes of the tree are labeled with the remaining features, which have lower information gain than the root node. This procedure is repeated until all reviews are classified as spam or not-spam. In the study by Jotheeswaran et al. [49], the IMDb movie reviews dataset was used; the inverse document frequency method extracted unique features, and decision tree induction selected the relevant ones. They claimed to have correctly classified 75% of the reviews as spam or not-spam.
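As a minimal sketch of this procedure (assuming scikit-learn is available; the reviews, labels, and TF-IDF features below are illustrative toy choices, not the dataset or feature set used by Jotheeswaran et al.), a tree grown with the entropy criterion places the highest-information-gain feature at the root:

```python
# Sketch of decision-tree spam classification on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

reviews = ["buy now buy cheap deal",
           "buy buy discount deal now",
           "battery life is great",
           "screen quality is great"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not-spam

# TF-IDF weighting plays the role of the inverse-document-frequency features.
vec = TfidfVectorizer()
X = vec.fit_transform(reviews)

# criterion="entropy" splits on information gain, so the root node is the
# feature with the maximum information gain, as described above.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, labels)
```

An unrestricted tree like this will fit the small training set exactly; on real review data a depth limit or pruning is needed to avoid over-fitting.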
ii. Rule-Based Classifier
Rule-based (RB) classifiers use different rules to classify reviews as spam or not-spam [31,50]. Rules may be applied to reviewer attributes, the content of the review, or the product. Ka et al. [51] used a rule-based approach for emotion cause component detection on a Chinese micro-blog dataset and claimed 65% accuracy. A rule might be based on font size, the time taken to write a review, how often a reviewer writes reviews, the length of the review, or how frequently sentimental words like “bad” and “good” are used [52,53]. The following four sample rules illustrate the process of identifying the spam or not-spam review class.
Rule_1: If a reviewer writes review 1 for product X and then writes review 2 for product X within one minute, review 2 belongs to the spam class.
Rule_2: If a reviewer writes review 1 for product X and then writes review 2 for product X with the same font size and style, review 2 belongs to the spam class.
Rule_3: If there are two reviews for the same product and the lengths of the reviews are also the same, the second review is considered spam.
Rule_4: If a reviewer writes a review for a product with too many sentimental words, such as “bad” and “good”, the review belongs to the spam class.
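The four sample rules can be sketched directly as code. The record fields (`product`, `time`, `font_size`, `font_style`, `text`) and the "too many sentimental words" cut-off in Rule 4 are hypothetical choices made for illustration, not definitions from the surveyed works:

```python
from datetime import datetime, timedelta

def is_spam(prev, curr):
    """Apply the four sample rules; `prev` is the earlier review (or None)."""
    if prev is not None and curr["product"] == prev["product"]:
        # Rule 1: second review for the same product within one minute
        if curr["time"] - prev["time"] <= timedelta(minutes=1):
            return True
        # Rule 2: same font size and style as the earlier review
        if (curr["font_size"], curr["font_style"]) == (prev["font_size"], prev["font_style"]):
            return True
        # Rule 3: same review length
        if len(curr["text"]) == len(prev["text"]):
            return True
    # Rule 4: too many sentimental words such as "bad" and "good"
    words = curr["text"].lower().split()
    sentimental = sum(w in {"bad", "good"} for w in words)
    return sentimental > 0.5 * len(words)  # >50% is a hypothetical cut-off

first = {"product": "X", "time": datetime(2020, 1, 1, 12, 0, 0),
         "font_size": 10, "font_style": "normal", "text": "nice item, works well"}
quick = {"product": "X", "time": datetime(2020, 1, 1, 12, 0, 30),
         "font_size": 12, "font_style": "bold", "text": "amazing, buy it today!"}
later = {"product": "Y", "time": datetime(2020, 1, 2, 9, 0, 0),
         "font_size": 10, "font_style": "normal",
         "text": "good value overall and sturdy build"}
```

Here `quick` is flagged by Rule 1 (same product within one minute of `first`), while `later` passes all four rules.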
iii. Probabilistic Classifier
The probabilistic approach differs from other approaches in that certain changes between different reviews are expressed statistically rather than by rules written by a human or a machine [54].
Bayesian Network:
A Bayesian network shows the probability of the relationship among different nodes (features) [55], where a feature is an element of a review that is used to classify it. Each node of the graphical model represents a random variable, and each edge represents the probability dependence between random variables; the relationships among the edges are represented by a Directed Acyclic Graph. The joint probability of the network is the product, over all nodes, of the probability of each node given that its parents have occurred:

P(x_1, …, x_n) = ∏_i P(x_i | parent(x_i))

where P(x_i) is the probability of any node x_i, P(parent(x_i)) is the probability of its parent, and P(x_i | parent(x_i)) is also called the conditional probability.
This network model has been used in previous works to find spam reviews about any product or group of spam reviewers. Li et al. [27] crawled product reviews from Epinions.com and applied the Naïve Bayesian algorithm; they claimed 63% accuracy in the detection of spam reviews. Halees et al. [34] used Arabic opinion reviews from TripAdvisor and applied the Naïve Bayesian classifier for spam detection, claiming 99% accuracy. Similarly, the system proposed by reference [56] reported 94% accuracy on a customer review dataset by employing the Naïve Bayesian classifier.
Naïve Bayes:
The Naïve Bayes (NB) classifier is a probabilistic, linear classifier based on Bayes’ theorem and is used for both training and classification. It rests on the naïve assumption that the features in a dataset are mutually independent. The following equation is the mathematical representation of the Naïve Bayes classifier:

P(c | x) = P(x | c) · P(c) / P(x)

where P(c | x) is the posterior probability of the target class given the predictor attribute, P(c) is the prior probability of the class, P(x | c) is the probability of the predictor given the class, and P(x) is the prior probability of the predictor. There are different types of Naïve Bayes with different uses. The literature shows that the Naïve Bayes method is further divided into two text classification methods: (1) Multi-variate Bernoulli Naïve Bayes is used when the feature vector is binary, where 0 indicates a feature that does not occur in the review and 1 a feature that does; (2) Multinomial Naïve Bayes is typically used for discrete counts, i.e., how often a word occurs in the document.
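The two variants can be sketched briefly (assuming scikit-learn; the reviews and labels are illustrative toy data): binary occurrence vectors feed the Bernoulli model, while raw word counts feed the multinomial one:

```python
# Sketch of the two Naive Bayes text-classification variants on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

reviews = ["buy now buy cheap deal",
           "buy buy discount deal now",
           "battery life is great",
           "screen quality is great"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not-spam

# (1) Multi-variate Bernoulli NB: 0/1 vectors marking feature occurrence
binary = CountVectorizer(binary=True)
Xb = binary.fit_transform(reviews)
bnb = BernoulliNB().fit(Xb, labels)

# (2) Multinomial NB: discrete counts of how often each word occurs
counts = CountVectorizer()
Xc = counts.fit_transform(reviews)
mnb = MultinomialNB().fit(Xc, labels)
```

The practical difference is that the Bernoulli model also penalizes the *absence* of class-typical words, whereas the multinomial model only uses the words that are present.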
Maximum Entropy:
Maximum entropy (ME) is used when there are only two outcomes of the classification. A maximum entropy model assigns a class by computing a probability from an exponential function of different features, assigning a different weight to each [57]. Logistic Regression (LR) extracts a set of weighted features from the author’s reviews; each feature value is multiplied by its weight, and the weighted sum is passed through the logistic function. Nitin et al. [35] used a dataset crawled from the Amazon website and applied a Logistic Regression learner. They extracted review-content and reviewer-specific features and reported 78% accuracy.
iv. Linear Classifier
Linear classifiers utilize a linear combination of the feature values of reviews and work well for the review classification problem, as they take less time to train than non-linear classifiers [58,59]. Among linear classifiers, Support Vector Machine (SVM) classification is best suited to text data. This is because of the sparse nature of text, where individual features may carry little information on their own but tend to correlate with one another and are generally organized into separate categories [15]. The Support Vector Machine method analyzes data and defines decision boundaries using hyper-planes. In a binary classification problem, the hyper-plane separates the document vectors of one class from those of the other, and the separation between the classes is kept as large as possible. The SVM optimization procedure maximizes predictive accuracy while automatically avoiding over-fitting of the training data. Moreover, SVM projects the input data into a kernel space and then builds a linear model in that space. For a dataset D = {(x_i, y_i)}, where y_i represents the class and x_i is the attribute vector belonging to class y_i, any hyper-plane can be written as w · x + b = 0, where w is the normal vector to the hyper-plane and b is the bias. SVM works very well for small amounts of training data and provides better results with good tokenizers. Several studies [4,15,33,36] used an SVM learner: Ott et al. [15] and Shojaee et al. [33] used hotel review datasets built through Amazon Mechanical Turk (AMT) and reported 89.9% and 84% accuracy, respectively; the variation in accuracy arises because the two works used different feature engineering techniques. On the other hand, Mukherjee et al. [4] used Yelp’s real dataset and claimed 86.1% accuracy in detecting spam reviews, and Fei et al. [36] used an Amazon dataset and reported 71% accuracy.
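The hyper-plane view can be sketched as follows (assuming scikit-learn; the reviews and labels are illustrative toy data): the sign of a document’s score w · x + b decides which side of the hyper-plane, and hence which class, it falls on:

```python
# Sketch of SVM review classification via an explicit hyper-plane, toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

reviews = ["buy now buy cheap deal",
           "buy buy discount deal now",
           "battery life is great",
           "screen quality is great"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not-spam

vec = TfidfVectorizer()
X = vec.fit_transform(reviews)
svm = LinearSVC().fit(X, labels)

# The learned hyper-plane is w . x + b = 0, with w the normal vector and
# b the bias; per-document scores are the signed distances-like values.
w, b = svm.coef_[0], svm.intercept_[0]
scores = X @ w + b
```

Documents with a positive score fall on the spam side of the hyper-plane, matching `svm.predict`.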
2. Unsupervised Learning
Publicly available review datasets with labeled classes are very scarce [4]. Hence, unsupervised learning methods, which do not require class labels, are usually employed on such data [5]. Unsupervised learning methods derive structure by considering the relationships within the data; this structure is known as clustering. Data in one cluster are dissimilar to data in another cluster, and a domain expert may suggest a label for a cluster by observing the characteristics of the data within it.
Table 13 shows the comparison of different unsupervised learning techniques.
i. Twice-Clustering Technique
Twice-clustering is used to improve the precision and diversity of an unsupervised learning method [64,65,66]. It works in a series of steps. First, the original dataset is divided using k-fold cross-validation. Second, the training data are clustered a first time to form subclasses, and clustering is then applied to each subclass to form a sample subset of it. The sample subset of each subclass may introduce some bias; to overcome this problem, the literature indicates that non-uniform random sampling is a good approach for forming the sample subsets [67]. Finally, a subset of each subclass is selected to construct a training set for the unsupervised learner. Jia et al. [61] used the twice-clustering method on a product review dataset from 360buy.com and reported 66% accuracy in the detection of spam reviews.
ii. K-means Clustering
K-means clustering has been shown to work well for large-scale data, and its accuracy is also high compared to other clustering algorithms [68]. The K-means algorithm groups the extracted terms, according to their feature values, into K clusters, where K is any positive number that determines the number of clusters. The K-means clustering algorithm performs the following steps.
1. Pick a number (K) of cluster centers (at random)
2. Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3. Move each cluster center to the mean of its assigned items
4. Repeat steps 2 and 3 until convergence is achieved (change in cluster assignments less than a threshold)
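The steps above can be sketched in a few lines (assuming NumPy; the two-dimensional points and the exact-match convergence test are illustrative simplifications):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K cluster centers at random from the data
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(iters):
        # Step 2: assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # Step 4: stop once cluster assignments no longer change
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # Step 3: move each center to the mean of its assigned items
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
assign, centers = kmeans(points, k=2)
```

On this toy input the algorithm recovers the two obvious groups of points regardless of the random initialization.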
Specifically, previous works reported that K-means clustering yields promising results in the domains of opinion mining and spam detection [69]. Jia et al. [62] reported 71% precision by employing K-means clustering on Chinese-language product reviews. Ha et al. [60] used a K-means approach on mobile phone reviews and reported 72% accuracy.
3.2.2. Lexicon Based Technique
In this technique, different features of a given text are compared against sentiment lexicons and their sentiment values are determined before use. People use different sets of words and expressions to express their feelings and opinions about a product or service, and these words and expressions are stored in sentiment lexicons. A document is positive if it contains more positive lexicon words; otherwise, it is considered negative. Specifically, the following steps are carried out: (i) Each text is pre-processed by removing HTML tags and noisy characters. (ii) The text’s sentiment score is initialized to 0. (iii) The text is tokenized, and each token is checked for presence in the sentiment dictionary. (iv) If the total sentiment score is greater than the threshold, the review is classified as positive; otherwise, it is negative. This technique falls under unsupervised learning, as it does not require labeled training data [70].
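Steps (i)-(iv) can be sketched as follows; the tiny lexicon and the zero threshold are illustrative stand-ins for a real sentiment dictionary and a tuned cut-off:

```python
import re

# Tiny illustrative lexicon; a real system would use a full sentiment dictionary.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}

def classify(text, threshold=0):
    # (i) Pre-process: strip HTML tags and noisy characters
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # (ii) Initialize the sentiment score to 0
    score = 0
    # (iii) Tokenize; look each token up in the sentiment dictionary
    for token in text.split():
        score += LEXICON.get(token, 0)
    # (iv) Compare the total score against the threshold
    return "positive" if score > threshold else "negative"
```

For example, `classify("<b>Great</b> phone with a good battery")` scores +2 and is labeled positive.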
Table 14 presents the accuracy of different lexicon-based approaches.
There are two different methods for the construction of sentimental lexicon: Dictionary-based method and Corpus-based method.
1. Dictionary-based Method
In the dictionary-based method, target opinion words with a known orientation are collected and then looked up in the WordNet dictionary for their antonyms and synonyms. The newly found words are added to the seed list, and this iterative process continues until no new words are found. The limitation of this method is that it is usually difficult to find opinion words for a specific domain. Ben et al. [71] used the dictionary-based method on a review dataset from Blog06 and reported 78% accuracy in the detection of spam reviews. Taboada et al. [72] employed a dictionary-based method on an Amazon Mechanical Turk (AMT) dataset and reported 89% accuracy.
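The iterative seed-list expansion can be sketched as follows; the toy synonym and antonym tables here are hypothetical stand-ins for WordNet lookups:

```python
# Toy synonym/antonym tables standing in for WordNet lookups.
SYNONYMS = {"good": {"great"}, "great": {"excellent"}, "bad": {"poor"}}
ANTONYMS = {"good": {"bad"}, "bad": {"good"}}

def expand_seed(seed):
    lexicon = set(seed)
    frontier = set(seed)
    # Repeat until no new words are found
    while frontier:
        found = set()
        for word in frontier:
            found |= SYNONYMS.get(word, set()) | ANTONYMS.get(word, set())
        frontier = found - lexicon   # keep only genuinely new words
        lexicon |= frontier
    return lexicon
```

Starting from the single seed word "good", the loop pulls in "great" and "bad", then "excellent" and "poor", and stops once no new words appear; a fuller version would also track each word’s orientation (antonyms receive the opposite polarity of the word that produced them).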
2. Corpus-based Technique
This technique is based on syntactic patterns in large corpora [73]. It produces a large collection of opinion words with high accuracy, but it needs large training data. Moreover, this approach can find opinion words with a domain-specific orientation. Its main benefit over the dictionary-based approach is that it produces opinion words specific to the respective domain, and their orientations are better understood; it can also find domain- and context-specific opinion words and their orientations using a domain corpus. The corpus-based technique, being based on domain-specific orientation, is best suited here, as a word or phrase listed in an opinion lexicon does not necessarily express an opinion in a given sentence. For instance, in the sentence “I am looking for good health insurance”, “good” expresses neither a positive nor a negative opinion about any particular insurance. Aurangzeb et al. [74] used a corpus-based approach on customer review data and claimed 86.6% accuracy. Zhang et al. [75] employed a corpus-based technique on a Chinese-language review dataset and proposed an aspect-based sentiment analysis system, claiming 82% accuracy. Medinas et al. [76] used a combination of machine learning and lexicon approaches and claimed 82% accuracy using CNET and IMDb datasets.
Discussion
This section reviewed existing literature on spam review detection methods published between 2007 and 2018. An attempt has been made to provide researchers with a comparative analysis of different spam review detection methods and their reported accuracy. Generally, spam review detection techniques are classified into two categories. The first one is machine-learning-based methods, which are further classified into two categories, supervised and unsupervised learning. The accuracy of different supervised-learning-based works is presented in
Table 12. It shows that Support Vector Machine and Naïve Bayes perform better as compared to other supervised learning methods.
Table 13 shows that Aspect-based and K-nearest-neighbor approaches perform better among the unsupervised learning approaches. The second approach is Lexicon-based, which is further divided into two categories, Dictionary-based and Corpus-based methods. The Dictionary-based approach is more efficient in terms of processing time compared to supervised learning, but yields lower accuracy. The Corpus-based technique depends upon a dictionary related to the specific seed words of the domain.
Table 14 shows that the Corpus-based and Dictionary-based approaches produce better accuracy than other Lexicon-based techniques. Existing literature indicates that all spam review detection methods are effective in identifying spam reviews; however, machine-learning-based supervised approaches generally yield better results. In recent years, new network-based filtering algorithms have been proposed that filter out good or bad opinions from review datasets to aid the potential user; these algorithms produce better accuracy than existing network-based spam review detection methods [77,78].