1. Introduction
User-created content (UCC, also called user-generated content) has emerged in the era of Web 2.0, encouraging individuals to publish their opinions or reviews on many types of subjects, such as e-commerce [1], the tourism industry [2], software engineering [3], and so forth. People can easily publish their opinions, comments, or views on online shopping platforms, blogs, and forums. For instance, user comments are a must-have section on a shopping website. Potential customers typically read other people's reviews before making purchase decisions, and they may be strongly influenced by existing users' opinions. Negative reviews can damage the reputation of a product or even a brand. Conversely, positive reviews can give a product a favorable image and an advantage over competing products in the market.
Some companies purposely post positive reviews to promote their own products. Even worse, some hire an "Online Water Army" [4] to post negative comments that defame competitors' products. Since fake reviews can be generated at little cost, a large number of deceptive reviews have arisen with the development of the Internet.
To combat fabricated reviews, researchers have proposed several approaches to differentiate false reviews from truthful ones. Jindal and Liu [5] pointed out that spam reviews are widespread after analyzing 5.8 million reviews and 2.14 million reviewers from Amazon.com. They categorize spam reviews into three types: untruthful reviews, only-one-brand reviews, and irrelevant reviews. They also termed such fake opinions "opinion spam", which is created with malicious incentives and purposes.
According to Jindal and Liu's work, it is not hard to manually identify only-one-brand reviews and irrelevant reviews. However, it is very difficult for human beings to differentiate purposely fabricated reviews from truthful ones; in other words, it is very hard to identify deceptive reviews that spammers have deliberately and elaborately crafted [5]. Thus, several machine learning techniques have been proposed to automate the classification task of deceptive review identification. Due to the lack of a gold-standard set of spam reviews, Jindal and Liu treated duplicate reviews as spam for research purposes. However, Pang and Lee deem it inappropriate to regard only duplicate reviews as spam because the notion of a spam review covers much more than duplication [6]. In this paper, we define spam reviews as reviews purposely crafted to mislead ordinary consumers with fake information on the Internet.
Ott et al. [7] combine n-gram features and psychological features to identify spam reviews. With a support vector machine (SVM) as the base classifier, they report accuracy as high as 90% on the identification task, far exceeding human judgment, which achieves only about 60% accuracy. Other researchers have also proposed approaches to automatic spam identification, such as Feng et al. [8], Zhou et al. [9], and Li et al. [10].
In our prior work [11], we examined two base classifiers: SVM and Naïve Bayes. The experimental results indicate that the average accuracy is very similar for the two base classifiers. This conclusion is also consistent with other researchers' findings, such as Huang et al. [12]. For this reason, we only use SVM as the base classifier in this research.
Labeled reviews are crucial for improving the performance of spam identification. However, labeled reviews are expensive to obtain in practice: producing enough of them to train the base classifier requires extensive human labor and time. In contrast, vast amounts of unlabeled reviews are available on the Internet. Thus, it is natural to attempt to exploit unlabeled reviews for the spam review identification problem by adopting particular assumptions, such as the smoothness assumption and the cluster assumption [13].
This paper proposes a novel approach to identifying spam reviews based on entropy and the co-training algorithm, making use of unlabeled reviews. The remainder of this paper is organized as follows.
Section 2 presents related work.
Section 3 proposes the CoFea (Co-training by Features) approach.
Section 4 reports the experiments on the spam dataset, and
Section 5 concludes the paper.
2. Related Work
2.1. Entropy
The amount of information carried by a message is directly related to its uncertainty. The entropy $H$ of a signal $x$ was defined by Shannon [14] to measure the amount of information. The entropy can be written explicitly as in Equation (1):

$H(X)=-{\sum}_{i=1}^{n}p({x}_{i}){\mathrm{log}}_{b}p({x}_{i})$ (1)

where $H(X)$ is the entropy of a discrete variable $X=\{{x}_{1},...,{x}_{n}\}$; $b$ is the base of the logarithm (generally $b=2$, in which case the unit of entropy is the bit); and $p({x}_{i})$ is the probability of sample ${x}_{i}$. The value of $p({x}_{i}){\mathrm{log}}_{b}p({x}_{i})$ is taken to be 0 in the case of $p({x}_{i})=0$.
Based on the idea of entropy, the proposed CoFea approach calculates the entropy $H(x)$ of each term $x$ as in Equation (2):

$H(x)=-{\sum}_{i=1}^{2}\frac{{x}_{i}}{{x}_{1}+{x}_{2}}{\mathrm{log}}_{b}\frac{{x}_{i}}{{x}_{1}+{x}_{2}}$ (2)

Here, $|D|$ is the total number of reviews, $|{D}_{1}|$ is the number of truthful reviews, and $|{D}_{2}|$ is the number of deceptive reviews. ${x}_{1}$ is the number of truthful reviews that contain term $x$, and ${x}_{2}$ is the number of deceptive reviews that contain term $x$. The result of the summation is the entropy value of term $x$. If a term occurs with the same frequency in both truthful and deceptive reviews, we cannot deduce useful information from it for deceptive review identification. However, if a term occurs only in truthful reviews or only in deceptive reviews, it provides useful information for deceptive review identification due to the potential link between the term and the labels of the reviews. Moreover, terms of the former kind have larger entropy values than terms of the latter kind. From this point of view, the smaller the entropy is, the greater the amount of information the term contains for deceptive review identification.
We sort the lexical terms of all the reviews by their entropy scores in descending order. Then we evenly divide the sorted sequence of terms into two distinct subsets: the odd-numbered terms and the even-numbered terms. Feature selection is conducted based on the terms' entropy values: that is, we regard subset $I$ as view ${X}_{1}$ and subset $II$ as view ${X}_{2}$.
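As an illustration, the entropy scoring and odd/even view construction described above can be sketched in Python as follows. This is our own toy sketch, not the authors' code; the function and variable names are hypothetical, and the entropy is computed over a term's occurrence counts in the two classes.

```python
import math

def term_entropy(x1, x2, b=2):
    """Entropy of a term's occurrence distribution over the two classes.

    x1: number of truthful reviews containing the term;
    x2: number of deceptive reviews containing the term.
    A term spread evenly over both classes has maximal entropy (1 bit);
    a term confined to a single class has entropy 0.
    """
    total = x1 + x2
    h = 0.0
    for count in (x1, x2):
        p = count / total
        if p > 0:  # p * log(p) is taken to be 0 when p = 0
            h -= p * math.log(p, b)
    return h

def split_views(term_counts):
    """Sort terms by entropy and split the sequence into two views.

    term_counts maps term -> (x1, x2). The odd-numbered terms form
    view X1 and the even-numbered terms form view X2, so both views
    cover the whole range of entropy scores.
    """
    ranked = sorted(term_counts,
                    key=lambda t: term_entropy(*term_counts[t]),
                    reverse=True)  # descending order, as in the text
    return ranked[0::2], ranked[1::2]
```

Because the ranking alternates between the two views, each view ends up with terms of nearly the same entropy distribution, which is what the odd/even split is meant to achieve.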
2.2. The Co-Training Algorithm
The co-training algorithm is a semi-supervised method, invented to combine labeled and unlabeled samples when the data can be regarded as having two distinct views [15]. Pioneers of the co-training method obtained accuracy as high as 95% in categorizing 788 webpages using only 12 labeled ones [16]. Other researchers have extended this study into three categories: co-training with multiple views, co-training with multiple classifiers, and co-training with multiple manifolds [17]. In this paper, we apply co-training with multiple views to spam review identification.
We use two feature sets, ${X}_{1}$ and ${X}_{2}$, to describe a data sample $x$. In other words, we have two different "views" of the data sample $x$, and each view can be represented by a vector ${x}_{i}\hspace{0.17em}(i\hspace{0.17em}=\hspace{0.17em}1,\hspace{0.17em}2)$. We assume that the two views ${X}_{1}$ and ${X}_{2}$ are independent of each other and that each view by itself provides sufficient information for the classification task. If we use ${f}_{1}$ and ${f}_{2}$ to denote the classifiers derived from the two views, we will obtain ${f}_{1}({x}_{1})={f}_{2}({x}_{2})=l$, where $l$ is the true label of $x$. Put another way: for a data point $x=({x}_{1},{x}_{2})$, if we derive two classifiers ${f}_{1}$ and ${f}_{2}$ from the training data, ${f}_{1}({x}_{1})$ and ${f}_{2}({x}_{2})$ will predict the same label $l$.
2.3. Support Vector Machine (SVM)
SVM is a supervised learning model proposed by Vapnik et al. [18]. The method minimizes structural risk and turns out to provide better performance than traditional classifiers. Formally, an SVM constructs a hyperplane in a Hilbert space of higher dimension than the original input space, obtained via kernel methods [19]. A good hyperplane achieves the largest distance to the nearest training data point of any class.
The optimal hyperplane can be expressed as a combination of support vectors (i.e., data samples in the input space). Generally, the optimization problem of the hyperplane can be written as follows [20].
subject to
Here, $({x}_{i},{y}_{i})\hspace{0.17em}(1\le i\le l)$ are the $l$ labeled data samples. For each $i\in \{1,...,l\}$, we use the slack variable ${\zeta}_{i}$ to tolerate outlier samples. Note that ${\zeta}_{i}=\mathrm{max}(0,1-{y}_{i}(w\cdot {x}_{i}+b))$; that is, ${\zeta}_{i}$ is the smallest nonnegative number satisfying ${y}_{i}(w\cdot {x}_{i}+b)\ge 1-{\zeta}_{i}$. $w$ is the normal vector of the hyperplane. After the optimal hyperplane problem is solved, the decision function $f(x)=\mathrm{sgn}(w\cdot x+b)$ can be used to classify unlabeled data. Intuitively, the larger the distance of a data point from the hyperplane, the more confident we are in classifying the data sample with the SVM. For this reason, we use the distance $\frac{|w\cdot x+b|}{\Vert w\Vert}$ as the confidence of the classification result.
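To make the confidence measure concrete, the following sketch computes the decision value and the distance-based confidence for a linear SVM with a given weight vector $w$ and bias $b$. This is a minimal illustration of the formula above; in practice, $w$ and $b$ come from solving the optimization problem.

```python
import math

def svm_decision(w, b, x):
    """Classify x with a linear SVM and report a confidence score.

    The predicted label is sgn(w . x + b); the confidence is
    |w . x + b| / ||w||, i.e., the geometric distance of x from the
    separating hyperplane.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    label = 1 if score >= 0 else -1
    return label, abs(score) / norm
```

For example, with $w=(3,4)$ and $b=0$, the point $(1,0)$ has decision value 3 and distance $3/5=0.6$ from the hyperplane.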
3. CoFea—The Proposed Approach
The co-training algorithm needs two different views to combine labeled and unlabeled reviews. State-of-the-art co-training techniques usually use views of different types. For instance, Liu et al. [17] propose combining the textual content (lexical terms) of a webpage and the hyperlinks in the webpage as two distinct views for webpage classification with the co-training algorithm. Differing from their method, the CoFea algorithm identifies spam reviews using two views composed purely of lexical terms. The details of the CoFea algorithm are described in Algorithm 1.
Algorithm 1 The CoFea Algorithm 
Input:
$L$: an ${n}_{L}$-sized set of truthful or deceptive labeled reviews; $U$: an ${n}_{U}$-sized set of unlabeled reviews; ${X}_{1}$: the first feature set of terms derived from the words in reviews; ${X}_{2}$: the second feature set of terms derived from the words in reviews; $K$: the number of iterations; ${n}_{{U}^{\prime}}$: the number of reviews in the pool ${U}^{\prime}$ drawn from $U$; $n$: the number of selected reviews classified as truthful; $p$: the number of selected reviews classified as deceptive; $\mathsf{\Upsilon}$: the set of review labels, i.e., $\mathsf{\Upsilon}=\{deceptive,truthful\}$.
Output:
Procedure:
1. Randomly sample an ${n}_{{U}^{\prime}}$-sized review set ${U}^{\prime}$ from $U$;
2. Based on the $L$ set, train a classifier ${f}_{1}^{(0)}$ on the view ${X}_{1}$ and a classifier ${f}_{2}^{(0)}$ on the view ${X}_{2}$;
3. For $t=1,...,K$ iterations:
4. Use ${f}_{1}^{(t-1)}$ and ${f}_{2}^{(t-1)}$ to classify the reviews in ${U}^{\prime}$;
5. Select $n$ reviews classified as truthful and $p$ reviews classified as deceptive by ${f}_{1}^{(t-1)}$, and likewise $n$ truthful and $p$ deceptive reviews classified by ${f}_{2}^{(t-1)}$;
6. Add all $2n+2p$ selected reviews to $L$ and remove them from ${U}^{\prime}$;
7. Randomly choose $2n+2p$ reviews from $U$ and merge them into ${U}^{\prime}$;
8. Based on the new $L$ set, retrain a classifier ${f}_{1}^{(t)}$ on the view ${X}_{1}$ and a classifier ${f}_{2}^{(t)}$ on the view ${X}_{2}$;
9. End for.

At the preparatory phase, we produce a lexicon including all those terms appearing in the reviews. Then, the lexicon is divided into two distinct subsets evenly. Here, we take one subset as $I$ and the other as $II$. That is, the terms in subset $I$ are different from the terms in subset $II$ and vice versa. Note that the two subsets are of the same size. Further, we regard subset $I$ as view ${X}_{1}$, and subset $II$ as view ${X}_{2}$.
With lines 1 and 2, we use the labeled reviews $L$ to train our base SVM classifiers on the two views ${X}_{1}$ and ${X}_{2}$. With lines 4–6, we use the trained classifiers to classify the unlabeled reviews in the pool ${U}^{\prime}$ and select $2n+2p$ classified reviews from ${U}^{\prime}$ to augment the training set $L$. Then, with line 7, we fetch $2n+2p$ unlabeled reviews from the set $U$ to replenish the pool ${U}^{\prime}$. Finally, with line 8, the base classifiers are retrained on the augmented training set $L$. After $K$ iterations, the CoFea algorithm is fully trained for spam review identification. We adopt the 10-fold cross-validation method to evaluate accuracy when testing the CoFea algorithm.
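The iterative loop above can be sketched as follows. This is our own minimal reimplementation under simplifying assumptions, not the authors' code: a pluggable `train` function stands in for the SVM, each sample carries its two views as a tuple, and we assume the most confident predictions are the ones selected.

```python
import random

def cofea(L, U, train, K, pool_size, n, p):
    """Sketch of the CoFea loop (Algorithm 1).

    L: list of (sample, label) pairs; each sample is a tuple of its
       two views (view1, view2).
    U: list of unlabeled samples.
    train(labeled, view): trains a base classifier on one view and
       returns a function mapping a view value to (label, confidence).
    """
    L, U = list(L), list(U)
    random.shuffle(U)
    pool, U = U[:pool_size], U[pool_size:]      # draw the pool U'
    for _ in range(K):
        classifiers = [train(L, view=0), train(L, view=1)]
        for view, f in enumerate(classifiers):
            for wanted, count in (("truthful", n), ("deceptive", p)):
                # rank the pool by this classifier's confidence
                ranked = sorted(((f(x[view]), x) for x in pool),
                                key=lambda t: -t[0][1])
                picks = [x for (lab, _), x in ranked if lab == wanted][:count]
                for x in picks:
                    L.append((x, wanted))       # augment L
                    pool.remove(x)              # remove from U'
        refill = min(2 * (n + p), len(U))       # replenish U' from U
        pool, U = pool + U[:refill], U[refill:]
    # classifiers retrained on the final augmented L
    return train(L, view=0), train(L, view=1)
```

A toy `train` function (e.g., one that classifies by the sign of a single view value) is enough to exercise the loop; with real data, `train` would fit an SVM on the view's TF-IDF vectors.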
There are two alternative strategies for feature selection (in this paper, we use the terms occurring in the reviews as features), and those features are used to represent the reviews as data samples. That is, each individual review is represented by a numeric vector, and each dimension of the vector corresponds to one feature.
The CoFea-T (T for "total") strategy uses all terms in the lexicon for review representation. Under this strategy, we use subset $I$ and subset $II$ of views ${X}_{1}$ and ${X}_{2}$ as the features to represent each review as two numeric vectors. The two numeric vectors are then used to train the base classifiers with the CoFea algorithm.
The details of the CoFea-T strategy are shown in Algorithm 2.
Algorithm 2 The CoFea-T Strategy
Input:
$L$: an ${n}_{L}$-sized set of truthful or deceptive labeled reviews; $U$: an ${n}_{U}$-sized set of unlabeled reviews; ${r}_{L}{}^{a}$, ${r}_{L}{}^{b}$: vectors representing one labeled review, where the superscript $a$ or $b$ indicates the view; ${r}_{U}{}^{a}$, ${r}_{U}{}^{b}$: vectors representing one unlabeled review, where the superscript $a$ or $b$ indicates the view; ${R}_{L}{}^{a}$, ${R}_{L}{}^{b}$: representation sets of $L$ under views $a$ and $b$; ${R}_{U}{}^{a}$, ${R}_{U}{}^{b}$: representation sets of $U$ under views $a$ and $b$;
Output:
Procedure:
1. Traverse all the reviews, both labeled and unlabeled, without repetition, to obtain a $\rho $-sized lexicon with the entropy of each term attached as an attribute;
2. Sort the terms by entropy;
3. Represent every review as a vector, using the odd–even order of the terms in the lexicon to divide them into two views; obtain ${r}_{L}{}^{a}$, ${r}_{L}{}^{b}$, ${r}_{U}{}^{a}$ and ${r}_{U}{}^{b}$. Each vector has length $\rho /2$, and the two views contain terms with nearly the same entropy distribution;
4. Return ${R}_{L}{}^{a}$, ${R}_{L}{}^{b}$, ${R}_{U}{}^{a}$ and ${R}_{U}{}^{b}$.

The CoFea-S (S for "sampling") strategy uses only some of the terms in subset $I$ and subset $II$ for review representation. From the entropy scores, we know that some lexical terms carry very limited information for spam review identification; that is, these terms occur in deceptive and truthful reviews with no difference in frequency. If such terms were used for review representation, as in the CoFea-T strategy, they would incur considerable computational cost because each review vector would be very long. Thus, it is wise to reduce the length of each review vector to save time in the CoFea algorithm. This idea motivates the CoFea-S strategy. Under this strategy, we rank all terms in the lexicon by their entropy scores. Then, we predefine an entropy threshold to remove terms from the lexicon and partition the new lexicon into subset $I$ and subset $II$ for review representation.
The details of the CoFea-S strategy are shown in Algorithm 3. The difference between the CoFea-T strategy and the CoFea-S strategy lies at line 3 of Algorithm 3. Under the CoFea-S strategy, we stipulate that the numeric vector of each review has length $\theta $. That is, we use the top $2\theta $ terms with the minimum entropy scores as the lexicon. The lexicon is then split into two subsets, subset $I$ and subset $II$, for review representation to construct the views ${X}_{1}$ and ${X}_{2}$. In the experiments, we compare the two strategies embedded in the CoFea algorithm for spam review identification.
Algorithm 3 The CoFea-S Strategy
Input:
$L$: an ${n}_{L}$-sized set of truthful or deceptive labeled reviews; $U$: an ${n}_{U}$-sized set of unlabeled reviews; ${r}_{L}{}^{a}$, ${r}_{L}{}^{b}$: vectors representing one labeled review, where the superscript $a$ or $b$ indicates the view; ${r}_{U}{}^{a}$, ${r}_{U}{}^{b}$: vectors representing one unlabeled review, where the superscript $a$ or $b$ indicates the view; ${R}_{L}{}^{a}$, ${R}_{L}{}^{b}$: representation sets of $L$ under views $a$ and $b$; ${R}_{U}{}^{a}$, ${R}_{U}{}^{b}$: representation sets of $U$ under views $a$ and $b$;
Output:
Procedure:
1. Traverse all the reviews, both labeled and unlabeled, without repetition, to obtain a $\rho $-sized lexicon with the entropy of each term attached as an attribute;
2. Sort the terms by entropy;
3. Set a number $\theta $ to determine the length of each vector, i.e., only the top minimum-entropy terms will appear in the vector;
4. Represent every review as a vector; obtain ${r}_{L}{}^{a}$, ${r}_{L}{}^{b}$, ${r}_{U}{}^{a}$ and ${r}_{U}{}^{b}$. Each vector has length $\theta $;
5. Return ${R}_{L}{}^{a}$, ${R}_{L}{}^{b}$, ${R}_{U}{}^{a}$ and ${R}_{U}{}^{b}$.
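The entropy-threshold selection in Algorithm 3 can be sketched as follows. This is our own illustration with hypothetical names: keep the 2θ lowest-entropy terms (the most informative ones, by the discussion in Section 2.1) and split them into the two views by odd/even rank.

```python
def cofea_s_views(entropy_by_term, theta):
    """CoFea-S-style selection: keep the 2*theta lowest-entropy terms,
    then split them into two views by odd/even rank so each view gets
    theta terms with nearly the same entropy distribution."""
    ranked = sorted(entropy_by_term, key=entropy_by_term.get)[:2 * theta]
    return ranked[0::2], ranked[1::2]
```

With θ = 100, as used in the experiments below, each review would be represented by two 100-dimensional vectors instead of two ρ/2-dimensional ones.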

4. Experiments
4.1. The Data Set
The data set we use in the experiments is the spam dataset from Ott et al. [7]. For each review, we conduct stop-word elimination, stemming, and term frequency–inverse document frequency (TF-IDF) representation [11]. The stop-word list is taken from the USPTO (United States Patent and Trademark Office) patent full-text and image database [21]. The Porter stemming algorithm is employed to produce each individual word stem [22]. We extract the sentences from the reviews using the sentence boundary detection method described by Weiss et al. [23]. All terms in the reviews are represented by the TF-IDF method, a numerical statistic that reflects how important a word is to a document in a collection or corpus [24]. We compute the entropy of all lexical terms in the reviews to divide them into two subsets, subset $I$ and subset $II$, under the CoFea-T strategy and the CoFea-S strategy, respectively. Basic information about the reviews in the spam data set is shown in Table 1. The data set has 7677 terms (i.e., words, including numbers and abbreviations) after preprocessing.
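For illustration, a toy version of the stop-word removal and TF-IDF weighting steps might look like the following. This is a simplified sketch: stemming is omitted, and the stop-word list here is a small placeholder, not the USPTO list used in the paper.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is"}  # placeholder list

def tfidf_vectors(docs):
    """Represent each document as a dict of term -> TF-IDF weight.

    TF is the term frequency within the document; IDF is log(N / df),
    where df is the number of documents containing the term.
    """
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors
```

A term that appears in every document gets an IDF of zero and thus a zero weight, which is exactly why uninformative terms contribute little to the review representation.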
4.2. Experiment Setup
In the CoFea algorithm, three parameters, $K$, $n$, and $p$ (i.e., the number of iterations, the number of unlabeled reviews classified as truthful, and the number of unlabeled reviews classified as deceptive), must be tuned to optimize the performance of the algorithm. From our previous work [11], we know that $n$ and $p$ should be equal. We track the accuracy of spam review identification while tuning the parameter $n\hspace{0.17em}(n=p)$ with $K$ fixed, and likewise while tuning $K$ with $n\hspace{0.17em}(n=p)$ fixed. Because the CoFea-S strategy needs a predefined threshold $\theta $ for feature selection, we use $\theta =100$ in this paper. The parameter-tuning results of the CoFea-T and CoFea-S strategies are shown in Figure 1a,b, respectively.
In the experiments, we set the parameter $n=p$ from 1 to 6 and vary the parameter $K$ from 1 to 20. To average the performance, we implement 5-fold cross-validation; that is, we use a training set of 320 reviews and a test set of 80 reviews 5 times. Each time, we randomly choose 5% of the data (16 data samples) for classifier training and 15% of the data (48 data samples) for the testing phase.
To compare the CoFea algorithm with other state-of-the-art techniques, we compare it with the CoSpa algorithm (another co-training approach for spam review identification) [11] in the experiments. The CoSpa algorithm uses lexical terms as one view and probabilistic context-free grammars as another view in co-training for spam review identification. We fix the parameters $n=p=5$ in both the CoFea algorithm and the CoSpa algorithm when tuning the parameter $K$, and we fix $K=5$ in both algorithms when tuning the parameter $n=p$.
4.3. Experimental Results
Figure 1 shows the performance of the CoFea algorithm embedded with the CoFea-T strategy (left) and the CoFea-S strategy (right). In the figure, warm-toned color blocks represent high accuracy and cool-toned color blocks represent low accuracy. We can see from Figure 1a that, as the number of iterations increases, the accuracy of spam review identification improves to a certain degree. A similar phenomenon occurs in the CoFea-S experiment. The accuracy quickly approaches its ceiling around $K=6$, which indicates that the CoFea algorithm has certain limits. In general, the CoFea-T strategy performs better than the CoFea-S strategy.
Figure 2 shows the performance of the different co-training algorithms (CoSpa, CoFea-T, and CoFea-S) in spam review identification. The CoSpa algorithm has an average accuracy of 0.8157, with 0.8375 as its highest accuracy. The CoFea-T algorithm has an average accuracy of 0.8202, with 0.8326 as its highest accuracy. The CoFea-S algorithm has an average accuracy of 0.7994, with 0.8105 as its highest accuracy.
To better illustrate the effectiveness of CoFea-T, CoFea-S, and CoSpa, we employ the Wilcoxon signed-rank test [25] to examine the statistical significance of the experimental results when fixing $n=p=5$. The Wilcoxon signed-rank test indicates that the CoFea-T strategy outperforms the CoSpa algorithm with a two-tailed p-value of 0.0438, the CoFea-S strategy outperforms the CoSpa algorithm with a two-tailed p-value of 0.0002, and the CoFea-T strategy outperforms the CoFea-S strategy with a two-tailed p-value below 0.0001.
When fixing $K=5$, the Wilcoxon signed-rank test indicates that the CoFea-T and CoFea-S algorithms outperform the CoSpa algorithm, and that the CoFea-T algorithm outperforms the CoFea-S algorithm.
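As a sketch of the statistic behind these comparisons, the Wilcoxon signed-rank statistic $W$ for two paired accuracy sequences can be computed as follows. This is our own illustrative implementation; the p-values reported above come from the full distribution of $W$ (or a normal approximation), which we do not reproduce here.

```python
def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic W for paired samples a and b.

    Zero differences are dropped and tied absolute differences receive
    averaged ranks; W is the smaller of the positive- and negative-rank
    sums. Converting W to a p-value is omitted.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

In practice, a library routine such as `scipy.stats.wilcoxon` would be used to obtain the two-tailed p-values directly.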
The CoSpa algorithm only attains the highest single-point accuracy, without taking stability into account. The CoFea-T algorithm has the highest mean accuracy among all the algorithms. The CoFea-S strategy performs well, very close to the CoFea-T strategy.
Figure 3 shows the time consumed in conducting the above experiments. The machine used in the experiments has the following configuration. CPU: Intel(R) Core(TM) i7-4700MQ @ 2.40 GHz; RAM: Kingston(R) DDR3 1600 MHz 4 × 2 GB; hard drive: HGST(R) 500 GB @ 7200 r/min. The CoFea algorithm clearly has better speed in identifying spam reviews, especially for large numbers of iterations. Considering the motivation for proposing the CoFea algorithm (i.e., using terms only, without other views), we argue that it is one of the most advisable algorithms among the state-of-the-art techniques in the spam review identification domain.
5. Concluding Remarks
In this paper, we propose a new approach, called CoFea, based on entropy and the co-training algorithm, to identify spam reviews by making use of unlabeled reviews. The experimental results show some promising aspects of the proposed approach. The contributions of the paper can be summarized as follows.
First, we sort terms by entropy. This allows feature selection to be conducted based on the amount of information a term contains.
Second, we propose two strategies, CoFea-T and CoFea-S, which use different vector lengths for each view, to be embedded in the CoFea algorithm for spam review identification.
Third, we conduct experiments on the spam review set to compare the proposed approach with state-of-the-art techniques in spam review identification. Experimental results show that both the CoFea-T and CoFea-S strategies perform well on spam review identification. The CoFea-T strategy produces better accuracy than the CoFea-S strategy, while the CoFea-S strategy needs less computation time than the CoFea-T strategy. When no other views are available to implement the co-training algorithm, the CoFea algorithm is a good alternative for spam review identification using textual content only.
Although this paper has shown some promising aspects of using co-training for spam review identification, we admit that it is merely an initial step. In the future, on the one hand, we will use more data sets to examine the effectiveness of the proposed CoFea algorithm in spam review identification. On the other hand, we will extend the co-training algorithm to more research areas, such as sentiment analysis [26], image recognition [27], and text classification [28]. In fact, text classification is a basic technique for deceptive review identification, and all the techniques mentioned in this paper can be extended to text classification. With the prosperity of e-commerce and online shopping, we regard deceptive review identification as a more practical application for users than pure text classification. As for sentiment analysis and image recognition, simple lexical terms cannot serve as features for co-training, and there is no established theory on how to build the feature sets. For this reason, we will first conduct research on feature extraction for these two tasks and then apply the proposed co-training approach to them.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants No. 91218302, 91318301, 61379046, and 61432001, and by the Fundamental Research Funds for the Central Universities (buctrc201504).
Author Contributions
Wen Zhang and Taketoshi Yoshida conceived and designed the experiments; Wen Zhang performed the experiments; Wen Zhang and Chaoqi Bu analyzed the data; Siguang Zhang contributed analysis tools; Wen Zhang and Chaoqi Bu wrote the paper. All authors have read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
Aljukhadar, M.; Senecal, S. The user multifaceted expertise: Divergent effects of the website versus e-commerce expertise. Int. J. Inf. Manag. 2016, 36, 322–332. [Google Scholar] [CrossRef]
Xiang, Z.; Magnini, V.P.; Fesenmaier, D.R. Information technology and consumer behavior in travel and tourism: Insights from travel planning using the Internet. J. Retail. Consum. Serv. 2015, 22, 244–249. [Google Scholar] [CrossRef]
 Zhang, W.; Wang, S.; Wang, Q. KSAP: An approach to bug report assignment using KNN search and heterogeneous proximity. Inf. Softw. 2016, 70, 68–84. [Google Scholar] [CrossRef]
 Sui, D.Z. Mapping and Modeling Strategic Manipulation and Adversarial Propaganda in Social Media: Towards a tipping point/critical mass model. In Proceedings of the Workshop on Mapping Ideas: Discovering and Information Landscapes, San Diego, CA, USA, 29–30 June 2011.
 Jindal, N.; Liu, B. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Palo Alto, CA, USA, 11–12 February 2008; pp. 219–230.
 Pang, B.; Lee, L. Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2008, 2, 1–134. [Google Scholar] [CrossRef]
 Ott, M.; Choi, Y.; Cardie, C.; Hancock, J.T. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, OR, USA, 19–24 June 2011; pp. 309–319.
 Feng, V.W.; Hirst, G. Detecting deceptive opinions with profile compatibility. In Proceedings of the International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–18 October 2013.
 Zhou, L.; Shi, Y.; Zhang, D. A Statistical Language Modeling Approach to Online Deception Detection. IEEE Trans. Knowl. Data Eng. 2008, 20, 1077–1081. [Google Scholar] [CrossRef]
Li, H.; Chen, Z.; Mukherjee, A.; Liu, B.; Shao, J. Analyzing and Detecting Opinion Spam on a Large-scale Dataset via Temporal and Spatial Patterns. In Proceedings of the 9th International AAAI Conference on Web and Social Media, Oxford, UK, 26–29 May 2015.
Zhang, W.; Bu, C.; Yoshida, T.; Zhang, S. CoSpa: A Co-training Approach for Spam Review Identification with Support Vector Machine. Information 2016, 7, 12. [Google Scholar] [CrossRef]
 Huang, J.; Lu, J.; Ling, C.X. Comparing Naive Bayes, Decision Trees, and SVM with AUC and Accuracy. In Proceedings of the 3rd IEEE International Conference on Data Mining, Melbourne, FL, USA, 19–22 November 2003; p. 553.
 Chapelle, O.; Schölkopf, B.; Zien, A. SemiSupervised Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
 Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Workshop on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100.
 Committee on the Fundamentals of Computer Science—Challenges and Opportunities. Computer Science: Reflections on the Field, Reflections from the Field; ISBN 0309093015. The National Academies Press: Washington, DC, USA, 2004. [Google Scholar]
 Liu, W.; Li, Y.; Tao, D.; Wang, Y. A general framework for cotraining and its applications. Neurocomputing 2015, 167, 112–121. [Google Scholar] [CrossRef]
 Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
 ShaweTaylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
 Joachims, T. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the 1999 International Conference on Machine Learning, Bled, Slovenia, 27–30 June 1999; pp. 200–209.
 USPTO Stop Words. Available online: http://ftp.uspto.gov/patft/help/stopword.htm (accessed on 1 March 2016).
 Porter Stemming Algorithm. Available online: http://tartarus.org/martin/PorterStemmer/ (accessed on 29 November 2016).
 Weiss, S.M.; Indurkhya, N.; Zhang, T.; Damerau, F. Text Mining: Predictive Methods for Analyzing Unstructured Information; Springer: New York, NY, USA, 2004; pp. 36–37. [Google Scholar]
 Rajaraman, A.; Ullman, J.D. Data Mining. In Mining of Massive Datasets; Cambridge University Press: London, UK, 2011; pp. 1–17. [Google Scholar]
 Wilcoxon, F. Individual comparisons by ranking methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
 Ravi, K.; Ravi, V. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowl. Based Syst. 2015, 89, 14–46. [Google Scholar] [CrossRef]
 Jain, A.K.; Duin, R.P.W.; Mao, J. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 4–37. [Google Scholar] [CrossRef]
 Zhang, W.; Yoshida, T.; Tang, X. Text classification based on multiword with support vector machine. Knowl. Based Syst. 2008, 21, 879–886. [Google Scholar] [CrossRef]
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).