CoSpa: A Co-training Approach for Spam Review Identification with Support Vector Machine
Abstract
1. Introduction
2. Related Work
2.1. Co-Training with Multiple Views
2.2. Probabilistic Context-Free Grammar (PCFG)
- Rule type 1: unlexicalized production rules (i.e., all production rules except for those with terminal nodes). For instance, the rules of type 1 derived from Figure 1 are [S → NP VP; NP → DT NNS; VP → VBD ADJP; ADJP → ADJP CC ADJP; ADJP → JJ; ADJP → RB JJ].
- Rule type 2: lexicalized production rules (i.e., all production rules). For instance, the rules of type 2 derived from Figure 1 are [DT → The; NNS → rooms; VBD → were; JJ → spacious; CC → and; RB → very; JJ → fresh].
- Rule type 3: unlexicalized production rules combined with the grandparent node. For instance, the rules of type 3 derived from Figure 1 are [ROOT S → NP VP; S NP → DT NNS; S VP → VBD ADJP; VP ADJP → ADJP CC ADJP; ADJP ADJP → JJ; ADJP ADJP → RB JJ].
- Rule type 4: lexicalized production rules combined with the grandparent node. For instance, the rules of type 4 derived from Figure 1 are [NP DT → The; NP NNS → rooms; VP VBD → were; ADJP JJ → spacious; ADJP CC → and; ADJP RB → very; ADJP JJ → fresh].
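As a concrete illustration, all four rule types can be extracted from a parse tree with one short traversal. The sketch below hand-encodes the Figure 1 parse of "The rooms were spacious and very fresh" as nested tuples (a stand-in for real parser output); the names `TREE`, `productions`, and `rule_types` are ours, not the paper's.

```python
# Figure 1's parse tree as (label, children...) tuples; leaves are strings.
TREE = ("ROOT",
        ("S",
         ("NP", ("DT", "The"), ("NNS", "rooms")),
         ("VP", ("VBD", "were"),
          ("ADJP",
           ("ADJP", ("JJ", "spacious")),
           ("CC", "and"),
           ("ADJP", ("RB", "very"), ("JJ", "fresh"))))))

def productions(node, parent):
    """Yield (grandparent_label, lhs, rhs, lexical) for every production."""
    label, *children = node
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    lexical = all(isinstance(c, str) for c in children)
    yield parent, label, rhs, lexical
    for child in children:
        if not isinstance(child, str):
            yield from productions(child, label)

def rule_types(tree):
    """Return the four PCFG feature lists described above."""
    root_label, top = tree  # skip ROOT's own production, as in the examples
    rules = list(productions(top, root_label))
    t1 = [f"{l}->{r}" for _, l, r, lex in rules if not lex]      # type 1
    t2 = [f"{l}->{r}" for _, l, r, lex in rules if lex]          # type 2
    t3 = [f"{p} {l}->{r}" for p, l, r, lex in rules if not lex]  # type 3
    t4 = [f"{p} {l}->{r}" for p, l, r, lex in rules if lex]      # type 4
    return t1, t2, t3, t4
```

Running `rule_types(TREE)` reproduces the four example rule lists above; in the paper each review is then represented by the counts of such rules.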
2.3. Support Vector Machine (SVM)
3. CoSpa—The Proposed Approach
3.1. Observation
3.2. The CoSpa Algorithm
Algorithm 1. The CoSpa algorithm, without details on selecting classified unlabeled reviews in the co-training process.
Input:
Output: two trained classifiers.
Procedure:
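The co-training loop of Algorithm 1 can be sketched as follows. This is an illustrative, dependency-free reconstruction, not the paper's implementation: a toy centroid scorer stands in for the SVM the paper actually trains, the two views are dicts of lexical-term and PCFG-rule counts, and all names (`co_train`, `most_confident`, `CentroidScorer`) are ours.

```python
# Sketch of co-training over two views. Each classifier is trained on the
# labeled set under its own view, labels the unlabeled pool, and the
# selection strategy decides which newly labeled reviews to keep.

def _dot(x, w):
    return sum(w.get(k, 0.0) * v for k, v in x.items())

class CentroidScorer:
    """Toy binary scorer: decision(x) = <x, mean(+)> - <x, mean(-)>.
    A stand-in for the SVM decision function used in the paper."""
    def fit(self, X, y):
        self.w = {}
        for sign in (1, -1):
            grp = [x for x, lab in zip(X, y) if lab == sign]
            for x in grp:
                for k, v in x.items():
                    self.w[k] = self.w.get(k, 0.0) + sign * v / max(len(grp), 1)
        return self
    def decision(self, x):
        return _dot(x, self.w)

def most_confident(scored, n=1, p=1):
    """Default selection: keep the p most confident positives and
    n most confident negatives (largest |decision value|)."""
    pos = sorted((i for i, d in scored.items() if d > 0), key=lambda i: -scored[i])
    neg = sorted((i for i, d in scored.items() if d <= 0), key=lambda i: scored[i])
    return {i: 1 for i in pos[:p]} | {i: -1 for i in neg[:n]}

def co_train(view1, view2, labeled, pool, select=most_confident, rounds=30):
    """view1/view2: {index: feature dict}; labeled: {index: +1/-1};
    pool: indices of unlabeled reviews. Returns both classifiers."""
    labeled, pool = dict(labeled), set(pool)
    clf1, clf2 = CentroidScorer(), CentroidScorer()
    for _ in range(rounds):
        if not pool:
            break
        for clf, view in ((clf1, view1), (clf2, view2)):
            clf.fit([view[i] for i in labeled], [labeled[i] for i in labeled])
            scored = {i: clf.decision(view[i]) for i in pool}
            for i, lab in select(scored).items():
                labeled[i] = lab
                pool.discard(i)
    return clf1, clf2, labeled
```

The `select` parameter is where Algorithm 2 (CoSpa-C) or Algorithm 3 (CoSpa-U) would plug in.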
Algorithm 2. The CoSpa-C strategy.
Input:
Output: a set of reviews with the highest confidence.
Procedure:
For each classified unlabeled review do
End for
Sort the reviews classified under each view in descending order of the absolute value of the SVM decision value;
Add the top-ranked reviews from each sorted list to the labeled set and remove them from the unlabeled set;
Return the selected set of reviews.
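Assuming the classifier exposes a signed decision value per review, the CoSpa-C selection step amounts to ranking by confidence. A minimal sketch, where the function name and the `scored` mapping (review index to decision value) are illustrative rather than the paper's notation:

```python
# CoSpa-C: keep the reviews the classifier is most confident about,
# i.e. those with the largest absolute decision value in each class.

def cospa_c_select(scored, n, p):
    """scored: {review_index: decision_value}. Returns the p most
    confident positives and n most confident negatives as {index: label}."""
    positives = sorted((i for i, d in scored.items() if d > 0),
                       key=lambda i: -abs(scored[i]))
    negatives = sorted((i for i, d in scored.items() if d <= 0),
                       key=lambda i: -abs(scored[i]))
    return {i: 1 for i in positives[:p]} | {i: -1 for i in negatives[:n]}
```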
Algorithm 3. The CoSpa-U strategy.
Input:
Output: a set of randomly sampled reviews.
Procedure:
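By contrast, CoSpa-U selects at random from the reviews the classifier has just labeled, so less typical reviews can also enter the labeled set; the paper spreads the sample across confidence levels, and plain uniform sampling is used below as a simplification. Names are again illustrative.

```python
import random

# CoSpa-U (simplified): sample labeled candidates at random per class
# instead of taking only the most confident ones.

def cospa_u_select(scored, n, p, rng=random):
    """scored: {review_index: decision_value}. Randomly keep up to p
    positives and n negatives as {index: label}."""
    positives = [i for i, d in scored.items() if d > 0]
    negatives = [i for i, d in scored.items() if d <= 0]
    picked_pos = rng.sample(positives, min(p, len(positives)))
    picked_neg = rng.sample(negatives, min(n, len(negatives)))
    return {i: 1 for i in picked_pos} | {i: -1 for i in picked_neg}
```

Passing a seeded `random.Random` as `rng` makes the sampling reproducible for experiments.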
4. Experiments
4.1. The Datasets
4.2. Experimental Setup
4.3. Experimental Results
5. Concluding Remarks
- Firstly, using the spam dataset, we observed that the distributions of lexical terms and of PCFG rules differ between deceptive and truthful reviews.
- Secondly, we proposed the CoSpa algorithm based on a support vector machine to implement co-training using two representations for each review: lexical terms and PCFG rules. Further, we proposed two strategies, CoSpa-C and CoSpa-U, to select informative unlabeled data to improve the performance of the CoSpa algorithm.
- Thirdly, we conducted experiments on the spam dataset and the deception dataset to evaluate the CoSpa algorithm against baseline methods. The experimental results demonstrate that both the CoSpa-C and CoSpa-U strategies outperform a traditional SVM with lexical terms or PCFG rules in spam review detection, that the CoSpa-U strategy outperforms the CoSpa-C strategy, and that the representation using PCFG rules outperforms the representation using lexical terms. We explain these outcomes in the paper.
Acknowledgments
Author Contributions
Conflicts of Interest
References
- Aljukhadar, M.; Senecal, S. The user multifaceted expertise: Divergent effects of the website versus e-commerce expertise. Int. J. Inf. Manag. 2016, 36, 322–332. [Google Scholar] [CrossRef]
- Xiang, Z.; Magnini, V.P.; Fesenmaier, D.R. Information technology and consumer behavior in travel and tourism: Insights from travel planning using the Internet. J. Retail. Consum. Serv. 2015, 22, 244–249. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, S.; Wang, Q. KSAP: An approach to bug report assignment using KNN search and heterogeneous proximity. Inf. Softw. Technol. 2016, 70, 68–84. [Google Scholar] [CrossRef]
- Li, H.; Chen, Z.; Liu, B.; Wei, X.; Shao, J. Spotting Fake Reviews via Collective Positive-Unlabeled Learning. In Proceedings of 2014 IEEE International Conference on Data Mining (ICDM), Shenzhen, China, 14–17 December 2014.
- Ott, M.; Choi, Y.; Cardie, C.; Hancock, J.T. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, OR, USA, 19–24 June 2011; pp. 309–319.
- Pennebaker, J.W.; Chung, C.K.; Ireland, M.; Gonzales, A.; Booth, R.J. The Development and Psychometric Properties of LIWC2007; LIWC.net: Austin, TX, USA, 2007. [Google Scholar]
- Feng, S.; Banerjee, R.; Choi, Y. Syntactic Stylometry for Deception Detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Korea, 8–14 July 2012; pp. 171–175.
- Feng, V.W.; Hirst, G. Detecting deceptive opinions with profile compatibility. In Proceedings of the International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–18 October 2013; pp. 338–346.
- Zhou, L.; Shi, Y.; Zhang, D. A Statistical Language Modeling Approach to Online Deception Detection. IEEE Trans. Knowl. Data Eng. 2008, 20, 1077–1081. [Google Scholar] [CrossRef]
- Li, H.; Chen, Z.; Mukherjee, A.; Liu, B.; Shao, J. Analyzing and Detecting Opinion Spam on a Large-Scale Dataset via Temporal and Spatial Patterns. In Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM-15), Oxford, UK, 26–29 May 2015.
- Jindal, N.; Liu, B. Opinion Spam and Analysis. In Proceedings of 2008 International Conference on Web Search and Data Mining (WSDM’08), Palo Alto, CA, USA, 11–12 February 2008.
- Li, F.; Huang, M.; Yang, Y.; Zhu, X. Learning to Identify Review Spam. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’11), Barcelona, Spain, 16–22 July 2011.
- A Statistical Analysis of 1.2 Million Amazon Reviews. Available online: http://minimaxir.com/2014/06/reviewing-reviews/ (accessed on 1 March 2016).
- Fact Sheet of Tripadvisor. Available online: http://www.tripadvisor.com/PressCenter-c4-Fact_Sheet.html (accessed on 1 March 2016).
- Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT ’98), Madison, WI, USA, 24–26 July 1998; pp. 92–100.
- Heydari, A.; Tavakoli, M.; Salim, N.; Heydari, Z. Detection of review spam: A survey. Expert Syst. Appl. 2015, 42, 3634–3642. [Google Scholar] [CrossRef]
- Fusilier, D.H.; Montes-y-Gómez, M.; Rosso, P.; Cabrera, R.G. Detecting positive and negative deceptive opinions using PU-learning. Inf. Process. Manag. 2015, 51, 433–443. [Google Scholar] [CrossRef]
- Ben-David, S.; Lu, T.; Pal, D. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In Proceedings of the 21st Annual Conference on Learning Theory, Helsinki, Finland, 9–12 July 2008; pp. 33–44.
- Krogel, M.-A.; Scheffer, T. Multi-Relational Learning, Text Mining, and Semi-Supervised Learning for Functional Genomics. Mach. Learn. 2004, 57, 61–81. [Google Scholar]
- Wang, W.Y.; Thadani, K.; McKeown, K.R. Identifying Event Descriptions using Co-training with Online News Summaries. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, 8–13 November 2011.
- Mihalcea, R. Co-training and self-training for word sense disambiguation. In Proceedings of the 2nd Conference on Computational Natural Language Learning, Boston, MA, USA, 26–27 May 2004.
- Du, J.; Ling, C.X.; Zhou, Z.H. When does co-training work in real data? IEEE Trans. Knowl. Data Eng. 2011, 23, 788–799. [Google Scholar] [CrossRef]
- Liu, W.; Li, Y.; Tao, D.; Wang, Y. A general framework for co-training and its applications. Neurocomputing 2015, 167, 112–121. [Google Scholar] [CrossRef]
- Collins, M. Probabilistic Context-Free Grammars (PCFGs). 2013. Available online: http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/pcfgs.pdf (accessed on 26 February 2016).
- Klein, D.; Manning, C.D. Accurate Unlexicalized Parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 423–430.
- Wan, X. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 2–7 August 2009; pp. 235–243. [Google Scholar]
- Vapnik, V. The Nature of Statistical Learning Theory; Springer-Verlag: New York, NY, USA, 1995. [Google Scholar]
- Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Sidney, S. Non-parametric Statistics for the Behavioral Sciences; McGraw-Hill: New York, NY, USA, 1956; pp. 75–83. [Google Scholar]
- Li, J.; Ott, M.; Cardie, C.; Hovy, E. Towards a General Rule for Identifying Deceptive Opinion Spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, MD, USA, 22–27 June 2014; pp. 1566–1576.
- Stanford POS Tagger for English part-of-speech. Available online: http://nlp.stanford.edu/software/tagger.shtml (accessed on 1 March 2016).
- USPTO stop words. Available online: http://ftp.uspto.gov/patft/help/stopword.htm (accessed on 1 March 2016).
- Porter stemming algorithm. Available online: http://tartarus.org/martin/PorterStemmer/ (accessed on 1 March 2016).
- Weiss, S.M.; Indurkhya, N.; Zhang, T.; Damerau, F. Text Mining: Predictive Methods for Analyzing Unstructured Information; Springer-Verlag: New York, NY, USA, 2004; pp. 36–37. [Google Scholar]
- Penn Treebank Tag-set. Available online: http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html (accessed on 1 March 2016).
- Zhang, W.; Yoshida, T.; Tang, X. Text classification based on multi-word with support vector machine. Knowl.-Based Syst. 2008, 21, 879–886. [Google Scholar] [CrossRef]
- Liu, B.; Feng, J.; Liu, M.; Hu, H.; Wang, X. Predicting the quality of user-generated answers using co-training in community-based question answering portals. Pattern Recognit. Lett. 2015, 58, 29–34. [Google Scholar] [CrossRef]
- Hong, Y.; Zhu, W. Spatial Co-Training for Semi-Supervised Image Classification. Pattern Recognit. Lett. 2015, 63, 59–65. [Google Scholar] [CrossRef]
- Ravi, K.; Ravi, V. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowl.-Based Syst. 2015, 89, 14–46. [Google Scholar] [CrossRef]
- Xia, R.; Xu, F.; Yu, J.; Qi, Y.; Cambria, E. Polarity shift detection, elimination and ensemble: A three-stage model for document-level sentiment analysis. Inf. Process. Manag. 2016, 52, 36–45. [Google Scholar] [CrossRef]
Rank | N.D. (Negative Deceptive) | N.T. (Negative Truthful) | P.D. (Positive Deceptive) | P.T. (Positive Truthful) |
---|---|---|---|---|
1 | Smell (116) | Floor (136) | Visit (121) | Bathroom (116) |
2 | Luxury (86) | Charge (121) | Experience (104) | Breakfast (106) |
3 | Recommend (76) | Small (116) | Luxury (104) | Floor (95) |
4 | Suit (71) | Water (79) | Amazing (89) | Park (89) |
5 | Wife (69) | Wall (73) | Husband (83) | Bar (77) |
6 | Money (68) | Phone (73) | Food (81) | Lobby (73) |
7 | Towel (67) | Bar (72) | Travel (72) | Small (64) |
8 | Clerk (66) | Shower (68) | Wife (54) | Concierge (60) |
9 | Weekend (65) | Breakfast (65) | Fantastic (49) | Street (51) |
10 | Smoke (60) | Walk (59) | Vacation (45) | Coffee (46) |
p-Value | N.D. | N.T. | P.D. | P.T. |
---|---|---|---|---|
N.D. | - | 0.05** | 0.23 | 0.05** |
N.T. | 0.05** | - | 0.05** | 0.25 |
P.D. | 0.23 | 0.23 | - | 0.05** |
Rank | N.D. | N.T. | P.D. | P.T. |
---|---|---|---|---|
1 | VP VBD-was (2000) | ROOT S-NP VP (1848) | TO-to (1268) | S S-NP VP (1219) |
2 | S S-NP VP (1870) | S-NP VP (4004) | S-VP (1260) | NP PRP-we (886) |
3 | PRP$-my (737) | VBZ-is (67) | NNP-Chicago (565) | NP-NNS (500) |
4 | NP PRP$-my (736) | VP VBZ-is (631) | NP PRP$-my (472) | ADJP RB-very (371) |
5 | VBD-had (628) | NP-NNP (596) | VP SBAR-IN S (390) | NP-NP (367) |
6 | VP VBD-had (626) | NP CC-and (553) | VP PP-TO NP (361) | JJ-great (353) |
7 | S VP-MD VP (517) | ADJP-RB JJ (531) | PP NP-NN (356) | NP NP-DT JJ NN (343) |
8 | VP-VP CC VP (510) | PP NP-NN (526) | NP-NP CC NP (313) | PP IN-on (336) |
9 | VP PP-TO NP (509) | NP-DT NN NN (509) | VP-MD VP (310) | NP NP-NN (344) |
10 | WHADVP-WRB (491) | NP-DT NNS (493) | VP-VB NP (303) | VP PP-TO NP (335) |
p-Value | PCFG N.D. | PCFG N.T. | PCFG P.D. | PCFG P.T. |
---|---|---|---|---|
PCFG N.D. | - | 0.05** | 0.28 | 0.05** |
PCFG N.T. | 0.05** | - | 0.05** | 0.31 |
PCFG P.D. | 0.28 | 0.29 | - | 0.05** |
Polarity | Category | # of Hotels | # of Reviews | # of Sentences |
---|---|---|---|---|
Positive | Deceptive_from_MTurk | 20 | 400 | 3043 |
Positive | Truthful_from_Web | 20 | 400 | 3480 |
Negative | Deceptive_from_MTurk | 20 | 400 | 4149 |
Negative | Truthful_from_Web | 20 | 400 | 4483 |
Subject | Category | # of reviews | # of Sentences |
---|---|---|---|
doctor | Deceptive_MTurk | 356 | 2369 |
doctor | Truthful | 200 | 1151 |
restaurant | Deceptive_MTurk | 201 | 1827 |
restaurant | Truthful | 200 | 1892 |
Dataset | Method Pair | K = 1 | K = 5 | K = 10 | K = 15 | K = 20 | K = 25 | K = 30 |
---|---|---|---|---|---|---|---|---|
Spam | Term with CoSpa-C vs. Term | >> | >> | >> | >> | >> | >> | >> |
Spam | Term with CoSpa-C vs. PCFG | >> | >> | >> | >> | >> | >> | >> |
Spam | Term with CoSpa-C vs. Term&PCFG | > | >> | >> | >> | >> | >> | >> |
Spam | Term with CoSpa-C vs. PCFG with CoSpa-C | < | < | ~ | << | << | << | << |
Spam | Term with CoSpa-U vs. Term | >> | >> | >> | >> | >> | >> | >> |
Spam | Term with CoSpa-U vs. PCFG | >> | >> | >> | >> | >> | >> | >> |
Spam | Term with CoSpa-U vs. Term&PCFG | >> | >> | >> | >> | >> | >> | >> |
Spam | Term with CoSpa-U vs. PCFG with CoSpa-U | < | < | << | << | << | << | << |
Spam | PCFG with CoSpa-C vs. PCFG with CoSpa-U | << | << | << | << | << | << | << |
Deception | Term with CoSpa-C vs. Term | >> | >> | >> | >> | >> | >> | >> |
Deception | Term with CoSpa-C vs. PCFG | < | ~ | >> | >> | >> | >> | >> |
Deception | Term with CoSpa-C vs. Term&PCFG | << | ~ | > | >> | >> | >> | >> |
Deception | Term with CoSpa-C vs. PCFG with CoSpa-C | > | >> | ~ | << | << | << | << |
Deception | Term with CoSpa-U vs. Term | >> | >> | >> | >> | >> | >> | >> |
Deception | Term with CoSpa-U vs. PCFG | >> | >> | >> | >> | >> | >> | >> |
Deception | Term with CoSpa-U vs. Term&PCFG | > | >> | >> | >> | >> | >> | >> |
Deception | Term with CoSpa-U vs. PCFG with CoSpa-U | < | < | << | << | << | << | ~ |
Deception | PCFG with CoSpa-C vs. PCFG with CoSpa-U | << | << | << | << | << | << | << |
Dataset | Method Pair | n = p = 1 | n = p = 2 | n = p = 3 | n = p = 4 | n = p = 5 | n = p = 6 | n = p = 7 |
---|---|---|---|---|---|---|---|---|
Spam | Term with CoSpa-C vs. Term with CoSpa-U | << | << | << | << | << | - | - |
Spam | Term with CoSpa-C vs. PCFG with CoSpa-C | << | << | < | < | < | - | - |
Spam | Term with CoSpa-C vs. PCFG with CoSpa-U | > | >> | >> | >> | >> | - | - |
Spam | PCFG with CoSpa-C vs. Term with CoSpa-U | << | << | << | << | << | - | - |
Spam | PCFG with CoSpa-C vs. PCFG with CoSpa-U | << | << | << | << | << | - | - |
Spam | Term with CoSpa-U vs. PCFG with CoSpa-U | << | < | << | << | << | - | - |
Deception | Term with CoSpa-C vs. Term with CoSpa-U | << | << | << | << | << | << | << |
Deception | Term with CoSpa-C vs. PCFG with CoSpa-C | << | << | << | << | << | << | << |
Deception | Term with CoSpa-C vs. PCFG with CoSpa-U | << | << | << | << | << | << | << |
Deception | PCFG with CoSpa-C vs. Term with CoSpa-U | << | << | << | << | << | << | << |
Deception | PCFG with CoSpa-C vs. PCFG with CoSpa-U | << | << | << | << | << | << | << |
Deception | Term with CoSpa-U vs. PCFG with CoSpa-U | << | << | << | << | << | << | << |
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
Zhang, W.; Bu, C.; Yoshida, T.; Zhang, S. CoSpa: A Co-training Approach for Spam Review Identification with Support Vector Machine. Information 2016, 7, 12. https://doi.org/10.3390/info7010012