Abstract
In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, the main contribution of this study is the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews to develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.
Index Terms
- Text mining and probabilistic language modeling for online review spam detection