research-article

Text mining and probabilistic language modeling for online review spam detection

Published: 05 January 2012

Abstract

In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, this study contributes the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from Amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews and thereby develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.

        • Published in

          ACM Transactions on Management Information Systems  Volume 2, Issue 4
          December 2011
          141 pages
          ISSN: 2158-656X
          EISSN: 2158-6578
          DOI: 10.1145/2070710
          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 5 January 2012
          • Accepted: 1 September 2011
          • Revised: 1 August 2011
          • Received: 1 April 2011
          Published in TMIS, Volume 2, Issue 4

          Qualifiers

          • research-article
          • Research
          • Refereed
