research-article

Text mining and probabilistic language modeling for online review spam detection

Authors:
Raymond Y. K. Lau, City University of Hong Kong, China
S. Y. Liao, City University of Hong Kong, China
Ron Chi-Wai Kwok, City University of Hong Kong, China
Kaiquan Xu, Nanjing University, China
Yunqing Xia, Tsinghua University, China
Yuefeng Li, Queensland University of Technology, Australia

ACM Transactions on Management Information Systems, Volume 2, Issue 4, Article No. 25, pp. 1–30. https://doi.org/10.1145/2070710.2070716

Published: 05 January 2012

Abstract

In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, the main contribution of this study is the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews to develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.

Index Terms

Text mining and probabilistic language modeling for online review spam detection
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
  2. Information systems applications


Published in

ACM Transactions on Management Information Systems Volume 2, Issue 4
December 2011
141 pages
ISSN: 2158-656X
EISSN: 2158-6578
DOI: 10.1145/2070710

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 January 2012
- Accepted: 1 September 2011
- Revised: 1 August 2011
- Received: 1 April 2011
Author Tags
Language models
design science
review spam
spam detection
text mining
Qualifiers
- research-article
- Research
- Refereed