Twitter Content-Based Spam Filtering

Santos, Igor; Miñambres-Marcos, Igor; Laorden, Carlos; Galán-García, Patxi; Santamaría-Ibirika, Aitor; Bringas, Pablo García

doi:10.1007/978-3-319-01854-6_46

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 239)

Abstract

Twitter has become one of the most widely used social networks and, as happens with every popular medium, it is prone to misuse. In this context, spam on Twitter has emerged in recent years and has become an important problem for its users. Several approaches have appeared that are able to determine whether a user is a spammer. However, these blacklisting systems cannot filter every spam message, and a spammer may create another account and resume sending spam. In this paper, we propose a content-based approach to filter spam tweets. We use the text of the tweet, together with machine learning and compression algorithms, to filter out those undesired tweets.
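
The idea of classifying a tweet from its text alone with a compression algorithm can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the compression models used, so the general-purpose zlib (DEFLATE) compressor stands in here as an assumption. The toy tweets, the labels "spam" and "ham", and the function names are all hypothetical and chosen only for illustration. A tweet is assigned to the class whose training text compresses it most cheaply, that is, the class whose corpus grows least in compressed size when the tweet is appended.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def train(examples):
    """Join the training tweets of each label into one corpus string."""
    grouped = {}
    for label, text in examples:
        grouped.setdefault(label, []).append(text)
    return {label: "\n".join(texts) for label, texts in grouped.items()}


def score(corpora, tweet):
    """Extra compressed bytes needed to append the tweet to each corpus."""
    return {
        label: compressed_size(corpus + "\n" + tweet) - compressed_size(corpus)
        for label, corpus in corpora.items()
    }


def classify(corpora, tweet):
    """Pick the label whose corpus compresses the tweet most cheaply."""
    scores = score(corpora, tweet)
    return min(scores, key=scores.get)


# Hypothetical toy data, invented for this sketch (not from the paper).
TOY_TWEETS = [
    ("spam", "win a free iphone now click here http://spam.example/a"),
    ("spam", "free followers click here http://spam.example/b"),
    ("spam", "cheap pills free shipping click now"),
    ("ham", "heading to the conference in bilbao tomorrow"),
    ("ham", "great talk on machine learning today"),
    ("ham", "coffee with friends this afternoon"),
]

model = train(TOY_TWEETS)

if __name__ == "__main__":
    for tweet in ("free followers click here now",
                  "coffee with friends at the conference"):
        print(tweet, "->", classify(model, tweet), score(model, tweet))
```

Because the decision depends only on the tweet text, a filter of this kind does not need to know which account sent the message, which is the property the abstract contrasts with blacklisting. A realistic filter would need a labelled corpus far larger than these six lines, and a general-purpose compressor is only a rough stand-in for a dedicated statistical compression model.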

Author information

Authors and Affiliations

DeustoTech-Computing, Deusto Institute of Technology (DeustoTech), Avenida de las Universidades 24, 48007, Bilbao, Spain
Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, Patxi Galán-García, Aitor Santamaría-Ibirika & Pablo García Bringas

Corresponding author

Correspondence to Igor Santos.

Editor information

Editors and Affiliations

Department of Civil Engineering, University of Burgos, Burgos, Spain
Álvaro Herrero
Department of Civil Engineering, University of Burgos, Burgos, Spain
Bruno Baruque
German Workforce ADL Partnership Laboratory, Waltershausen, Germany
Fanny Klett
Scientific Network for Innovation and Research Excellence, Machine Intelligence Research Labs (MIR Labs), Auburn, Washington, USA
Ajith Abraham
Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VŠB-TU Ostrava, Ostrava, Czech Republic
Václav Snášel
Department of Computer Science, University of Sao Paulo at Sao Carlos, Sao Carlos, Brazil
André C.P.L.F. de Carvalho
DeustoTech Computing, University of Deusto, Bilbao, Spain
Pablo García Bringas
Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VŠB-TU Ostrava, Ostrava, Czech Republic
Ivan Zelinka
University of Salamanca, Salamanca, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

Santos, I., Miñambres-Marcos, I., Laorden, C., Galán-García, P., Santamaría-Ibirika, A., Bringas, P.G. (2014). Twitter Content-Based Spam Filtering. In: Herrero, Á., et al. International Joint Conference SOCO’13-CISIS’13-ICEUTE’13. Advances in Intelligent Systems and Computing, vol 239. Springer, Cham. https://doi.org/10.1007/978-3-319-01854-6_46

DOI: https://doi.org/10.1007/978-3-319-01854-6_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01853-9
Online ISBN: 978-3-319-01854-6
eBook Packages: Engineering, Engineering (R0)
