ABSTRACT
Twitter spam has long been a critical but difficult problem. Researchers have developed a series of machine learning-based methods and blacklisting techniques to detect spamming activities on Twitter. According to our investigation, current methods and techniques achieve an accuracy of around 80%. However, due to the problems of spam drift and information fabrication, these machine learning-based methods cannot efficiently detect spam activities in real-life scenarios. Moreover, blacklisting cannot keep up with the variations of spamming activities, because manually inspecting suspicious URLs is extremely time-consuming. In this paper, we propose a novel technique based on deep learning to address these challenges. The syntax of each tweet is learned through WordVector Training Mode. We then construct a binary classifier on the resulting vector representations. In our experiments, we collected a 10-day dataset of real tweets to evaluate the proposed method. We first studied the performance of different classifiers, and then compared our method to existing text-based methods. We found that our method largely outperformed them. We further compared our method to non-text-based detection techniques. According to the experimental results, our proposed method was more accurate.
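The two-stage pipeline described in the abstract (learn vector representations of tweets, then train a binary classifier on them) can be illustrated with a toy sketch. The abstract does not specify the architecture, so everything below is an assumption made for illustration: a tiny skip-gram model with negative sampling stands in for the word vector training, tweet vectors are plain averages of word vectors, and logistic regression stands in for the classifier. The function names, hyperparameters, and the six example tweets are hypothetical and are not taken from the paper.

```python
import math
import random


def tokenize(tweet):
    # Assumption: whitespace tokenization on lowercased text.
    return tweet.lower().split()


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_word_vectors(tweets, dim=8, window=2, negatives=2,
                       epochs=50, lr=0.05, seed=0):
    """Toy skip-gram with negative sampling; returns {word: vector}."""
    rng = random.Random(seed)
    docs = [tokenize(t) for t in tweets]
    vocab = sorted({w for d in docs for w in d})
    w_in = {w: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for w in vocab}
    w_out = {w: [0.0] * dim for w in vocab}
    for _ in range(epochs):
        for doc in docs:
            for i, center in enumerate(doc):
                lo, hi = max(0, i - window), min(len(doc), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    # One observed context word plus random negative samples.
                    samples = [(doc[j], 1.0)]
                    samples += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
                    v = w_in[center]
                    grad_v = [0.0] * dim
                    for word, label in samples:
                        u = w_out[word]
                        g = lr * (label - sigmoid(dot(v, u)))
                        for k in range(dim):
                            grad_v[k] += g * u[k]
                            u[k] += g * v[k]
                    for k in range(dim):
                        v[k] += grad_v[k]
    return w_in


def tweet_vector(tweet, vectors, dim):
    """Average the vectors of known words; zero vector if none are known."""
    known = [vectors[w] for w in tokenize(tweet) if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def train_classifier(features, labels, epochs=200, lr=0.5):
    """Logistic regression trained with stochastic gradient descent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = y - sigmoid(bias + dot(weights, x))
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


def spam_probability(tweet, vectors, model, dim):
    weights, bias = model
    return sigmoid(bias + dot(weights, tweet_vector(tweet, vectors, dim)))


# Hypothetical labelled tweets (1 = spam, 0 = non-spam).
TWEETS = [
    "win free iphone click here",
    "free followers click this link",
    "click here to win cash now",
    "having coffee with friends today",
    "great match last night with friends",
    "reading a good book today",
]
LABELS = [1, 1, 1, 0, 0, 0]
DIM = 8

vectors = train_word_vectors(TWEETS, dim=DIM)
features = [tweet_vector(t, vectors, DIM) for t in TWEETS]
model = train_classifier(features, LABELS)
print(spam_probability("click here for free cash", vectors, model, DIM))
```

With only six tweets the learned vectors carry little signal, so the printed probability should not be read as meaningful. The sketch only shows how the representation step and the classification step fit together; a realistic setup would train the vectors on a large tweet corpus and evaluate the classifier on held-out data.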