An Advanced Spam Detection Technique Based on Self-adaptive Piecewise Hash Algorithm

  • Conference paper
Web Technologies and Applications (APWeb 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8710)

Abstract

Email spam continues to grow rapidly, and many spam detection algorithms have been developed in response. However, most of these algorithms share several shortcomings. To address them, we present an advanced spam detection technique (ASDT) based on the extremum characteristic theory, the Rabin fingerprint algorithm, a modified Bayesian method and optimization theory. We designed several experiments to evaluate ASDT's accuracy, speed and robustness, comparing it with SFSPH, SFSPH-S, the well-known DSC algorithm and the Email Remove-duplicate Algorithm Based on SHA-1 (ERABS). In these experiments, ASDT achieved the best accuracy, speed and robustness in spam filtering among the compared algorithms.
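The abstract names its building blocks without giving details, so the following is only a rough sketch of how piecewise hashing with extremum-based chunk boundaries can work in general; it is not the authors' ASDT. It rests on several assumptions: a Rabin-Karp-style polynomial rolling hash stands in for the Rabin fingerprint, a chunk boundary is placed wherever a window fingerprint is a strict local maximum, chunks are digested with SHA-1, and two messages are compared by the Jaccard overlap of their chunk digests. The constants `WINDOW`, `RADIUS`, `BASE` and `MOD` and all function names are hypothetical, and the modified Bayesian and optimization steps are not shown.

```python
import hashlib

WINDOW = 16          # bytes per rolling-hash window (illustrative value)
RADIUS = 8           # neighbours checked on each side for a local maximum (illustrative)
BASE = 257           # polynomial base (assumption)
MOD = (1 << 61) - 1  # large prime modulus (assumption)


def rolling_fingerprints(data: bytes) -> list:
    """Return the fingerprint of every WINDOW-byte window of data."""
    if len(data) < WINDOW:
        return []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    fp = 0
    for b in data[:WINDOW]:
        fp = (fp * BASE + b) % MOD
    fps = [fp]
    for i in range(WINDOW, len(data)):
        # Drop the oldest byte, shift, and append the newest byte.
        fp = ((fp - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        fps.append(fp)
    return fps


def split_at_extrema(data: bytes) -> list:
    """Cut data after every window whose fingerprint is a strict local maximum."""
    fps = rolling_fingerprints(data)
    chunks, start = [], 0
    for i, fp in enumerate(fps):
        lo, hi = max(0, i - RADIUS), min(len(fps), i + RADIUS + 1)
        if all(fp > fps[j] for j in range(lo, hi) if j != i):
            end = i + WINDOW
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def signature(data: bytes) -> set:
    """Set of SHA-1 digests, one per chunk."""
    return {hashlib.sha1(c).hexdigest() for c in split_at_extrema(data)}


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard overlap of the two chunk-digest sets, in [0, 1]."""
    sa, sb = signature(a), signature(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Because each boundary depends only on nearby bytes, an edit in one part of a message changes the chunks around it but tends to leave chunks elsewhere intact, which is what makes this family of signatures useful for spotting near-duplicate spam. Calling `similarity(known_spam, incoming)` on two byte strings returns a score that a filter could compare against a threshold chosen by the operator.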

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhu, J., Li, A. (2014). An Advanced Spam Detection Technique Based on Self-adaptive Piecewise Hash Algorithm. In: Han, W., Huang, Z., Hu, C., Zhang, H., Guo, L. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8710. Springer, Cham. https://doi.org/10.1007/978-3-319-11119-3_14

  • DOI: https://doi.org/10.1007/978-3-319-11119-3_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11118-6

  • Online ISBN: 978-3-319-11119-3

  • eBook Packages: Computer Science, Computer Science (R0)
