Research article · DOI: 10.1145/2381896.2381910

Tracking concept drift in malware families

Published: 19 October 2012

ABSTRACT

Previous efforts in the use of machine learning for malware detection have assumed that the malware population is stationary, i.e., that the probability distribution of the observed characteristics (features) of the malware population does not change over time. In this paper, we investigate this assumption for malware families as populations. Malware, by design, constantly evolves so as to defeat detection. Evolution in malware may lead to a nonstationary malware population. The problem of nonstationary populations has been called concept drift in machine learning. Tracking concept drift is critical to the successful application of ML-based methods for malware detection. If evolution causes the malware population to drift rapidly, then frequent retraining of classifiers may be required to prevent degradation in performance. On the other hand, if the drift is found to be negligible, then ML-based methods are robust for such populations over long periods of time.
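To make the retraining consideration concrete, here is a minimal sketch, not taken from the paper, of monitoring a detector across time windows and retraining when its accuracy on newer samples degrades. The window structure, the Bernoulli naive Bayes classifier, and the 0.9 accuracy threshold are all illustrative assumptions.

```python
# Illustrative sketch only: train on the oldest time window, score each
# newer window, and retrain when accuracy drops below a threshold.
from sklearn.naive_bayes import BernoulliNB

def monitor_and_retrain(windows, threshold=0.9):
    """windows: list of (X, y) pairs ordered by time, where X is a binary
    feature matrix (e.g., n-gram presence) and y the malware/benign labels.
    The threshold value is an assumption, not a figure from the paper."""
    clf = BernoulliNB()
    X0, y0 = windows[0]
    clf.fit(X0, y0)
    for t, (X, y) in enumerate(windows[1:], start=1):
        acc = clf.score(X, y)          # accuracy on the newer window
        if acc < threshold:            # drift suspected: retrain
            clf.fit(X, y)
            print(f"window {t}: acc={acc:.2f}, retrained")
        else:
            print(f"window {t}: acc={acc:.2f}, kept existing model")
    return clf
```

If the population drifts rapidly, the sketch retrains often; if drift is negligible, the initial model is kept across windows, which is the distinction the abstract draws.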

We propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. We illustrate the use of the proposed measures with a study of 3500+ samples from three families of x86 malware, spanning over 5 years. The results of the study show negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples. The measures can likewise be applied to track drift in any number of malware families. Tracking drift in this manner also provides a novel method for feature type selection, i.e., use the feature type that drifts the least.
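As an illustration of the kind of features and comparison involved, here is a hedged sketch: mnemonic 2-grams (pairs of consecutive instruction mnemonics from a disassembled, unpacked sample) are collected per time window, and consecutive windows are compared by set similarity. The paper defines its own relative temporal similarity measure over such feature sets; the specific choices below (window profiles as set unions, Jaccard as the similarity) are assumptions for illustration only.

```python
# Illustrative sketch: mnemonic 2-gram extraction and a simple
# window-to-window similarity. Not the paper's exact measure.
from itertools import pairwise  # Python 3.10+

def mnemonic_2grams(mnemonics):
    """mnemonics: ordered list of instruction mnemonics from one
    disassembled, unpacked sample, e.g. ["push", "mov", "call", "ret"]."""
    return set(pairwise(mnemonics))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def window_profile(samples):
    """Union of 2-gram sets over all samples in one time window
    (an assumed way to summarize a window's feature population)."""
    profile = set()
    for mnemonics in samples:
        profile |= mnemonic_2grams(mnemonics)
    return profile

def temporal_similarities(windows):
    """Similarity of each window's profile to the previous window's.
    Values staying near 1.0 over the years would indicate negligible
    drift in this feature type, per the abstract's finding."""
    profiles = [window_profile(w) for w in windows]
    return [jaccard(p, q) for p, q in zip(profiles, profiles[1:])]
```

A feature type whose similarity sequence stays high would be preferred under the abstract's selection criterion of using the feature type that drifts the least.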


Published in
AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence
October 2012, 116 pages
ISBN: 9781450316644
DOI: 10.1145/2381896
Copyright © 2012 ACM


Publisher
Association for Computing Machinery, New York, NY, United States

Acceptance Rates
AISec '12 paper acceptance rate: 10 of 24 submissions, 42%. Overall acceptance rate: 94 of 231 submissions, 41%.
