ABSTRACT
Previous efforts to apply machine learning to malware detection have assumed that the malware population is stationary, i.e., that the probability distribution of the observed characteristics (features) of the malware population does not change over time. In this paper, we investigate this assumption, treating malware families as populations. Malware, by design, constantly evolves to defeat detection. This evolution may lead to a nonstationary malware population. The problem of nonstationary populations is called concept drift in machine learning. Tracking concept drift is critical to the successful application of ML-based methods for malware detection. If evolution causes the malware population to drift rapidly, then frequent retraining of classifiers may be required to prevent degradation in performance. On the other hand, if the drift is found to be negligible, then ML-based methods remain robust for such populations over long periods of time.
We propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. We illustrate the use of the proposed measures with a study of 3500+ samples from three families of x86 malware, spanning more than 5 years. The results of the study show negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples. The measures can likewise be applied to track drift in any number of malware families. Tracking drift in this manner also provides a novel method for feature type selection: use the feature type that drifts the least.
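The abstract does not define relative temporal similarity or metafeatures, so the following is only a minimal sketch of the general idea of tracking drift in mnemonic 2-grams over time. It assumes each sample is already unpacked and disassembled into a list of mnemonics, groups samples into time windows, and uses plain Jaccard similarity against the earliest window as a stand-in measure. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def mnemonic_2grams(mnemonics):
    """Return the set of adjacent mnemonic pairs (2-grams) in one sample."""
    return set(zip(mnemonics, mnemonics[1:]))


def window_features(samples):
    """Union of the 2-gram sets of all samples observed in one time window."""
    features = set()
    for sample in samples:
        features |= mnemonic_2grams(sample)
    return features


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_to_baseline(windows):
    """Compare every later time window with the first (baseline) window.

    `windows` is a list of time windows in chronological order; each window
    is a list of samples, and each sample is a list of mnemonics.
    Values near 1.0 suggest little drift; falling values suggest drift.
    """
    baseline = window_features(windows[0])
    return [jaccard(baseline, window_features(w)) for w in windows[1:]]


# Toy data (assumed, for illustration only): three time windows of one family.
windows = [
    [["push", "mov", "call", "ret"]],
    [["push", "mov", "call", "ret"], ["push", "mov", "call"]],
    [["xor", "mov", "call", "ret"]],
]
scores = similarity_to_baseline(windows)
print(scores)
```

Under this stand-in measure, a family whose scores stay close to 1.0 across windows would need less frequent retraining than one whose scores fall quickly, and repeating the comparison for several feature types would show which type drifts the least.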