research-article

Tracking concept drift in malware families

Authors:
Anshuman Singh, University of Louisiana at Lafayette, Lafayette, LA, USA
Andrew Walenstein, University of Louisiana at Lafayette, Lafayette, LA, USA
Arun Lakhotia, University of Louisiana at Lafayette, Lafayette, LA, USA

AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence, October 2012, Pages 81–92
https://doi.org/10.1145/2381896.2381910

Published: 19 October 2012

ABSTRACT

Previous efforts to use machine learning for malware detection have assumed that the malware population is stationary, i.e., that the probability distribution of the observed characteristics (features) of the malware population does not change over time. In this paper, we investigate this assumption for malware families as populations. Malware, by design, constantly evolves so as to defeat detection. Evolution in malware may lead to a nonstationary malware population. The problem of nonstationary populations is called concept drift in machine learning. Tracking concept drift is critical to the successful application of ML-based methods for malware detection. If evolution causes the malware population to drift rapidly, then frequent retraining of classifiers may be required to prevent degradation in performance. On the other hand, if the drift is found to be negligible, then ML-based methods are robust for such populations over long periods of time.
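
The stationarity assumption can be written compactly. The notation here is an assumption for illustration and is not taken from the paper: let $P_t(\mathbf{x})$ denote the probability distribution of the feature vector $\mathbf{x}$ observed for a malware family at time $t$.

```latex
\[
  \text{stationary:}\quad P_{t}(\mathbf{x}) = P_{t'}(\mathbf{x}) \quad \text{for all } t,\, t'
\]
\[
  \text{concept drift:}\quad \exists\, t \neq t' \ \text{such that}\ P_{t}(\mathbf{x}) \neq P_{t'}(\mathbf{x})
\]
```

Under this reading, a classifier trained on samples drawn from $P_t$ is only guaranteed to see the same distribution at test time if the first condition holds.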

We propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. We illustrate the use of the proposed measures with a study of 3500+ samples from three families of x86 malware, spanning over 5 years. The results of the study show negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples. The measures can likewise be applied to track drift in any number of malware families. Tracking drift in this manner also provides a novel method for feature type selection, i.e., use the feature type that drifts the least.
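
The abstract does not define relative temporal similarity or the metafeatures, so the following is only a rough sketch of the general idea of comparing mnemonic 2-gram profiles across time. It assumes each sample has already been unpacked and disassembled into a list of instruction mnemonics, groups samples into time windows, and uses cosine similarity between aggregated 2-gram counts as a stand-in similarity measure. The function names and the choice of cosine similarity are assumptions for illustration, not the authors' method.

```python
from collections import Counter
from math import sqrt


def mnemonic_ngrams(mnemonics, n=2):
    """Return the consecutive n-grams over one sample's mnemonic sequence."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def window_profile(samples, n=2):
    """Aggregate n-gram counts over all samples in one time window."""
    profile = Counter()
    for mnemonics in samples:
        profile.update(mnemonic_ngrams(mnemonics, n))
    return profile


def cosine_similarity(p, q):
    """Cosine similarity between two n-gram count profiles (stand-in measure)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def similarity_over_time(windows, n=2):
    """Similarity between each consecutive pair of time windows."""
    profiles = [window_profile(w, n) for w in windows]
    return [cosine_similarity(a, b) for a, b in zip(profiles, profiles[1:])]


# Hypothetical toy data: three time windows, one sample each.
toy_windows = [
    [["push", "mov", "call"]],
    [["push", "mov", "call"]],
    [["xor", "jmp", "ret"]],
]
toy_similarities = similarity_over_time(toy_windows)
```

In this toy data the first two windows share the same 2-grams and the third shares none, so the similarity series drops from about one to zero. A series that stays high over time would suggest little drift in that feature type, while a falling series would suggest that retraining may be needed.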

Index Terms

Tracking concept drift in malware families
1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
2. Social and professional topics
  1. Computing / technology policy
    1. Computer crime

Published in
AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence
October 2012
116 pages
ISBN: 9781450316644
DOI: 10.1145/2381896
General Chair: Ting Yu, North Carolina State University, USA
Program Chairs: V. N. Venkatakrishan, University of Illinois at Chicago, USA; Apu Kapadia, Indiana University, Bloomington, USA
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
concept drift
malware
metafeatures
temporal similarity
Acceptance Rates
AISec '12 Paper Acceptance Rate: 10 of 24 submissions, 42%. Overall Acceptance Rate: 94 of 231 submissions, 41%.