research-article

Malware Classification and Class Imbalance via Stochastic Hashed LZJD

Authors:
Edward Raff, Laboratory For Physical Sciences, Catonsville, MD, USA
Charles Nicholas, University of Maryland, Baltimore County, Catonsville, MD, USA

AISec '17: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, November 2017, Pages 111–120
https://doi.org/10.1145/3128572.3140446

Published: 03 November 2017

ABSTRACT

Few existing methods for malware classification can be applied without domain knowledge. In this work, we develop a new feature vector representation, SHWeL, by extending the recently proposed Lempel-Ziv Jaccard Distance (LZJD). These SHWeL vectors improve upon LZJD's accuracy, outperform byte n-grams, and allow us to build efficient algorithms for both training (a weakness of byte n-grams) and inference (a weakness of LZJD). SHWeL also allows us to directly tackle the class imbalance problem, which is common in malware-related tasks. Compared to existing methods like SMOTE, SHWeL provides significantly improved accuracy while reducing algorithmic complexity to O(N). Because our approach is developed without the use of domain knowledge, it can be easily re-applied to any new domain where there is a need to classify byte sequences.
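
For readers unfamiliar with the Lempel-Ziv Jaccard Distance that the abstract builds on, the following is a minimal sketch of the underlying idea only: parse each byte sequence into a set of substrings with a Lempel-Ziv style pass, then compare the two sets with the Jaccard distance. The function names `lz_set` and `lzjd_distance` are hypothetical. This exact-set version is an assumption made for illustration: it omits the hashing and sampling steps that make the method practical on large files, and it is not the SHWeL representation described in the paper.

```python
def lz_set(data: bytes) -> set:
    """Collect the distinct substrings found by a Lempel-Ziv style parse.

    Illustrative assumption: substrings are stored directly instead of
    being hashed, so memory use grows with the input size.
    """
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # First time this substring appears: record it and
            # begin a new substring at the next byte.
            seen.add(chunk)
            start = end
    return seen


def lzjd_distance(a: bytes, b: bytes) -> float:
    """Jaccard distance between the Lempel-Ziv sets of two byte strings."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        # Two empty inputs are treated as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


if __name__ == "__main__":
    print(lzjd_distance(b"abcabcabc", b"abcabcabd"))
```

A distance of 0 means the two parses produced identical sets and 1 means they share no substrings. Because nothing in the sketch depends on file format, it reflects the domain-knowledge-free property the abstract emphasizes.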



Published in
AISec '17: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security
November 2017
140 pages
ISBN:9781450352024
DOI:10.1145/3128572
General Chair:
Bhavani Thuraisingham, University of Texas at Dallas, USA
Program Chairs:
Battista Biggio, Pluribus One and University of Cagliari, Italy
David Mandell Freeman, Facebook Inc., USA
Brad Miller, Google Inc., USA
Arunesh Sinha, University of Michigan, Ann Arbor, USA
Copyright © 2017 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cyber security
lzjd
malware classification
shwel
Acceptance Rates
AISec '17 Paper Acceptance Rate: 11 of 36 submissions, 31%
Overall Acceptance Rate: 94 of 231 submissions, 41%