DOI: 10.1145/3128572.3140446
research-article

Malware Classification and Class Imbalance via Stochastic Hashed LZJD

Published: 03 November 2017

ABSTRACT

Few methods for malware classification can currently be applied without domain knowledge. In this work, we develop the SHWeL feature vector representation by extending the recently proposed Lempel-Ziv Jaccard Distance (LZJD). SHWeL vectors improve upon LZJD's accuracy, outperform byte n-grams, and allow us to build efficient algorithms for both training (a weakness of byte n-grams) and inference (a weakness of LZJD). SHWeL also allows us to directly tackle the class imbalance problem, which is common in malware-related tasks. Compared to existing methods like SMOTE, SHWeL provides significantly improved accuracy while reducing algorithmic complexity to O(N). Because our approach is developed without the use of domain knowledge, it can be easily re-applied to any new domain where byte sequences must be classified.
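To make the starting point concrete, the following is a minimal sketch of the LZJD idea that the abstract builds on: parse each byte sequence into a set of previously unseen substrings in the style of Lempel-Ziv, then compare two sets with the Jaccard distance. It computes the exact Jaccard distance rather than a min-hash approximation. The function `hashed_vector` is a hypothetical illustration of how such a set could be mapped to a fixed-length vector with the hashing trick; it is an assumption for illustration and not the SHWeL construction described in the paper. All function names and the `dim` parameter are invented here.

```python
import zlib


def lz_set(data: bytes) -> set:
    """LZ78-style parse: collect each shortest substring not seen before."""
    seen = set()
    start = 0
    end = 1
    while end <= len(data):
        piece = data[start:end]
        if piece not in seen:
            # New substring: record it and begin the next one after it.
            seen.add(piece)
            start = end
        end += 1
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """Exact Jaccard distance between the LZ sets of two byte sequences."""
    sa, sb = lz_set(a), lz_set(b)
    union = sa | sb
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(sa & sb) / len(union)


def hashed_vector(data: bytes, dim: int = 64) -> list:
    """Hypothetical: bucket the LZ set into a fixed-length count vector.

    Uses CRC32 as a deterministic hash (an assumption, not the paper's choice).
    """
    vec = [0.0] * dim
    for piece in lz_set(data):
        vec[zlib.crc32(piece) % dim] += 1.0
    return vec
```

Because the set is built from raw bytes, nothing in the sketch depends on the file format, which is the sense in which the approach needs no domain knowledge. A fixed-length vector, unlike a pairwise distance, can be fed to a standard linear classifier.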


Published in

AISec '17: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security
November 2017
140 pages
ISBN: 9781450352024
DOI: 10.1145/3128572

Copyright © 2017 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

AISec '17 paper acceptance rate: 11 of 36 submissions, 31%. Overall acceptance rate: 94 of 231 submissions, 41%.
