ABSTRACT
Few existing malware classification methods can be applied without domain knowledge. In this work, we develop the SHWeL feature vector representation by extending the recently proposed Lempel-Ziv Jaccard Distance (LZJD). SHWeL vectors improve upon LZJD's accuracy, outperform byte n-grams, and allow us to build efficient algorithms for both training (a weakness of byte n-grams) and inference (a weakness of LZJD). SHWeL also lets us directly tackle the class imbalance problem, which is common in malware-related tasks. Compared to existing methods such as SMOTE, SHWeL provides significantly improved accuracy while reducing algorithmic complexity to O(N). Because our approach is developed without domain knowledge, it can be readily re-applied to any new domain that requires classifying byte sequences.
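For readers unfamiliar with the Lempel-Ziv Jaccard Distance that the abstract builds on, the following is a minimal sketch of the exact (unhashed) distance: each byte sequence is turned into a set of Lempel-Ziv style sub-strings, and two sequences are compared by the Jaccard distance between their sets. This is not SHWeL itself, and it omits the min-hash approximation used in practice. The function names `lz_set` and `lzjd` are illustrative assumptions, not names from the paper.

```python
def lz_set(data: bytes) -> set:
    """Build a Lempel-Ziv style set: scan left to right and add each
    shortest sub-string that has not been seen before."""
    seen = set()
    start = 0
    end = 1
    while end <= len(data):
        chunk = data[start:end]
        if chunk not in seen:
            # New sub-string: record it and start the next one after it.
            seen.add(chunk)
            start = end
        end += 1
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """Exact Jaccard distance between the LZ sets of two byte sequences."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        # Both inputs empty: treat them as identical (an assumption of this sketch).
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


if __name__ == "__main__":
    print(lzjd(b"abababab", b"abababcd"))
```

Because the sketch operates on raw bytes only, it needs no knowledge of the file format, which is the property the abstract emphasizes.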
Index Terms
- Malware Classification and Class Imbalance via Stochastic Hashed LZJD