Automatic malware classification and new malware detection using machine learning

Liu, Liu; Wang, Bao-sheng; Yu, Bo; Zhong, Qiu-xi

doi:10.1631/FITEE.1601325

Published: 27 October 2017

Volume 18, pages 1336–1347, (2017)

Frontiers of Information Technology & Electronic Engineering

Liu Liu¹ (ORCID: orcid.org/0000-0002-6523-1454), Bao-sheng Wang¹, Bo Yu¹ & Qiu-xi Zhong¹

Abstract

The explosive growth of malware variants poses a major threat to information security. Traditional signature-based anti-virus systems fail to classify unknown malware into their corresponding families and to detect new kinds of malware programs. Therefore, we propose a machine-learning-based malware analysis system composed of three modules: data processing, decision making, and new malware detection. The data processing module extracts malware features from gray-scale images, opcode n-grams, and import functions. The decision-making module uses these features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20 000 malware instances, which were collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify the unknown malware with a best accuracy of 98.9% and successfully detect 86.7% of the new malware.
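
To make the ideas in the abstract concrete, the following Python sketch illustrates two of the feature types (a gray-scale view of raw bytes and opcode n-gram counts) and the shared-nearest-neighbor similarity that SNN clustering builds on. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the image width, the value of n, the neighborhood size k, and the toy data are all hypothetical, and the full SNN clustering algorithm adds density-based steps that are not shown here.

```python
from collections import Counter


def bytes_to_gray_rows(data: bytes, width: int = 16) -> list[list[int]]:
    """Treat each byte as one gray-scale pixel (0-255) and wrap into rows.

    The width is an assumed parameter; the preview does not state one.
    """
    rows = [list(data[i:i + width]) for i in range(0, len(data), width)]
    # Zero-pad the last row so the image is rectangular.
    if rows and len(rows[-1]) < width:
        rows[-1].extend([0] * (width - len(rows[-1])))
    return rows


def opcode_ngrams(opcodes: list[str], n: int = 2) -> Counter:
    """Count contiguous opcode n-grams in a disassembled opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def knn_sets(points: list[tuple[float, ...]], k: int) -> list[set[int]]:
    """For each point, the indices of its k nearest other points
    (squared Euclidean distance, brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        result.append(set(others[:k]))
    return result


def snn_similarity(neighbors: list[set[int]], i: int, j: int) -> int:
    """Number of nearest neighbors shared by i and j, counted only when
    each point appears in the other's neighbor list."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0


# Toy usage with made-up data: two well-separated groups of feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
neighbors = knn_sets(points, k=2)
same_group = snn_similarity(neighbors, 0, 1)   # points in the same group
cross_group = snn_similarity(neighbors, 0, 3)  # points in different groups
bigrams = opcode_ngrams(["push", "mov", "push", "mov", "call"], n=2)
image = bytes_to_gray_rows(b"MZ\x90\x00\x03", width=4)
```

In this toy data, samples from the same group share a neighbor while samples from different groups share none, which is the property a clustering step can use to separate a group of unfamiliar samples from known families.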

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, China
Liu Liu, Bao-sheng Wang, Bo Yu & Qiu-xi Zhong

Corresponding author

Correspondence to Liu Liu.

Additional information

Project supported by the National Natural Science Foundation of China (No. 61303264) and the National Basic Research Program (973) of China (Nos. 2012CB315906 and 0800065111001)

About this article

Cite this article

Liu, L., Wang, Bs., Yu, B. et al. Automatic malware classification and new malware detection using machine learning. Frontiers Inf Technol Electronic Eng 18, 1336–1347 (2017). https://doi.org/10.1631/FITEE.1601325

Received: 12 June 2016
Accepted: 14 September 2016
Published: 27 October 2017
Issue Date: September 2017
DOI: https://doi.org/10.1631/FITEE.1601325

Key words

CLC number

TP309.5
