Sub-curve HMM: A malware detection approach based on partial analysis of API call sequences

doi:10.1016/j.cose.2020.101773

Computers & Security

Volume 92, May 2020, 101773

https://doi.org/10.1016/j.cose.2020.101773

Abstract

Malicious software (malware) plays an important role in penetrating systems and extracting sensitive information. Based on dynamic monitoring of program behavior, existing solutions have shown that the Hidden Markov Model (HMM) is effective in detecting malware using sequences of API calls. However, an obfuscation technique could insert minimal data-stealing code into a large set of legitimate instructions, which makes the detector ineffective. Additionally, existing solutions require a whole picture of a program’s actions, and hence a small chunk of activities is much harder to detect. Substantial performance degradation can also occur during detection when a long sequence of API calls is used. This paper proposes the Sub-Curve HMM feature extraction approach, which focuses on matching subsets of activities from running processes that potentially lead to data exfiltration incidents. Sequences of API calls are used to train HMMs and to test the likelihood of a match with the model. Malicious and benign activities gain different matching scores over an adjoining set of API calls. By projecting a sequence of matching scores onto a curve, our approach discriminates malicious actions using discontinuities in the slope of the curve. The experimental results show that the proposed approach outperforms existing solutions in detecting six families of malware: the detection accuracy of Sub-Curve HMM is over 94%, compared to 83% for the baseline HMM approach and 73% for Information Gain.

Introduction

The high penetration of data-centric services, such as mobile computing or online transactions, has markedly increased the risk of exposing the sensitive data of legitimate users to sophisticated malware and tools (Ullah et al., 2018). The main purpose of such malware and tools is to access and steal sensitive data from users. Consequently, detecting malicious software is crucial to protecting users’ privacy. This is even more critical for specific organizational targets, such as government bodies or military authorities (e.g., leakage of a sub-contractor database (Conifer, 2017)). Recent ransomware outbreaks are good examples of the devastating consequences.

Some important limitations of existing malware detection approaches are: (1) signature-based antivirus software cannot detect malware whose signature is new to it, and (2) anomaly-based systems cannot detect variations in malware behavior and therefore cannot differentiate between legitimate and malicious activities. Intuitively, attackers develop malicious programs using various common techniques, which are performed using similar series of application programming interface (API) calls. Hence, observing the behavior of all running processes (to identify potential malware) is the key to detecting data breaches.

Existing work (Annachhatre, Austin, Stamp, 2015, Canfora, Mercaldo, Visaggio, 2016, Damodaran, Di Troia, Visaggio, Austin, Stamp, 2017) has shown that the Hidden Markov Model (HMM) can accurately discriminate between the behavior of malware and that of benign software. An HMM models a program’s activities as probabilities of sequences of API calls. Although previous studies (e.g., Annachhatre, Austin, Stamp, 2015, Ki, Kim, Kim, 2015) show promising accuracy in detecting malware, their experiments were carried out on short API sequences (i.e., around 280 instructions on average). Dynamic behavior analysis of malware can, however, generate very long sequences of API calls. Indeed, our experiments on Keylogger malware (MD5 hash value d4259130a53dae8ddce89ebc8e80b560) generated more than 300,000 series of API calls in less than 2 min. This study found that HMM performs poorly with long API sequences. This is because the HMM only gives a single matching probability P(O|λ) between the model λ and all data in the testing sequence O, computed over the whole API sequence. Since the matching score is computed using joint probabilities of each API instruction (INS) in a given sequence, malware could evade detection by posing as benign software for the majority of the time and performing malicious activities only for a brief period, which makes the final matching score smaller and indistinguishable from that of benign software. Nevertheless, to the best of our knowledge, existing work has not properly addressed the impact of the length of API sequences on detection accuracy. Hence, this work focuses on improving the detection accuracy for long API sequences, where the API call datasets vary from around 5,000 to nearly 80,000 instructions per API sequence on average.
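To make this dilution effect concrete, the following minimal Python sketch computes log P(O|λ) with the standard scaled forward algorithm and compares per-call scores for three sequences. The two-state model, its probabilities, the three-symbol API alphabet (0 for a benign-looking call, 1 and 2 for suspicious calls), and the sequence lengths are all illustrative assumptions, not values taken from the experiments reported here.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Return log P(O | lambda) for a discrete HMM (scaled forward algorithm).

    Assumes obs is non-empty and every symbol has non-zero probability
    under the model, so the scaling factor never becomes zero.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t, symbol in enumerate(obs):
        if t > 0:
            # Propagate one step through the transition matrix, then emit.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                for j in range(n)
            ]
        scale = sum(alpha)
        log_p += math.log(scale)            # accumulate log of scaling factors
        alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
    return log_p


# Hypothetical "malware" model: both states favor the suspicious symbols 1 and 2.
START = [0.5, 0.5]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]


def per_call(obs):
    """Average log-likelihood per API call under the hypothetical model."""
    return log_likelihood(obs, START, TRANS, EMIT) / len(obs)


burst = [1, 2] * 10                        # 20 suspicious calls
benign = [0] * 2000                        # 2,000 benign-looking calls
mixed = [0] * 1000 + burst + [0] * 1000    # burst hidden in benign activity

avg_burst = per_call(burst)
avg_benign = per_call(benign)
avg_mixed = per_call(mixed)

print(f"burst only : {avg_burst:.4f}")
print(f"benign only: {avg_benign:.4f}")
print(f"mixed      : {avg_mixed:.4f}")
```

Under these assumed parameters, the burst on its own scores well above the benign sequence per call, whereas the per-call score of the mixed sequence is almost identical to the benign one. This is the evasion scenario described above: a whole-sequence score hides a brief malicious fragment.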

This paper describes the proposed Sub-Curve HMM approach, which addresses the HMM limitations by focusing on subsets of API call sequences. It looks for subsets of matching patterns rather than the average probabilities from the whole sequence. Specifically, the proposed approach observes subsets of a sequence of API INS and looks for high matching probabilities with the training sequences. These probability (i.e., likelihood) scores are converted to a curve. The proposed approach then uses changing trends (or slope) of the curve to detect discontinuities in the series of probability scores; the resulting segments are called Sub-Curves. These curves are selected as features, and they are used by classifiers to detect fragments of malware’s suspicious behaviors. The experimental results show that the proposed approach outperforms existing ones in malware detection accuracy, especially for the Keylogger family, which had the longest average API sequence length in our study (76,101 INS). The detection accuracy is over 94%, compared to the baseline HMM accuracy of 83%, which uses the whole API sequence.
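The idea of scoring subsets, projecting the scores onto a curve, and cutting the curve at slope discontinuities can be sketched as follows. This is only one plausible reading of the description above, not the exact algorithm of this paper: the fixed window size, the step, the `jump` threshold, the highest-mean selection rule, and the toy two-state model are all hypothetical choices made for illustration.

```python
import math


def avg_log_likelihood(obs, start, trans, emit):
    """Per-call log P(O | lambda) for a discrete HMM (scaled forward algorithm).

    Assumes obs is non-empty and all emission probabilities are non-zero.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    total = 0.0
    for t, symbol in enumerate(obs):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                for j in range(n)
            ]
        scale = sum(alpha)
        total += math.log(scale)
        alpha = [a / scale for a in alpha]
    return total / len(obs)


def score_curve(seq, model, window, step):
    """Score each window of the API sequence; the list of scores is the curve."""
    return [
        avg_log_likelihood(seq[s:s + window], *model)
        for s in range(0, len(seq) - window + 1, step)
    ]


def split_sub_curves(curve, jump):
    """Cut the curve wherever the slope between neighbouring points exceeds jump.

    Returns half-open (begin, end) index ranges into the curve.
    """
    segments, begin = [], 0
    for k in range(1, len(curve)):
        if abs(curve[k] - curve[k - 1]) > jump:  # discontinuity in the slope
            segments.append((begin, k))
            begin = k
    segments.append((begin, len(curve)))
    return segments


# Hypothetical malware model over a 3-symbol API alphabet (0 = benign-looking).
model = (
    [0.5, 0.5],                              # start probabilities
    [[0.7, 0.3], [0.4, 0.6]],                # transition matrix
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],      # emission matrix
)

# 40 suspicious calls hidden between two runs of 200 benign-looking calls.
seq = [0] * 200 + [1, 2] * 20 + [0] * 200

curve = score_curve(seq, model, window=20, step=20)
segments = split_sub_curves(curve, jump=0.5)

# Keep the sub-curve with the highest mean matching score as the candidate.
best = max(segments, key=lambda s: sum(curve[s[0]:s[1]]) / (s[1] - s[0]))
print("sub-curves:", segments)
print("most suspicious sub-curve (window indices):", best)
```

In this toy setting the curve is flat over the benign windows and jumps where the suspicious calls begin and end, so the middle segment is isolated even though it covers under 10% of the sequence. In practice the window size and threshold would need tuning, and the extracted sub-curves would be passed to a classifier rather than selected by a simple maximum.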

This paper is organized as follows. Section 2 describes the proposed approach in detail. Section 3 shows the experimental setup, and the results are discussed in Section 4. Related work and conclusion are presented in Sections 6 and 7 respectively.

Section snippets

The sub-curve HMM approach

This section describes the Sub-Curve HMM approach, which aims to extract shorter sequences of API calls as features to improve the accuracy and efficiency of HMM-based dynamic malware detection.

Fig. 1 depicts the architecture of our approach. It consists of components for feature extraction and classification. The executable binaries of malware and benign programs are used for training and testing. There are three steps in selecting discriminating feature vectors: API feature extraction,

Evaluation

This section describes the experimental environment and the steps conducted. The intent of this section is to benchmark the efficiency of the proposed method against existing relevant approaches.

Experimental results and comparisons

This section summarizes the experimental results of every component of the Sub-Curve HMM approach and compares them with baseline methods. Section 4.1 discusses the results from the SC-HMM approach, and the outcomes from ABC-HMM are presented in Section 4.2. Finally, outputs from the complete Sub-Curve HMM implementation (SABC-HMM) are analyzed in Section 4.3.

Discussion

The limitations of the proposed approach and the computational complexity of its overheads are discussed in this section.

Related work

Existing malware detection approaches can be generally categorized into investigative, preventive, and detective approaches (Alneyadi, Sithirasenan, Muthukkumarasamy, 2016, Ullah, Edwards, Ramdhany, Chitchyan, Babar, Rashid, 2018). The investigative approach identifies security flaws by trying to reconstruct data breach events – which are used to develop a security patch. Various investigative solutions have been proposed, such as watermarking techniques (Melkundi and Chandankhede, 2015) and

Conclusion

This paper proposes a feature extraction approach based on HMM that makes use of API call sequences. The main contribution of the proposed method is sub-contained behavior extraction. This allows small pieces of malicious activities contained in a long sequence of observation to be detected.

We have presented limitations of detection efficiency of HMM and information gain techniques, especially when long API call sequences were examined. Compared to existing techniques, our work outperforms

CRediT authorship contribution statement

Jakapan Suaboot: Conceptualization, Methodology, Software, Validation, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. Zahir Tari: Conceptualization, Methodology, Resources, Writing - review & editing, Supervision, Project administration, Funding acquisition. Abdun Mahmood: Conceptualization, Methodology, Validation, Writing - review & editing, Supervision. Albert Y. Zomaya: Writing - review & editing, Funding acquisition. Wei Li: Writing -

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

We would like to thank the ARC (Australian Research Council) for their support of the work carried out in this paper. The ARC Linkage project is LP160100406, and we would also like to thank ATMC for their financial and technical support as the industry partner.

Jakapan Suaboot is a Ph.D. candidate in cybersecurity at RMIT University. He received the B.Eng and M.Eng degrees in Computer Engineering from Prince of Songkla University (Thailand) in 2007 and 2010 respectively. He worked as a lecturer from 2010 to 2017 with the Department of Computer Engineering, Prince of Songkla University, Phuket, Thailand. His research interests include malware detection, data breach prevention and machine learning technologies.

References (67)

S. Alneyadi et al. A survey on data leakage prevention systems. J. Netw. Comput. Appl. (2016)

G. Cantrell et al. Experiments in hiding data inside the file structure of common office documents: a stegonography application. Proceedings of the 2004 International Symposium on Information and Communication Technologies (2004)

G.A. Fink. Markov Models for Pattern Recognition: From Theory to Applications (2014)

D. Oktavianto et al. Cuckoo malware analysis (2013)

Virus Total, 2018. Virustotal-free online virus, malware and url...

Bitdefender, 2020. Cybersecurity solutions for business and personal...

C. Annachhatre et al. Hidden Markov models for malware classification. J. Comput. Virol. Hacking Tech. (2015)

Avast Software s.r.o., 2020. Avg 2020 free...

Avira Operations GmbH & Co. KG., 2020. Avira...

A. Awad et al. Data leakage detection using system call provenance. Proceedings of 2016 International Conference on Intelligent Networking and Collaborative Systems (2016)

Baidu, 2016. Baidu...

L.E. Baum et al. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. (1966)

L. Breiman. Random forests. Mach. Learn. (2001)

Broadcom Inc., 2020. Symantec - global leader in next generation cyber...

D. Canali et al. A quantitative study of accuracy in system call-based malware detection. Proceedings of the 2012 International Symposium on Software Testing and Analysis (2012)

G. Canfora et al. An HMM and structural entropy based detector for Android malware: an empirical study. Comput. Secur. (2016)

Check Point Research. H2 2017 Global Threat Intelligence Trends Report

S.S. Keerthi et al. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput. (2001)

A. Kharraz et al. Unveil: a large-scale, automated approach to detecting ransomware. Proceedings of 25th USENIX Security Symposium (2016)

Y. Ki et al. A novel approach to detect malware based on API call sequence analysis. Int. J. Distrib. Sensor Netw. (2015)

B. Kolosnjaji et al. Deep learning for classification of malware system call sequences. Proceedings of Australasian Joint Conference on Artificial Intelligence (2016)

Kotler, I., Klein, A., 2017. The adventures of av and the leaky...


Zahir Tari is a full professor in Distributed Systems at RMIT University (Australia). He received a bachelor’s degree in mathematics from the University of Algiers (USTHB, Algeria) in 1984, an M.Sc. in Operational Research from the University of Grenoble (France) in 1985 and a Ph.D. degree in Computer Science from the University of Grenoble (France) in 1989. Zahir’s expertise is in the areas of system performance (e.g. P2P, Cloud, IoT) as well as system security (e.g. SCADA, Smart Grid, Cloud, IoT). He is the co-author of six books (John Wiley, Springer) and he has edited over 25 conference proceedings. Zahir is also a recipient of over $8M in funding from the ARC (Australian Research Council) and was lately part of a successful 7th Framework AU2EU (Australia to European) bid on Authorization and Authentication for Entrusted Unions. Finally, Zahir was an associate editor of the IEEE Transactions on Computers (TC), IEEE Transactions on Parallel and Distributed Systems (TPDS) and IEEE Magazine on Cloud Computing.

Abdun Naser Mahmood received the BSc degree in applied physics and electronics from the University of Dhaka, Bangladesh, the MSc (research) degree in computer science from the University of Dhaka, Bangladesh, and the PhD degree from the University of Melbourne, Australia, in 1997, 1999, and 2008, respectively. He has been working as an academic in Computer Science since 1999. He is currently an Associate Professor in the Dept. of Computer Science and IT, La Trobe University. Previously he also worked in the School of Engineering and IT, University of New South Wales (2012-2017), and Royal Melbourne Institute of Technology (2008-2011). He has been an Assistant Professor with the University of Dhaka (2000-2010). His research interests include data mining techniques for data stream analysis, data exfiltration detection, scalable network traffic analysis, anomaly detection, and industrial SCADA security. He has published his work in various IEEE Transactions and A-tier International Journals and Conferences. He is a senior member of the IEEE.

Albert Y. Zomaya is the Chair Professor of High-Performance Computing & Networking in the School of Information Technologies, University of Sydney, and he also serves as the Director of the Centre for Distributed and High-Performance Computing. Professor Zomaya has published more than 550 scientific papers and articles and is author, co-author or editor of more than 20 books. He is the Founding Editor in Chief of the IEEE Transactions on Sustainable Computing and serves as an associate editor for more than 20 leading journals. Professor Zomaya served as an Editor in Chief for the IEEE Transactions on Computers (2011-2014). He is a Chartered Engineer, a Fellow of AAAS, IEEE, and IET. Professor Zomaya’s research interests are in the areas of parallel and distributed computing and complex systems.

Wei Li received his PhD degree from School of Information Technologies at The University of Sydney. He is currently a research associate in Centre for Distributed and High-Performance Computing, The University of Sydney. His research interests include Internet of Things, edge computing, sustainable computing, task scheduling, energy efficiency and optimization. He is the recipient of four IEEE or ACM conference best paper awards. He received the IEEE TCSC Award for Excellence in Scalable Computing for Early Career Researchers (2018) and the IEEE Outstanding Leadership Award (2018). He is a senior member of the IEEE Computer Society and the IEEE, and a member of the ACM.
