Elsevier

Computers & Security

Volume 52, July 2015, Pages 251-266

AMAL: High-fidelity, behavior-based automated malware analysis and classification

https://doi.org/10.1016/j.cose.2015.04.001

Abstract

This paper introduces AMAL, an automated and behavior-based malware analysis and labeling system that addresses shortcomings of existing systems. AMAL consists of two sub-systems, AutoMal and MaLabel. AutoMal provides tools to collect low-granularity behavioral artifacts that characterize malware usage of the file system, memory, network, and registry, and does so by running malware samples in virtualized environments. MaLabel uses those artifacts to create representative features, uses them to build classifiers trained on manually vetted samples, and uses those classifiers to classify malware samples into families with similar behavior. MaLabel also enables unsupervised learning by implementing multiple clustering algorithms for grouping samples. An evaluation of both AutoMal and MaLabel based on medium-scale (4000 samples) and large-scale (more than 115,000 samples) datasets, collected and analyzed by AutoMal over 13 months, shows AMAL's effectiveness in accurately characterizing, classifying, and grouping malware samples. MaLabel achieves a precision of 99.5% and a recall of 99.6% for the classification of certain families, and more than 98% precision and recall for unsupervised clustering. Several benchmarks, cost estimates, and measurements highlight the merits of AMAL.

Introduction

Malware classification and clustering is an old problem that many industrial and academic efforts have tackled. Two broad techniques are commonly used for malware detection and are also applied to classification: signature-based (Rieck et al., 2011, Tian et al., 2008, Kinable and Kostakis, 2011) and behavior-based (Park et al., 2010, Rieck et al., 2008, Zhao et al., 2010, Perdisci et al., 2010, Strayer et al., 2008) techniques. Signature-based techniques use a common sequence of bytes that appears in the binary code of a malware family to detect and identify malware samples. While signature-based techniques are very fast, since they do not require running the sample to identify it, they are sometimes inaccurate, can be thwarted using obfuscation, and require prior knowledge, including a set of known signatures associated with the tested families.

For example, antivirus companies use static signatures to detect known malware, which may completely miss zero-day malware that has not been seen before (Bilge and Dumitras, 2012). Malware has also become more sophisticated, using polymorphic obfuscation, packing, and code rearranging to thwart antivirus signatures (Sharif et al., 2009). Antivirus companies also use heuristic signatures to detect and classify known malicious behavior. However, heuristics often group families of malware together and give them generic labels, which are not useful for most security and intelligence applications. Moreover, the labels that antivirus vendors give to the same malware sample may vary, and there are many inconsistencies among vendors across malware families (Bailey et al., 2007).

The behavior-based approach uses artifacts the malware creates during execution. While this approach is more expensive, since it requires running the malware sample to obtain artifacts and features for behavior characterization, it tends to have higher accuracy in characterizing malware samples. Also, behavior characterization is agnostic to the underlying code, so it can sidestep code obfuscation and polymorphism, and it relies on features that are somewhat easier to interpret. Thus, this technique does not require the reverse-engineering expertise needed to create signatures for signature-based techniques (Lee et al., 2011, Slowinska et al., 2011).

Indeed, several studies have used behavioral analysis for malware classification. The first work to do so is by Bailey et al. (2007), which shows that high-level features, namely the number of processes, files, registry records, and network events, can be used for characterizing and classifying (multi-class clustering) malware samples. However, the work can be improved in several ways. First, the technique uses only high-level features and misses explicit low-level and implicit features (left as future work). Second, it relies on a small number of samples for validation, and the only source of ground truth for those samples was the side channel of antivirus labeling. Third, it is limited to one clustering algorithm (hierarchical clustering with the Jaccard index for similarity), and it is unclear how other algorithms perform on the same task. Last, it is intended only for clustering and does not consider binary classification problems. Binary classification, however, has a distinct business appeal.
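To make the clustering setup attributed to Bailey et al. concrete, the following is a minimal sketch of agglomerative (hierarchical) clustering driven by the Jaccard index. The sample names, the behavior sets, the single-linkage merge rule, and the 0.5 cut-off are all illustrative assumptions; the text above does not state the linkage or threshold used in that work.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|, taken as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def single_linkage(profiles: dict, threshold: float) -> list:
    """Repeatedly merge the two clusters whose closest members have the
    highest Jaccard similarity, until no pair reaches the threshold."""
    clusters = [{name} for name in profiles]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = max(jaccard(profiles[x], profiles[y])
                      for x in clusters[i] for y in clusters[j])
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]  # i < j, so index i stays valid
        del clusters[j]


# Hypothetical behavior profiles (sets of observed events per sample).
profiles = {
    "s1": {"create_file", "set_run_key", "dns_query"},
    "s2": {"create_file", "set_run_key", "dns_query", "tcp_443"},
    "s3": {"inject_process", "tcp_6667"},
    "s4": {"inject_process", "tcp_6667", "delete_file"},
}
clusters = single_linkage(profiles, threshold=0.5)
```

Here `s1` and `s2` share three of four distinct events (Jaccard 0.75) and `s3` and `s4` share two of three, so the sketch groups them pairwise, while the two groups share nothing and stay separate.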

More recently, Bayer et al. (2009) improved on the results of Bailey et al. (2007) in two ways. First, they used locality-sensitive hashing (LSH), a probabilistic dimensionality reduction method, for memory-efficient clustering. Second, instead of using high-level behavior characteristics, they proposed using low-level OS features based on API hooking for characterizing malware samples. While the technique is demonstrated to be effective, it has several shortcomings. First, malware samples can scan for installed drivers and uninstall or bypass the driver used for kernel logging. More importantly, rootkits (like TDSS/TDL and ZeroAccess, both of which are studied in our evaluation) are usually installed in the kernel, and the kernel logger can be blind to all of their activities (Strackx and Piessens, 2012). Finally, their work is tested on only one hierarchical clustering algorithm, does not handle binary classification, and relies on a small set of AV-labeled samples as ground truth, despite the inconsistencies of such labels noted in the same work (Bayer et al., 2009).
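As a rough illustration of how LSH avoids comparing every pair of samples, the sketch below uses MinHash signatures with banding, a common LSH scheme for Jaccard similarity. Whether Bayer et al. used this exact construction is not stated here, so the scheme, the function names, the parameters (32 hashes, 8 bands), and the sample data are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def minhash_signature(features: set, num_hashes: int = 32) -> list:
    """For each seeded hash function, keep the minimum hash value over the
    set's elements. Assumes a non-empty feature set."""
    return [
        min(int.from_bytes(
                hashlib.sha1(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features)
        for seed in range(num_hashes)
    ]


def lsh_candidates(signatures: dict, bands: int = 8) -> set:
    """Split each signature into bands; samples that agree on every row of
    at least one band share a bucket and become a candidate pair."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Hypothetical behavior profiles: "a" and "a_copy" behave identically,
# while "b" shares no behavior with either.
profiles = {
    "a": {"create_file", "set_run_key", "dns_query"},
    "a_copy": {"create_file", "set_run_key", "dns_query"},
    "b": {"inject_process", "tcp_6667"},
}
signatures = {name: minhash_signature(f) for name, f in profiles.items()}
candidates = lsh_candidates(signatures)
```

Only the candidate pairs need an exact similarity computation afterwards. Samples with identical behavior always collide, while samples with disjoint behavior collide only if the underlying hash values coincide, which is negligible here.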

Rieck et al. (2008) used the same API-hooking technique as Bayer et al. (2009) to collect artifacts and extract features that characterize malware samples. However, in addition to the shortcomings shared with Bayer et al. (2009), their technique suffers from low accuracy rates, perhaps due to their choice of features. While they match the highest accuracy we achieve, our lowest classification accuracy for a malware family is 20% higher than the lowest accuracy in their system. Their method is tested with only one classifier, while we provide insight into several other learning algorithms and how they succeed or fail in classifying samples.

Our work is motivated by Rieck et al. (2008), Bayer et al. (2009), and Bailey et al. (2007). It is worth noting that Bailey et al. (2007) does not provide any insight into accuracy, unlike Bayer et al. (2009), although the latter differs from our work in the aspects stated earlier. We considered an empirical comparison with Bayer et al. (2009) beyond timing measurements, but we believe it is impossible to do with their feature-level dataset, because we would need to generate the same set of features used in our system from their binaries. Even assuming the binaries could be obtained, there is no guarantee that running them would yield the same behavior profiles, given that the samples are five years old. Using their system to analyze our malware samples and generate comparable features was not possible either. Finally, a large part of our evaluation requires highly accurate labels, which are naturally obtained in our system and are not available for their samples.

In this paper, we introduce AMAL, a large-scale behavior-based solution for malware analysis and classification (both binary classification and clustering) that addresses the shortcomings of previous solutions. AMAL consists of two sub-systems, AutoMal and MaLabel. AutoMal builds on the prior literature in characterizing malware samples by their memory, file system, registry, and network behavior artifacts. MaLabel addresses the shortcomings and limitations of the prior work in practical ways. For example, unlike Bayer et al. (2009), MaLabel uses low-granularity behavior artifacts that can even characterize differences between variants of the same malware family. Moreover, given the wide range of MaLabel's functionality, which includes binary classification and clustering, it incorporates several techniques with several parameters and automatically chooses the combination that produces the best results.

To do that, and unlike the prior literature, MaLabel relies on analyst-vetted and highly accurate labels to train classifiers and to assist in labeling clusters grouped by unsupervised learning. The malware analysis and artifact collection part of AMAL (AutoMal) has been in production since early 2009, and it has enabled us to collect tens of millions, analyze several hundreds of thousands, and manually label several tens of thousands of malware samples, thus collecting in-house intelligence that goes beyond any related work in the literature. Unlike labeling (for training and validation) in the literature, which is subject to errors, our labeling was done by analysts who are domain experts, and human errors in their labeling are negligible. In this study, we evaluate MaLabel on a variety of datasets obtained from AutoMal and show the effectiveness of AMAL in analyzing, characterizing, classifying, and labeling malware samples.

Both binary classification and clustering are operationally important. Binary, supervised classification is expensive, since it requires training a model with solid ground truth and using representative artifacts of the families (classes) of interest, neither of which is trivial to obtain. However, the cost of binary classification is justified in our operational setting. The classification problem is interesting to us because the volume of malware we receive on a daily basis is larger than the capacity of our analysts. Classification enables us to train a model on a small set of known malware and extrapolate it to find new samples in the large volumes of malware we receive daily. The majority of our customers are large financial institutions, for whom the threat of banking Trojans, and specifically of new and unidentified variants of known families, is of particular interest. We use this classification system to identify malware variants of the same family based on their behavior and to inform our customers about new malware threats relevant to them. Another benefit of this approach is that we can ignore or give low priority to known insignificant malware families. For example, by identifying FakeAV, a family that tricks victims into purchasing a fake antivirus product by alerting them to fake infections on their system, we can use our classifier to filter all FakeAV samples out of our malware feed and focus on undiscovered threats that are relevant and interesting.

Clustering is interesting because it remains a challenging and open-ended problem. Clustering allows us to group samples of similar behavior together. To label the resulting clusters, we manually inspect the samples in each cluster and overlay the labels we have for identified malware on each cluster to identify the majority family in that cluster. Furthermore, we use memory signatures, like YARA rules, to tag a specific signature of a family based on its memory artifacts and then use that information to label clusters. Finally, in the rare cases where a cluster must be named and all other methods are exhausted, we use majority voting over the labels returned by a large number of antivirus scanners. The automatic labeling problem remains partly unsolved, as is the case in the literature (Bailey et al., 2007), and we leave improving on that for future work.
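The fallback of majority voting over antivirus labels can be sketched as follows. This is an illustration only: the function name, the strict-majority threshold, and the assumption that vendor labels have already been normalized to family names are not taken from the system described here.

```python
from collections import Counter


def majority_label(av_labels: list, min_share: float = 0.5):
    """Return the most common label if more than min_share of the scanners
    that reported a label agree on it; otherwise return None.

    Entries that are None or empty (scanner returned nothing) are ignored.
    """
    votes = Counter(label for label in av_labels if label)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count / sum(votes.values()) > min_share else None


# Hypothetical scanner output for the samples of one cluster.
cluster_name = majority_label(["zeus", "zeus", "zbot", "zeus", None])
```

In this example three of the four scanners that returned a label agree, so the cluster would be named `zeus`. An even split yields `None`, which leaves the cluster unnamed rather than guessing.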

Our novel contributions in this work are methodical. While a comparative study of various algorithms under various settings should accompany any application of machine learning to a security problem, this has unfortunately not been done in the prior work. Our system addresses many timely problems highlighted in the literature. Our contribution is not only the reliance on multiple algorithms but also a highly accurate evaluation, multiple fine-grained features for characterizing multiple families, a system that extracts those features, and a demonstration of its efficiency at scale. For classification, our system always matches the accuracy of state-of-the-art systems and improves on it in many settings (further details are in Section 3.1.1, AutoMal: behavior-based malware analyzer, and Section 3.1.2, MaLabel: automated labeling). Our clustering study shows the relevance and efficiency of off-the-shelf techniques: the prior work takes 138 min with LSH optimization to cluster 75k samples, while our system clusters 115k samples in under 1 h without optimization, at the expense of additional memory (Table 9). To sum up, our contributions are as follows:

  • We introduce AMAL, a fully automated system for analysis, classification, and clustering of malware samples. AMAL consists of two subsystems, AutoMal and MaLabel. AutoMal is a feature-rich, low-granularity, behavior-based artifact collection system that runs malware samples in virtualized environments and characterizes them by reporting memory, file system, registry, and network behavior. MaLabel uses artifacts generated by AutoMal to create features and then uses them to classify and cluster malware samples into families with similar characteristics. Both systems have been in production and have helped analyze and identify hundreds of thousands of malware samples. AutoMal by design follows several guidelines in Rossow et al. (2012) for safety, and MaLabel follows several guidelines for data and algorithm transparency, correctness, and realism.

  • Based on an in-house product of AMAL, we use both medium-scale and large-scale datasets to show AMAL's effectiveness by demonstrating more than 99% precision and recall in classification and more than 98% precision and recall in clustering malware. Our validation makes use of several algorithms and settings and demonstrates the practicality of our system at scale, even when using off-the-shelf algorithms.

The rest of this paper is organized as follows. In Section 2, we review the related literature. In Section 3, we describe our system in detail, including AutoMal, the automatic malware analysis sub-system, and MaLabel, the automated malware classification sub-system. In Section 4, we evaluate our system. In Section 5, we outline future work and give concluding remarks.

Related work

There has been plenty of work in the recent literature on the use of machine learning algorithms for classifying malware samples (Tian et al., 2009, Bailey et al., 2007, Rieck et al., 2011, Park et al., 2010, Tian et al., 2008, Rieck et al., 2008, Kinable and Kostakis, 2011, Ramilli and Bishop, 2010, Provos et al., 2007, Binsalleeh et al., 2010). These works are classified into two categories: signature based and behavior based techniques. Our work belongs to the second category of these works,

System design

The ultimate goal of AMAL is to automatically analyze malware samples and classify them into malware families based on their behavior. To that end, AMAL consists of two components, AutoMal and MaLabel. AutoMal is a behavior-based automated malware analysis system that uses memory and file system forensics, network activity logging, and registry monitoring to profile malware samples. AutoMal also summarizes such behavior into artifacts that are easy to interpret and use to characterize and

Evaluation

To evaluate the different algorithms in each application group, we use several accuracy measures to highlight the performance of various algorithms. Considering a class of interest, S, the true positive (tp) for classification is defined as all samples in S that are labeled correctly, while the true negative (tn) is all samples that are correctly rejected. The false positive (fp) is defined as all samples that are labeled in S while they are not, whereas the false negative (fn) is all samples
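The four counts can be computed for a class of interest S as in the following minimal sketch. The precision and recall formulas used are the standard definitions, tp / (tp + fp) and tp / (tp + fn), and are assumed here rather than quoted from this section; the label lists are hypothetical.

```python
def confusion_counts(true_labels, predicted_labels, s):
    """Count tp, fp, fn, tn with respect to the class of interest s."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == s and truth == s:
            tp += 1  # in S and labeled as S
        elif pred == s:
            fp += 1  # labeled as S but not in S
        elif truth == s:
            fn += 1  # in S but rejected
        else:
            tn += 1  # not in S and correctly rejected
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Standard precision and recall; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and classifier output for six samples.
truth = ["zeus", "zeus", "zeus", "other", "other", "other"]
predicted = ["zeus", "zeus", "other", "zeus", "other", "other"]
tp, fp, fn, tn = confusion_counts(truth, predicted, "zeus")
precision, recall = precision_recall(tp, fp, fn)
```

With these inputs, two of the three samples labeled `zeus` really belong to the family and two of the three true `zeus` samples are found, so precision and recall are both 2/3.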

Conclusion and future work

In this paper we introduced AMAL, the first operational large-scale malware analysis, classification, and clustering system. AMAL is composed of two subsystems, AutoMal and MaLabel. AutoMal runs malware samples in virtualized environments and collects memory, file system, registry, and network artifacts, which are used for creating a rich set of features. Unlike the prior literature, AutoMal combines signature-based techniques with purely behavior-based techniques, thus generating highly

References (49)

  • E. Alpaydin

    Introduction to machine learning

    (2004)
  • Anonymized for Review

    Amal: high-fidelity, behavior-based automated malware analysis and classification

    (2013)
  • M. Antonakakis et al.

    Building a dynamic reputation system for DNS

  • M. Antonakakis et al.

    Detecting malware domains at the upper DNS hierarchy

  • M. Bailey et al.

    Automated classification and analysis of internet malware

  • U. Bayer et al.

    Scalable, behavior-based malware clustering

  • B. Biggio et al.

    Poisoning attacks against support vector machines

  • L. Bilge et al.

    Before we knew it: an empirical study of zero-day attacks in the real world

  • L. Bilge et al.

    Exposure: finding malicious domains using passive DNS analysis

  • L. Bilge et al.

    Disclosure: detecting botnet command and control servers through large-scale netflow analysis

  • H. Binsalleeh et al.

    On the analysis of the Zeus botnet crimeware toolkit

  • A. Dinaburg et al.

    Ether: malware analysis via hardware virtualization extensions

  • M. Egele et al.

    A survey on automated dynamic malware-analysis techniques and tools

    ACM Comput Surv

    (Mar. 2008)
  • N. Falliere et al.

    Zeus: King of the bots

    (November 2009)
  • R.-E. Fan et al.

    Liblinear: a library for large linear classification

    J Mach Learn Res

    (2008)
  • C. Gorecki et al.

    Trumanbox: improving dynamic malware analysis by emulating the internet

  • G. Gu et al.

    Bothunter: detecting malware infection through IDS-driven dialog correlation

  • G. Gu et al.

    Botminer: clustering analysis of network traffic for protocol- and structure-independent botnet detection

  • G. Gu et al.

    Botsniffer: detecting botnet command and control channels in network traffic

  • T. Holz et al.

    Measuring and detecting fast-flux service networks

  • C.-Y. Hong et al.

    Populated IP addresses: classification and applications

  • C.-J. Hsieh et al.

    A dual coordinate descent method for large-scale linear SVM

  • G. Jacob et al.

    Jackstraws: picking command and control connections from bot traffic

  • J. Kinable et al.

    Malware classification based on call graph clustering

    J Comput Virology

    (2011)

    Aziz Mohaisen is a Senior Research Scientist at Verisign Labs. He obtained his Ph.D. degree from the Computer Science Department at the University of Minnesota – Twin Cities, in 2012. Before starting graduate school at Minnesota in 2009, he spent more than five years in South Korea, where he worked as a member of engineering staff (researcher) at ETRI, the largest government-backed research institute with fundamental contributions to the South Korean telecommunication revolution. Aziz is interested in applied research in the areas of network security and privacy. His current research projects are on intelligence-based security, information sharing and reputation, Internet measurements, and emerging networking technologies. A common theme in his recent research is the use of advanced machine learning techniques to understand code, traffic, and infrastructure performance in real-world deployments. His work has been featured in popular media such as New Scientist, CBS News, Slashdot, MIT Technology Review, and deepdotweb, among others. Aziz is a member of ACM and IEEE.

    Omar Alrawi is a senior software engineer at Qatar Computing Research Institute, part of Qatar Foundation, in Qatar. He obtained his B.Sc. and M.Sc. degrees both in computer science from the computer science department at Purdue University. Before joining QCRI, Omar was a security analyst at Verisign Inc and Booz Allen Hamilton. Omar is interested in malware analysis and the use of machine learning algorithms for security automation.

    Manar Mohaisen received his M.S. degree in communications and signal processing from the University of Nice-Sophia Antipolis, France, in 2005 and his Ph.D. from Inha University, Korea, in 2010, both in communications engineering. From 2001 to 2004, he was with the Palestinian Telecom. Co., where he was a cell planning engineer. Since Sept. 2010, he has been with the Department of EEC Engineering, KoreaTech, Korea, where he is an assistant professor. His research interests include 3GPP LTE/-A systems, MIMO detection and precoding, and social networks.

    1

    The views and opinions expressed in this paper are the views of the author, and do not necessarily represent the policy or position of Verisign, Inc.
