Malicious sequential pattern mining for automatic malware detection

https://doi.org/10.1016/j.eswa.2016.01.002

Highlights

  • An effective framework using a sequence mining technique is proposed for automatic malware detection.

  • An efficient sequential pattern mining algorithm is proposed for discovering patterns that discriminate between malware and benign samples.

  • A new nearest-neighbor classifier serves as the detection module to identify unknown malware.

  • The proposed framework achieves strong results compared with existing malware detection methods in detecting new malicious samples.

Abstract

Because of the damage it does to Internet security, malware (e.g., viruses, worms, trojans) and its detection have held the attention of both the anti-malware industry and researchers for decades. To protect legitimate users from attacks, the most significant line of defense against malware is anti-malware software, which mainly uses signature-based methods for detection. However, these methods fail to recognize new, unseen malicious executables. To solve this problem, we propose an effective sequence mining algorithm that discovers malicious sequential patterns from the instruction sequences extracted from the file sample set, and we then construct an All-Nearest-Neighbor (ANN) classifier for malware detection based on the discovered patterns. The resulting data mining framework, composed of the proposed sequential pattern mining method and the ANN classifier, characterizes the malicious patterns in the collected file sample set well and effectively detects new, unseen malware samples. A comprehensive experimental study on a real data collection is performed to evaluate our detection framework. The experimental results are promising and show that our framework outperforms alternative data-mining-based detection methods in identifying new malicious executables.

Introduction

Malware, short for malicious software, is software designed to damage or destroy computers without the owners’ permission (Schultz, Eskin, Zadok, & Stolfo, 2001). With the rapid development of information technology, malware has come to pose a serious threat to networks as well as computer systems. For instance, worms have increasingly threatened hosts and services by exploiting vulnerabilities in the largely homogeneous deployed software base (Sun & Chen, 2009). In addition, in online transactions, trojan horses often steal sensitive information from online users through website phishing (Abdelhamid, Ayesh, & Thabtah, 2014). Because of the enormous losses and adverse effects caused by malware, malware detection is a cyber security topic of great interest.

To protect legitimate users from attacks, the most significant line of defense against malware is anti-malware software, which mainly uses signature-based methods for detection (Griffin, Schneider, Hu, & Chiueh, 2009; Kephart & Arnold, 1994). In these scanning tools, unique signatures (sets of short, unique strings) are extracted from already known malicious files. An executable file is then identified as malicious if its signature matches one in the list of available signatures. This simple approach identifies known malware quickly and with a small error rate. However, it has two main disadvantages. First, extracting signatures is hard work that requires a great deal of time, funding and, more importantly, expertise. Second, the signature-based method is restricted to recognizing already known malware, so it is unreliable and ineffective against new, unseen malicious code. In fact, simple obfuscation techniques can easily bypass such signature-based detection. Besides, driven by economic benefits, today’s malware samples are created at a high rate (thousands per day). For example, Symantec reported that 21.7 million new pieces of malware were created in October 2015 (Symantec, 2015); according to the McAfee Labs threat report, there were more than 400 million total malware samples in the first quarter of 2015 (McAfee Labs, 2015).
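The matching step and its weakness against obfuscation can be illustrated with a minimal sketch. It assumes that a signature is a plain byte string and that matching is a substring search; the signature values, sample bytes and the name `matches_signature` are hypothetical and are not taken from any real scanner.

```python
from typing import Iterable

# Hypothetical signature database: short byte strings taken from known malware.
# These values are made up for illustration only.
KNOWN_SIGNATURES = [b"\xde\xad\xbe\xef", b"EVIL_PAYLOAD"]


def matches_signature(file_bytes: bytes, signatures: Iterable[bytes]) -> bool:
    """Return True if any known signature occurs as a substring of the file."""
    return any(sig in file_bytes for sig in signatures)


known_sample = b"header" + b"EVIL_PAYLOAD" + b"trailer"
# A trivially obfuscated variant: one byte of the payload string is changed.
variant_sample = b"header" + b"EVIL_PAYL0AD" + b"trailer"

print(matches_signature(known_sample, KNOWN_SIGNATURES))    # known sample is flagged
print(matches_signature(variant_sample, KNOWN_SIGNATURES))  # variant is not flagged
```

Changing a single byte is enough to defeat the exact match, which is why a method that learns more general patterns is attractive.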

To solve the above-mentioned problems, heuristic-based detection methods, which use data mining and machine learning techniques, have been developed to conduct intelligent malware detection. This approach aims to learn special patterns that capture the characteristics of malware. Generally, its detection process can be divided into two phases: feature extraction and classification. In the first phase, various features are extracted from malware samples via static or dynamic analysis to represent each file; in the second phase, classification techniques are applied to the extracted features to identify malware automatically. For instance, Schultz et al. (2001) extracted three different types of features (i.e., system resource information, printable strings and byte sequences) from the files, and then used them as inputs for Ripper, Naive Bayes and Multi-Naive Bayes to classify malware and benign files.

Since Application Programming Interface (API) calls represent the actions of an executable well, they are among the most effective features used by heuristic-based methods. Much research has been done based on API calls, including Hofmeyr, Forrest, and Somayaji (1998) and Ye, Wang, Li, Ye, and Jiang (2008). Other researchers have applied another meaningful feature (i.e., machine instructions) to detect malware, such as Santos et al. (2010), Shabtai, Moskovitch, Feher, Dolev, and Elovici (2012) and Runwal, Low, and Stamp (2012). Although these works demonstrate desirable detection results, they do not take the order of the features into consideration and thus fail to mine patterns that differ notably between malicious and benign files.

In this paper, we propose a new sequence mining algorithm to discover malicious sequential patterns based on the machine instruction sequences extracted from the Windows Portable Executable (PE) files, then use it to construct a data mining framework, called MSPMD (short for Malicious Sequential Pattern based Malware Detection), to detect new malware samples. The main contributions of this paper can be summarized as follows:

  • Well-represented features for malware detection: Instruction sequences are extracted from the PE files as the preliminary features, based on which the malicious sequential patterns are mined in the next step. The extracted instruction sequences indicate the potential malicious patterns well at the micro level. In addition, such features can be easily extracted and used to generate signatures for traditional malware detection systems.

  • Effective malicious sequential pattern mining algorithm: We propose an effective sequential pattern mining algorithm, called MSPE (Malicious Sequential Pattern Extraction), to discover malicious sequential patterns from instruction sequences. MSPE introduces an objective-oriented concept to learn patterns with a strong ability to distinguish malware from benign files. Moreover, we design a filtering criterion in MSPE that removes redundant patterns during mining in order to reduce processing time and search space. This strategy greatly enhances the efficiency of our algorithm.

  • All-Nearest-Neighbor (ANN) classifier for malware detection: We propose the ANN classifier as the detection module to identify malware. Unlike the traditional k-nearest-neighbor method, ANN chooses k automatically while the algorithm runs. More importantly, the ANN classifier is well matched with the discovered sequential patterns and obtains better results than other classifiers in malware detection.

  • Comprehensive experimental studies: We conduct a series of experiments to evaluate each part of our framework and the whole system on a real sample collection containing both malicious and benign PE files. The results show that MSPMD is an effective and efficient solution for detecting new malware samples.

The remainder of this paper is organized as follows: Section 2 introduces the related work. In Section 3, an overview of MSPMD is presented. Section 4 describes the method for instruction sequence feature extraction. Section 5 presents the proposed algorithm for malicious sequential pattern mining. Section 6 describes the ANN classifier for malware prediction based on the mined malicious sequential patterns. Experimental results are presented in Section 7. Finally, Section 8 concludes.

Section snippets

Related work

The signature-based method is widely used in the anti-malware industry for malware detection (Griffin et al., 2009). However, this classic method often fails to detect variants of known malware or previously unseen malware. The problem lies in the signature extraction and generation process, and in fact these signatures can be easily bypassed (Ye et al., 2008). For example, to evade the widely used signature-based detection, malware developers can employ techniques such as polymorphism and

System architecture

Fig. 1 shows the system architecture of the proposed malware detection framework MSPMD, which consists of three major components: instruction sequence extractor, malicious sequential pattern miner, and ANN (All-Nearest-Neighbor) classifier for malware prediction. We briefly describe each component below.

  1. Instruction sequence extractor: MSPMD first extracts instructions from training samples and transforms them into a group of 32-bit global IDs based on their lexicographical order. Then, a

Instruction sequence feature extraction

In the first step of MSPMD, each PE file will be transformed into an instruction sequence. These instructions are carefully chosen in order to distinguish malware from benign samples as much as possible; therefore, they can be viewed as the low-level (instruction-level) features representing the executables. In this section, we describe the method used to extract such features from the training sample set, which is implemented in two sub-steps.
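The two sub-steps are not reproduced in this excerpt. As a rough illustration of the ID mapping mentioned in the architecture overview (instructions transformed into 32-bit global IDs by lexicographical order), here is a minimal sketch. It assumes instructions are already available as mnemonic strings; the function names `build_id_table` and `encode` and the toy samples are hypothetical, and the instruction selection step is omitted.

```python
from typing import Dict, List


def build_id_table(samples: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct instruction an ID following lexicographic order."""
    vocabulary = sorted({ins for sample in samples for ins in sample})
    if len(vocabulary) > 2**32:
        # The IDs are meant to fit in 32 bits.
        raise ValueError("too many distinct instructions for 32-bit IDs")
    return {ins: idx for idx, ins in enumerate(vocabulary)}


def encode(sample: List[str], table: Dict[str, int]) -> List[int]:
    """Replace each instruction in a sample with its global ID."""
    return [table[ins] for ins in sample]


# Toy training samples (hypothetical), one instruction list per file.
training_samples = [["push", "mov", "call", "mov"], ["xor", "mov", "jmp"]]
id_table = build_id_table(training_samples)
encoded = [encode(sample, id_table) for sample in training_samples]

print(id_table)
print(encoded)
```

Working with integer IDs rather than strings keeps the later mining step cheap, since pattern comparison reduces to integer comparison.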

Malicious sequential pattern mining

In this section, we describe the MSPE algorithm for malicious sequential pattern mining. MSPE aims at discovering the discriminative malicious sequential patterns, which can be viewed as macro-level features to represent the executables.
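The details of MSPE are not given in this excerpt, so the following is only a sketch of the general idea of a discriminative sequential pattern, not the authors' algorithm. It assumes that a pattern matches a sequence when its items appear in order with gaps allowed, and it scores a pattern by the difference between its support in malware and in benign sequences. Both the gap assumption and the score are illustrative choices, and all names and data are hypothetical.

```python
from typing import List, Sequence


def contains_pattern(sequence: Sequence[int], pattern: Sequence[int]) -> bool:
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def support(pattern: Sequence[int], dataset: List[Sequence[int]]) -> float:
    """Fraction of sequences in `dataset` that contain `pattern`."""
    if not dataset:
        return 0.0
    return sum(contains_pattern(seq, pattern) for seq in dataset) / len(dataset)


def discriminative_score(pattern: Sequence[int],
                         malware: List[Sequence[int]],
                         benign: List[Sequence[int]]) -> float:
    """Illustrative score: support among malware minus support among benign files."""
    return support(pattern, malware) - support(pattern, benign)


# Toy instruction-ID sequences (hypothetical).
malware_seqs = [[3, 2, 0, 2], [3, 1, 0], [4, 3, 0]]
benign_seqs = [[0, 3, 2], [2, 2, 1]]
pattern = [3, 0]

print(discriminative_score(pattern, malware_seqs, benign_seqs))
```

Note that the benign sequence `[0, 3, 2]` contains both items of the pattern but in the wrong order, so it does not match. This is the kind of ordering information that order-agnostic features lose.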

ANN classifier for malware prediction

In this section, we propose the ANN classifier for malware detection based on the mined malicious sequential patterns. Unlike the traditional k-nearest-neighbor method (Han, Kamber, & Pei, 2006), ANN chooses k automatically while the algorithm runs.
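How ANN selects k is not described in this excerpt. For contrast, the sketch below shows only the traditional k-nearest-neighbor baseline with a fixed, user-supplied k. It assumes each file is represented as a binary vector recording which mined patterns it contains and uses Hamming distance; the representation, the distance, the name `knn_predict` and the data are all illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train: List[Tuple[List[int], str]], query: List[int], k: int) -> str:
    """Traditional kNN: majority vote among the k closest training samples."""
    neighbours = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical): pattern-presence vectors with labels.
train = [
    ([1, 1, 0], "malware"),
    ([1, 0, 1], "malware"),
    ([0, 0, 0], "benign"),
    ([0, 0, 1], "benign"),
    ([0, 1, 0], "benign"),
]
query = [1, 1, 1]

print(knn_predict(train, query, k=3))
print(knn_predict(train, query, k=5))
```

On this toy data the predicted label changes between k=3 and k=5, which shows why a fixed k must be tuned by hand and why choosing it automatically is appealing.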

Experimental results and analysis

In this section, we evaluate each part of our framework and the whole detection system MSPMD through a series of experiments and compare it with several existing methods. All experiments are conducted on Windows XP with an Intel T6600 2.20 GHz CPU and 2 GB of RAM.

Conclusion and future work

In this paper, we develop a data-mining-based detection framework called Malicious Sequential Pattern based Malware Detection (MSPMD), which is composed of the proposed sequential pattern mining algorithm (MSPE) and the All-Nearest-Neighbor (ANN) classifier. It first extracts instruction sequences from the PE file samples and conducts feature selection before mining; then MSPE is applied to generate malicious sequential patterns. For the testing file samples, after feature representation, ANN

Acknowledgments

L. Chen’s work was supported by the National Natural Science Foundation of China under Grant no. 61175123, and the Natural Science Foundation of Fujian Province of China under Grant no. 2015J01238.

References (37)

  • Egele, M., et al. (2012). A survey on automated dynamic malware-analysis techniques and tools. Computing Surveys.

  • Griffin, K., et al. (2009). Automatic generation of string signatures for malware detection. Proceedings of the 12th international symposium on recent advances in intrusion detection.

  • Guo, G., et al. (2003). KNN model-based approach in classification.

  • Han, J., et al. (2006). Data mining: Concepts and techniques.

  • Hofmeyr, S. A., et al. (1998). Intrusion detection using sequences of system calls. Journal of Computer Security.

  • Jain, M., et al. (2014). Techniques in detection and analyzing malware executables: A review. International Journal of Computer Science and Mobile Computing.

  • Kephart, J. O., et al. (1994). Automatic extraction of computer virus signatures. Proceedings of 4th virus bulletin international conference.

  • Lo, D., et al. (2009). Classification of software behaviors for failure detection: a discriminative pattern mining approach. Proceedings of the 15th international conference on knowledge discovery and data mining.