Chimera: An Android Malware Detection Method Based on Multimodal Deep
Learning and Hybrid Analysis
Abstract
The Android Operating System (OS) is everywhere: computers, cars, homes,
and, of course, personal and corporate smartphones. A recent survey from
the International Data Corporation (IDC) reveals that the Android
platform holds 85% of the smartphone market share. Its popularity and
open nature make it an attractive target for malware. According to
AV-TEST, by November 2020, 2.87M new Android malware instances were
identified in the wild. Malware detection is a challenging problem that
has been actively explored by both the industry and academia using
intelligent methods. On the one hand, traditional machine learning (ML)
malware detection methods rely on manual feature engineering that
requires expert knowledge. On the other hand, deep learning (DL) malware
detection methods perform automatic feature extraction but usually
require much more data and processing power. In this work, we propose a
new multimodal DL Android malware detection method, Chimera, that
combines manual and automatic feature engineering by using three DL
architectures, Convolutional Neural Networks (CNNs), Deep Neural Networks
(DNNs), and Transformer Networks (TNs), to perform feature learning from
raw data (Dalvik Executable (DEX) grayscale images), static analysis
data (Android Intents and Permissions), and dynamic analysis data (system
call sequences), respectively. To train and evaluate our model, we
implemented the Knowledge Discovery in Databases (KDD) process and used
the publicly available Android benchmark dataset Omnidroid, which
contains static and dynamic analysis data extracted from 22,000 real
malware and goodware samples. By leveraging a hybrid source of
information to learn high-level feature representations for both the
static and dynamic properties of Android applications, Chimera’s
detection Accuracy, Precision, Recall, and ROC AUC outperform those of
classical ML algorithms, state-of-the-art Ensemble and Voting Ensemble ML
methods, as well as unimodal DL methods using CNNs, DNNs, TNs, and
Long Short-Term Memory (LSTM) networks. To the best of our knowledge,
this is the first work that successfully applies multimodal DL to
combine those three different modalities of data using DNNs, CNNs, and
TNs to learn a shared representation that can be used in Android malware
detection tasks.
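
To make the "DEX grayscale image" modality concrete, the following is a minimal sketch of one common way to turn a byte stream into a 2D grayscale array: each byte becomes one pixel intensity (0-255), laid out row by row. The abstract does not specify how Chimera builds its images, so the function name bytes_to_grayscale, the fixed row width, and the zero-padding of the last row are assumptions made for illustration only.

```python
import math
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 256) -> List[List[int]]:
    """Map each byte to one pixel (0-255) in row-major order.

    The last row is zero-padded so every row has exactly `width` pixels.
    `width` and the padding scheme are illustrative assumptions, not
    values taken from the paper.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Hypothetical stand-in for a classes.dex file: the DEX magic bytes
# ("dex\n035\0") followed by a few filler bytes.
sample = b"dex\n035\x00" + bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
```

In practice the resulting rows would be resized to a fixed shape before being fed to a CNN; that step is omitted here because the abstract gives no image dimensions.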