Mitigating Metamorphic Malware Through Adversarial Learning Techniques

Babaagba, Kehinde O.; Tan, Zhiyuan

doi:10.3390/network6020022


by Kehinde O. Babaagba * and Zhiyuan Tan

School of Computing, Engineering and the Built Environment, Edinburgh Napier University, 10 Colinton Road, Edinburgh EH10 5DT, UK

* Author to whom correspondence should be addressed.

Network 2026, 6(2), 22; https://doi.org/10.3390/network6020022

Submission received: 10 December 2025 / Revised: 18 March 2026 / Accepted: 1 April 2026 / Published: 8 April 2026


Abstract

Antivirus (AV) solutions remain a core defence mechanism against malicious software. However, many of these engines struggle to detect metamorphic malware, which continually alters its internal form in unpredictable ways. To address this limitation, we present an adversarially oriented approach that automatically generates novel malicious variants of existing malware that evade detection by a substantial proportion of AV systems, thereby providing material for strengthening defensive techniques. In this work, an Evolutionary Algorithm (EA) is used to evolve undetectable variants, guided by three fitness criteria: the evasiveness of the produced samples, and their behavioural and structural similarity to the original malware. The proposed method is assessed across three malware families to evaluate the effectiveness of the EA-generated variants. Results indicate that the EA produces diverse mutant variants capable of evading up to 94% of AV detectors for a given malware family, significantly surpassing the evasion rate of the original malware. Furthermore, we evaluated whether the mutants produced by the EA could enhance the training of machine learning models. In this context, a pretrained Natural Language Processing (NLP) transformer was employed within a transfer learning framework to improve the classification of metamorphic malware. When the evolved variants were incorporated into the training data, the approach achieved classification accuracies of up to 93%. These results highlight the value of using diverse EA-generated samples to strengthen malware classifiers, thereby improving the robustness of security systems against evolving threats.

Keywords: metamorphic malware; antivirus engines; machine learning; adversarial learning; evolutionary algorithm

1. Introduction

The recent 2024 Threat Intelligence Report by CrowdStrike [1] revealed that there has been an increase in malicious attacks over the past year, with their insights on the cyber threat landscape showing that in 2023, 34 newly identified adversaries emerged. The fastest eCrime breakout time recorded was 2 min and 7 s. Additionally, there was a 75% increase in cloud intrusions. Furthermore, there was a significant surge in threats, such as identity threats. Leveraging generative AI, adversaries like SCATTERED SPIDER have adopted novel tactics to expedite infiltration, including phishing, social engineering, and purchasing genuine credentials from access brokers. In addition, the 2024 Threat Report by Sophos [2] indicated that over 90% of their reported attacks entail some form of data or credential compromise, spanning various methods such as ransomware incursions, data extortion schemes, unauthorised remote access, or straightforward data pilferage. There has been a notable rise observed by Sophos in the incidence of macOS-targeted information theft malware, indicating a trend likely to persist in the foreseeable future. Their earlier report in 2022 [3] also tracked the detection of 180 attack tools launched between 2020 and 2021 and revealed that several Android malware families went undetected by scanning tools employed by the Google Play Store.

Metamorphic malware has been the subject of substantial research, as it represents one of the most sophisticated and challenging categories of malicious software. These malware samples modify their internal code structure as they spread, enabling them to evade traditional static, signature-based detection techniques, and in extreme cases can give rise to variants that are statistically very difficult to detect [4]. In addition to structural transformation, they frequently employ code obfuscation strategies that hinder deeper static inspection and can circumvent dynamic analysis tools, such as emulators, by altering their runtime behaviour when they recognise that execution is taking place in an instrumented or controlled environment [5].

Although metamorphic malware affects many computing platforms, mobile devices have become a particularly common target. The majority of such devices run the Android operating system, which remains vulnerable to malicious applications. Reports such as [6] indicate that nearly all newly observed malicious programs in recent years originate from the Android ecosystem. This trend is partially attributable to the platform’s open-source nature and the limited built-in safeguards available to counter sophisticated attacks [7].

A wide range of malware detection approaches has been explored in the literature (e.g., [8,9]), with numerous studies focusing specifically on Android threats [10,11,12]. Among these, adversarial learning has emerged as a prominent technique for evaluating and improving malware detectors [13]. In this paradigm, systems are intentionally exposed to carefully crafted malicious inputs, referred to as adversarial samples, to identify vulnerabilities and increase model robustness. A key challenge, however, is the generation of suitable adversarial samples that can meaningfully support this training process.

This paper describes a complete end-to-end pipeline for generating a suite of novel, malicious malware samples and using these samples to improve the training of ML detection models. The proposed pipeline employs an Evolutionary Algorithm (EA) [14], a population-based meta-heuristic search method, to evolve previously unseen malicious variants from known malware, to produce samples capable of circumventing existing detection mechanisms. The newly generated mutants are used to augment existing datasets and to improve trained ML models. Some preliminary findings of this research have been disseminated in [15,16,17]. This article serves as the inaugural comprehensive exposition of the entire pipeline. All components of the proposed framework are cohesively integrated and presented for the first time.

In addition to providing a complete description of the framework, this paper extends the work in [15] by examining how an EA can generate variants from three malware families selected based on their malicious payload. Furthermore, supplementary analyses were conducted to evaluate the diversity of the variants generated from multiple executions of the EA. This investigation aimed to determine whether the variants produced across different runs exhibited distinct characteristics and to assess their diversity based on three specific metrics: (1) structural dissimilarity among the variants, (2) behavioural dissimilarity, and (3) the variants’ efficacy in evading several established detection mechanisms. By examining these dimensions, insights were gained into the variability and adaptability of the variants produced by the EA under varying conditions. The performance of the EA in locating evasive variants is also evaluated against a Random Search baseline, which attempts to identify such variants without guided evolutionary pressure, in order to justify the need for the meta-heuristic search algorithm. Additionally, this study extends the work in [16] by examining whether training data augmented with EA-generated mutant samples can further improve the classification of metamorphic malware. Unlike [16], which evaluated only binary classifiers, the present analysis compares both binary and multi-class models to assess the broader impact of incorporating evolved variants into the training process. A transformer, Bidirectional Encoder Representations from Transformers (BERT), that had been pretrained on a large NLP dataset was then used to improve the classification of metamorphic malware. Unlike other works such as [18], this work focused on the generation of mutable malware capable of evading conventional detection methods and the use of these samples to train models for identifying emerging malware variants. By leveraging transfer learning, the approach achieved effective detection performance even with a constrained training dataset.

The study was designed to investigate the following five central research questions:

1. How does a fitness function used in the EA influence the capability of the proposed method to discover new evasive variants?
2. How diverse is a set of newly produced variants with respect to the range of their behavioural signatures and their structural similarity in comparison with the original malware?
3. How do the 63 mainstream antivirus products (i.e., antivirus engines) differ in their ability to detect the newly generated malware variants across each of the malware families evaluated?
4. Which well-known Machine Learning (ML) models are improved most significantly in the classification of metamorphic malware when trained with the newly produced mutants?
5. Can a transformer model such as BERT, which has been pretrained on large-scale NLP corpora, be applied in a transfer learning setting to enhance the classification of metamorphic malware when the newly generated mutants are incorporated into the training data?

Throughout this work, the term “adversarial” is used in the broader cybersecurity sense of generating malicious variants to stress-test defensive systems, rather than in the strict machine-learning sense of crafting classifier-aware perturbations. The proposed method is therefore best viewed as an evolutionary, mutation-based malware diversification process whose outputs serve as diverse malicious samples for training and evaluation, rather than adversarial examples defined with respect to a model’s decision boundary.

The remainder of this paper is organised as follows. Section 2 outlines the background and context of the study. Section 3 describes the learning process involved in defending against metamorphic malware using adversarial samples, while Section 4 explains the experimental approach and discusses the results obtained. Section 5 concludes the paper and outlines potential future research.

2. Related Work

The application of machine learning (ML) to malware detection has progressed rapidly in recent years, driven in large part by developments in deep learning and the increasing use of adversarial learning methodologies. Existing studies generally fall into two major directions: (1) the generation of adversarial or mutation-based malware variants that can evade detection systems, and (2) the design of more resilient detection approaches capable of identifying evolving malware under constrained data conditions [16,19,20,21].

Metamorphic malware represents one of the most sophisticated and dangerous categories of malicious software. Unlike traditional malware that relies on static signatures, metamorphic malware continuously alters its internal code structure across generations while retaining its original malicious intent. Two malware instances, $mal$ and $mal'$, are considered metamorphic if both execute equivalent malicious actions despite having distinct binary or code representations [5]. This transformation is achieved using several mutation techniques, such as swapping registers, substituting instructions, reordering operations, permuting subroutines, inserting junk or padding code, and applying grammar-based mutation rules [22]. Such techniques make static or signature-based detection unreliable, as each new generation of malware appears structurally different from its predecessors.

Various ML-based methods have been proposed to detect metamorphic malware, spanning deep learning approaches, such as Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs), as well as ensemble-based classification methods that combine multiple weak classifiers for improved robustness [23]. However, these approaches often struggle with generalisation due to code obfuscation and stochastic transformations. Prior studies have sought to mitigate these challenges through alternative feature extraction and analysis methods. For example, some researchers employ text-mining algorithms and dynamic behavioural analysis to derive aggregated signatures [24], while others use structural entropy [25] and malware normalisation strategies [26] to identify transformation patterns. Although these techniques achieve high accuracy on known malware datasets, their performance degrades significantly when facing unseen metamorphic variants that modify their code between generations.

To evaluate the robustness of antivirus systems, several benchmarking platforms have been proposed. The ADAM framework [27] automatically generates obfuscated malware variants to assess antivirus engines, while DroidChameleon [28] performs similar stress tests on Android malware using a wider range of automated mutation techniques. These tools reveal that even modern antivirus systems are vulnerable to polymorphic and metamorphic code obfuscations. Traditional heuristic and ML-based methods, such as Sequential Pattern Mining [29], Support Vector Machines (SVMs) [30], and DNNs [18], have been applied in metamorphic malware detection. Many of these models operate as “black boxes,” under the assumption that attackers lack insight into their internal mechanisms. However, this assumption no longer holds, as adversaries increasingly employ probing and adversarial strategies to exploit model vulnerabilities and generate evasive malware [31].

Adversarial learning has become a viable method for strengthening the robustness of malware detection systems. By generating adversarial variants that expose weaknesses in existing ML models and retraining detectors on these challenging samples, it is possible to improve their resilience against evasive malware [32]. Some studies have explored evolutionary algorithms, including Genetic Programming (GP) and Genetic Algorithms (GA), to create new metamorphic variants for adversarial training [33,34,35]. For instance, GP has been used to mutate mobile and PDF malware, optimising fitness functions based on detection evasion rates. Similarly, GA-based approaches have evolved Windows malware variants to assess and enhance detector robustness. However, these approaches primarily focus on evasiveness rather than diversity among generated variants. Consequently, the resulting samples may improve short-term robustness but fail to enhance generalisation across unseen malware families.

Recent advances have shown that transfer learning can significantly enhance malware detection performance when limited labelled data are available [36,37]. Leveraging pre-trained models has led to effective feature reuse and reduced training costs in malware classification. Despite this progress, two major research gaps persist. First, limited work has focused on systematically generating diverse mutable malware samples that both evade detection and strengthen ML model generalisation. Second, few studies have effectively integrated transfer learning with adversarial generation to achieve robust malware detection under data scarcity. The present study addresses these gaps by (1) generating metamorphic malware variants through evolutionary and adversarial learning strategies to enrich detector training datasets, and (2) applying transfer learning to enable efficient and generalisable malware detection with limited training samples.

3. Methodology (Learning to Defend Against Metamorphic Malware Using Adversarial Samples)

This section presents a generic platform-independent framework, as shown in Figure 1, for the generation of executable malicious mutants to predict the future behaviours of metamorphic malware. The framework is adaptable in nature, and its individual components can be modified to suit a range of different applications. The process of generating adversarial samples includes the following steps:

Step one: An understanding of the current samples is essential. This includes understanding what kind of data is required for analysis. For instance, the samples might comprise mobile malware, such as an Android malware source. The framework can also be adapted to other classes of malware, depending on the underlying platform in which the samples run, for example, desktop-focused threats, or to other operating systems, including iOS, where suitable intermediate representations and analysis tools are available.
Step two: The malware is reverse engineered into a form from which variants can be easily created, which requires a disassembly tool. Apktool [38] is employed in this work to disassemble Android applications (APK files) into their smali representation. Alternative disassembly utilities may be used depending on the platform on which the malware executes, for example, Portable Executable (PE) viewers (e.g., CFF Explorer [39]) for desktop-based malware.
Step three: At this stage, the disassembled malware is modified under the guidance of a fitness function to produce new mutant variants. The smali representation is treated as editable code, and mutation operators are applied to introduce controlled transformations. Although this process can, in principle, be generalised to any code-like intermediate representation, the specific operators must be tailored to the syntax and semantics of the target platform. In this work, transformations were informed by properties of the malware, such as structural layout, behavioural profile, and its ability to evade existing detection engines. Other characteristics could equally be incorporated to guide the search for malicious and diverse mutants.
Step four: The final stage employs the mutants produced in Step three to train machine-learning models to improve protection against future variants. This stage does not require designing new ML architectures; standard models with well-understood complexities can be employed. In this study, both feature-based and sequence-based classifiers were used, although alternative ML models could also be applied within the same framework.

The specific details of the software components used in this research are given below. The original malware used to evolve variants is packaged as an APK file. This is disassembled using apktool to obtain smali files [40] (smali being the assembly language used by the Android Dalvik Virtual Machine). The smali code is derived from the decompilation of the .dex file in the APK.

To test whether variants are executable, the modified smali code is rebuilt into an APK using apktool, followed by signing with apksigner and alignment via zipalign. The aforementioned steps are necessary to generate an APK. The resulting APK is then run on an Android emulator to check that it executes. Finally, once a set of mutants has been evolved, they are tested to determine whether they are still malicious using Droidbox [41]. Droidbox is an Android-oriented sandbox that supports the dynamic analysis of APK files. It observes the runtime behaviour of an application by executing it within an instrumented environment. When an APK is submitted to Droidbox, it is run inside the sandbox, which monitors and records a range of behavioural indicators, including file creation, deletion, and downloads, network connections initiated, and traces of invoked API calls, among other relevant events.

In this work, the mutants are generated using an EA-based mutation engine. It can be seen from Figure 2 that the original malware is first disassembled and then converted to smali files using apktool. The EA module is responsible for generating the mutated samples and optimising them with respect to evasiveness and their behavioural and structural resemblance to the parent malware. Once candidate APKs are produced, they are subjected to a maliciousness check to confirm that they still exhibit harmful behaviour. Only those variants that retain their malicious characteristics are stored and later used as training inputs for the machine-learning models.

The EA operates by applying smali-level mutation operators to evolve executable malware variants; it does not perturb classifier decision surfaces directly. The generated mutants are thus evolved malware samples rather than adversarial examples in the strict ML sense. Their usefulness for ML stems from the behavioural and structural diversity they introduce into the training distribution, rather than optimisation against a classifier boundary.

3.1. Evolutionary Algorithm

The EA used in this work is given in Algorithm 1, as presented in previous work [15].

Algorithm 1 Evolutionary Algorithm [15]

1: $P \leftarrow$ initialise_random_population() ▹ created by mutating original malware
2: evaluate($P$) ▹ score each malware variant based on the chosen fitness measure
3: $f_{w}$ = min($P$) ▹ the worst member of the population
4: while Maximum number of iterations not reached do
5:     $p \leftarrow$ select_parent($P$)
6:     $c \leftarrow$ mutate($p$)
7:     $f_{c} \leftarrow$ evaluate($c$)
8:     if $f_{c} > f_{w}$ then
9:         $P \leftarrow P \cup c$ ▹ add child to the population
10:         remove_worst() ▹ remove least fit mutant
11:         $f_{w}$ = min($P$)
12:     end if
13: end while
14: return $P$

The EA starts with an initial population composed of randomly generated malicious mutants. The algorithm then proceeds iteratively: parents are chosen from this population, mutated to produce offspring, and the resulting variants are evaluated. The mutation operators employed in the EA are described in Section 3.4.
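The loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the engine used in this work: the genome is a toy bit tuple instead of smali code, and `steady_state_ea`, `flip_one_bit`, and all parameter names and default values are assumptions introduced for the example. Following Algorithm 1, higher fitness is treated as better; an objective that is minimised can be passed in negated.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def steady_state_ea(
    original: T,
    mutate: Callable[[T, random.Random], T],
    fitness: Callable[[T], float],
    pop_size: int = 10,        # hypothetical default
    iterations: int = 100,     # hypothetical default
    tournament_k: int = 3,     # hypothetical default
    seed: int = 0,
) -> List[Tuple[float, T]]:
    """Mutation-only steady-state EA; returns (fitness, individual) pairs."""
    rng = random.Random(seed)

    # Lines 1-2: every initial member is one mutation of the unmodified original.
    population: List[Tuple[float, T]] = []
    for _ in range(pop_size):
        child = mutate(original, rng)
        population.append((fitness(child), child))

    for _ in range(iterations):
        # Line 5: tournament selection among k randomly sampled members.
        contenders = rng.sample(population, min(tournament_k, len(population)))
        _, parent = max(contenders, key=lambda pair: pair[0])

        # Lines 6-7: create and score one child.
        child = mutate(parent, rng)
        f_child = fitness(child)

        # Lines 8-11: the child replaces the worst member only if strictly fitter.
        worst = min(range(len(population)), key=lambda i: population[i][0])
        if f_child > population[worst][0]:
            population[worst] = (f_child, child)

    return population


def flip_one_bit(genome: Tuple[int, ...], rng: random.Random) -> Tuple[int, ...]:
    """Toy mutation operator: flip a single randomly chosen bit."""
    i = rng.randrange(len(genome))
    return genome[:i] + (1 - genome[i],) + genome[i + 1:]


if __name__ == "__main__":
    final = steady_state_ea((0,) * 16, flip_one_bit, sum, iterations=200)
    print(sorted(f for f, _ in final))
```

Because a child only displaces the current worst member when it is strictly fitter, the worst fitness in the population can never decrease from one iteration to the next.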

3.2. Initialisation

To construct the initial population, a single mutation operator (as detailed in Section 3.4) is applied once to the original malware sample. The operator is selected uniformly at random, and this procedure is repeated p times, each time beginning from the unmodified parent, to yield p distinct initial mutants.

3.3. Selection of Parent Malware

Parent selection is designed to favour higher-quality candidates so that fitter individuals are more likely to generate offspring. In this study, tournament selection [42] is employed, as it provides a simple mechanism for adjusting selection pressure. Under this scheme, a subset of k candidates is sampled at random, and the individual with the highest fitness among them is chosen as the parent to produce the next offspring.
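A tournament of size k can be illustrated as follows. The function name `tournament_select` and the choice to sample contenders without replacement are assumptions made for this sketch; the text above specifies only that k candidates are sampled at random and the fittest is chosen.

```python
import random
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def tournament_select(
    population: Sequence[Tuple[float, T]], k: int, rng: random.Random
) -> T:
    """Return the fittest of k randomly sampled (fitness, individual) pairs."""
    if not population:
        raise ValueError("population must not be empty")
    k = max(1, min(k, len(population)))          # clamp k to a valid range
    contenders = rng.sample(list(population), k)  # sampled without replacement
    _, winner = max(contenders, key=lambda pair: pair[0])
    return winner


if __name__ == "__main__":
    pop = [(0.1, "a"), (0.9, "b"), (0.5, "c")]
    print(tournament_select(pop, 2, random.Random(1)))
```

The parameter k controls selection pressure: with k = 1 the parent is chosen uniformly at random, whereas with k equal to the population size the fittest individual is always selected.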

3.4. Smali-Level Mutation Operators and Execution Constraints

Scope and representation—To create a mutant (i.e., a child) from a parent, we mutate smali (the Dalvik/ART assembly recovered via apktool) rather than Java source, because smali offers a stable, low-level intermediate representation (IR) that supports precise, semantics-preserving edits to Android applications.

Operators (one chosen uniformly per mutation)—We employ three smali-specific operators as implemented in our engine:

1. M1—Instruction reordering (control-flow no-op insertion). We inject inert control-flow blocks immediately after local-register declarations (e.g., .locals 3), using label patterns goto :goto_k, :cond_k, :goto_k and a benign const-string assignment. This perturbs basic-block layout and CFG structure without affecting semantics (instructionreordering).
2. M2—Garbage (junk) code insertion. Following a .locals line, we inject benign statements such as .line 0 and const-string v0, " " that alter the textual structure but not the program’s logic (garbagecodeinsert).
3. M3—Variable/register renaming. We rename registers within method scope while respecting .locals bounds and type uses, preserving method signatures and avoiding register overflow (variablerename).

Operator-smali interaction—Operators are pattern-driven: regular expressions identify insertion points (e.g., .locals N) and ensure label uniqueness within a method’s scope.
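As an illustration of such a pattern-driven operator, the sketch below applies an M2-style junk insertion to smali text using a regular expression anchored on `.locals N`. It is not the engine's implementation: the function name `insert_garbage_code` and the exact pattern are assumptions. The sketch also skips methods declared with `.locals 0`, an additional safeguard assumed here because `v0` would then not be a free local register.

```python
import re

# Matches a whole ".locals N" line and captures its indentation and N.
LOCALS_RE = re.compile(
    r"^(?P<indent>[ \t]*)\.locals[ \t]+(?P<n>\d+)[ \t]*$", re.MULTILINE
)


def insert_garbage_code(smali_text: str) -> str:
    """Insert '.line 0' and a benign const-string after each '.locals N' (N >= 1)."""

    def add_junk(match: "re.Match[str]") -> str:
        if int(match.group("n")) == 0:
            return match.group(0)  # assumption: leave methods without locals untouched
        indent = match.group("indent")
        junk = f'\n{indent}.line 0\n{indent}const-string v0, " "'
        return match.group(0) + junk

    return LOCALS_RE.sub(add_junk, smali_text)


if __name__ == "__main__":
    sample = (
        ".method public run()V\n"
        "    .locals 2\n"
        "    return-void\n"
        ".end method\n"
    )
    print(insert_garbage_code(sample))
```

Writing to `v0` at the very start of a method body is harmless in this setting because no earlier instruction has stored a value in that local register.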

Execution constraints—Offspring must (i) recompile, sign, and align (apktool → apksigner → zipalign), (ii) launch on an emulator, and (iii) exhibit malicious behaviour under Droidbox. Any failure yields the worst fitness.
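These hard constraints can be expressed as a wrapper around the objective. The sketch below assumes a minimised objective (as in Section 3.5), represents the worst fitness as positive infinity, and takes the individual checks as injected callables; `constrained_fitness` and `WORST_FITNESS` are hypothetical names, and the real checks (rebuild, emulator launch, Droidbox) are replaced by stubs in the usage example.

```python
from typing import Callable, Iterable

# Assumption: objectives are minimised, so +inf is the worst possible value.
WORST_FITNESS = float("inf")


def constrained_fitness(
    candidate: object,
    checks: Iterable[Callable[[object], bool]],
    objective: Callable[[object], float],
) -> float:
    """Return the objective value only if every feasibility check passes."""
    for check in checks:
        if not check(candidate):
            return WORST_FITNESS  # stop at the first failed check
    return objective(candidate)


if __name__ == "__main__":
    # Stub checks standing in for rebuild/sign/align, emulator launch, and Droidbox.
    rebuilds = lambda apk: apk != "broken.apk"
    launches = lambda apk: True
    is_malicious = lambda apk: True
    checks = [rebuilds, launches, is_malicious]
    print(constrained_fitness("mutant.apk", checks, lambda apk: 0.2))
    print(constrained_fitness("broken.apk", checks, lambda apk: 0.2))
```

Ordering the checks from cheapest to most expensive means that a mutant which fails to rebuild is rejected before any emulator or sandbox time is spent on it.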

Operator justification and expected contributions—The three operators target distinct dimensions of metamorphic variation. Instruction reordering perturbs basic-block boundaries and control-flow graphs with inert jumps, which increases structural dissimilarity while often reducing signature match confidence in AV heuristics, consistent with the elevated evasiveness observed in Section 4.3 (Figure 3, Figure 4 and Figure 5). Garbage code insertion primarily alters textual/code-level features (e.g., token counts, line directives) and therefore drives structural diversity with minimal impact on behaviour. Variable/register renaming preserves semantics while subtly modifying register allocation patterns; its effect is most visible in source/text similarity and, to a lesser degree, in detectors that incorporate register-aware signatures. These complementary effects motivated the operator set and help explain why evolving for BS/SS can indirectly yield evasiveness.

Implementation specifics—Operators are applied on smali with regex-anchored insertion points (e.g., after .locals N), unique label scoping within method boundaries, and register-safety checks consistent with .locals. M1 injects inert control-flow blocks and benign const-string; M2 inserts .line/const-string statements; M3 renames registers within scope while preserving signatures and avoiding overflow.

On crossover—Given smali structure and strict executability checks (rebuild/sign/align; emulator launch; Droidbox), mutation-only steady-state search proved sufficient to induce diversity; nevertheless, crossover over compatible method blocks is feasible and can be integrated without altering the validation pipeline.

3.5. Fitness Functions (Quantitative Definitions)

We run separate EAs, each minimising one property under executability constraints (lower is better):

Detection Rate (DR)—For variant $x$, let $d_{i}(x) \in \{0, 1\}$ indicate whether VirusTotal [43] engine $i$ (of 63) flags $x$. Define

$$DR(x) = \frac{1}{63} \sum_{i=1}^{63} d_{i}(x).$$

Lower DR indicates higher evasiveness.
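The detection rate is a simple average over per-engine verdicts, as the following sketch shows. The function name `detection_rate` and the example verdict list are hypothetical; real verdicts would come from the VirusTotal report for each variant.

```python
from typing import Sequence


def detection_rate(flags: Sequence[bool]) -> float:
    """Fraction of engines that flag the sample (DR in the text)."""
    if not flags:
        raise ValueError("at least one engine verdict is required")
    return sum(1 for flagged in flags if flagged) / len(flags)


if __name__ == "__main__":
    # Hypothetical report: 4 of 63 engines flag the variant.
    verdicts = [True] * 4 + [False] * 59
    dr = detection_rate(verdicts)
    print(f"DR = {dr:.4f}, evasiveness = {1 - dr:.4f}")
```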

Behavioural Similarity (BS)—Let $b(x) \in \mathbb{R}^{251}$ be the system-call frequency vector for $x$; with $b(m)$ the parent’s vector, we compute

$$BS(x) = \cos(b(x), b(m)) = \frac{b(x) \cdot b(m)}{\lVert b(x) \rVert \, \lVert b(m) \rVert}.$$

Minimising BS encourages behavioural diversification while remaining executable.
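A minimal sketch of the cosine computation over frequency vectors is given below. The function name `behavioural_similarity` is hypothetical, and the short vectors in the usage example stand in for the 251-dimensional system-call vectors; raising an error for an all-zero vector is a choice made for this sketch.

```python
import math
from typing import Sequence


def behavioural_similarity(b_x: Sequence[float], b_m: Sequence[float]) -> float:
    """Cosine similarity between a variant's and the parent's frequency vectors."""
    if len(b_x) != len(b_m):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(b_x, b_m))
    norm_x = math.sqrt(sum(a * a for a in b_x))
    norm_m = math.sqrt(sum(b * b for b in b_m))
    if norm_x == 0.0 or norm_m == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_x * norm_m)


if __name__ == "__main__":
    parent = [3, 0, 1, 5]    # toy system-call counts
    variant = [6, 0, 2, 10]  # same call mix, twice as many calls
    print(behavioural_similarity(variant, parent))
```

Cosine similarity depends only on the direction of the vectors, so a variant that issues the same mix of system calls at a different overall volume still scores close to 1.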

Note: BS relies on frequency vectors and cosine similarity, which emphasise behavioural composition rather than temporal ordering or higher-level semantics. Sequence-aware characteristics are addressed in evaluation via LSTM/BERT (Section 4.4.2). Future assessments may incorporate sequence metrics (e.g., n-gram divergences or alignment-based distances) or graph-based behavioural metrics to complement BS without altering the current experimental results.

Structural Similarity (SS)—Let $s_{j}(x) \in [0, 1]$ be normalised text/code similarity scores (text: cosine, Levenshtein, fuzzy matching; code: JPlag [44], Sherlock) and NCD (7-Zip). We aggregate

$$SS(x) = \frac{1}{J} \sum_{j=1}^{J} s_{j}(x).$$

Lower SS promotes text/code-level diversification.
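The aggregation can be sketched as below, together with one of the component metrics. Two substitutions are made here and should be read as assumptions: the normalised compression distance uses Python's `zlib` rather than 7-Zip, and NCD is mapped to a similarity as 1 − NCD clipped to [0, 1], a mapping not specified in the text. All function names are hypothetical.

```python
import zlib
from statistics import mean
from typing import Sequence


def ncd(a: bytes, b: bytes) -> float:
    """Normalised compression distance, using zlib as the compressor."""
    c_a = len(zlib.compress(a))
    c_b = len(zlib.compress(b))
    c_ab = len(zlib.compress(a + b))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)


def ncd_similarity(a: bytes, b: bytes) -> float:
    """Assumed mapping from distance to similarity: 1 - NCD, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - ncd(a, b)))


def structural_similarity(scores: Sequence[float]) -> float:
    """Unweighted mean of J normalised similarity scores (SS in the text)."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must be normalised to [0, 1]")
    return mean(scores)


if __name__ == "__main__":
    parent = b'const-string v0, "a"\n' * 50
    variant = parent + b".line 0\n"
    # The other two scores are placeholder values for text/code metrics.
    scores = [ncd_similarity(parent, variant), 0.80, 0.75]
    print(structural_similarity(scores))
```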

Optimising BS or SS drives variants away from the parent in runtime behaviour or structure; combined with executability checks, this promotes evolutionary diversification and frequently yields evasiveness as a by-product (Section 4.3.1).

Computation and weighting—The detection rate is measured as $DR(x) = \frac{1}{E} \sum_{i=1}^{E} d_{i}(x)$, where $E = 63$ engines at experiment time, and evasiveness is $1 - DR$. Behavioural similarity is $BS(x) = \cos(b(x), b(m))$ over 251-d syscall frequency vectors. Structural similarity is $SS(x) = \frac{1}{J} \sum_{j=1}^{J} s_{j}(x)$ after normalising heterogeneous text/code metrics and NCD. Each fitness was optimised independently (no cross-objective weighting); this isolates objective-specific evolutionary pressure and keeps executability constraints unchanged.

On Single-Objective vs. Multi-Objective Optimisation

The optimisation of DR, BS, and SS was carried out in separate single-objective evolutionary runs to isolate the specific evolutionary pressures associated with each property. This design makes it possible to attribute observed evasiveness, behavioural changes, and structural variation directly to the selected objective. However, metamorphic malware generation is inherently multi-objective: evasiveness, behavioural preservation, and structural dissimilarity interact and can involve conflicting requirements. A multi-objective EA (e.g., NSGA-II [45]) would therefore provide an explicit characterisation of these trade-offs and represents a natural methodological extension of this work. Our formulation thus offers objective-specific insight while acknowledging that a full Pareto-based treatment is an important direction for future development. We further clarify that a combined multi-objective configuration was not evaluated in this study; the choice to optimise each objective independently was deliberate, allowing us to attribute the resulting evasive, behavioural, and structural characteristics directly to the objective under optimisation rather than to combined interactions.

MOEA integration—A multi-objective EA can optimise $\min (DR, BS, SS)$ subject to executability feasibility, with dominance-based selection (e.g., Pareto ranking and crowding) and steady-state replacement. Feasibility filters (rebuild/sign/align; emulator launch; Droidbox maliciousness) act as hard constraints; infeasible individuals receive the worst rank. This preserves the current validation while exposing explicit trade-offs.
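The dominance relation underlying such a multi-objective extension can be sketched as follows. This is an illustration of Pareto dominance for three minimised objectives only; it does not implement crowding or the feasibility filters, and the names `dominates` and `pareto_front` as well as the example values are hypothetical.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float, float]  # (DR, BS, SS), all minimised


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: Sequence[Objectives]) -> List[Objectives]:
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    candidates = [
        (0.10, 0.90, 0.50),  # very evasive, behaviourally close to the parent
        (0.20, 0.95, 0.60),  # worse than the first on all three objectives
        (0.30, 0.20, 0.40),  # less evasive, but far more diverse
    ]
    print(pareto_front(candidates))
```

In this example the second candidate is dominated by the first, while the first and third represent different trade-offs between evasiveness and diversity and therefore both remain on the front.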

EA pipeline (pseudocode)—Algorithm 2 summarises the mutation-only steady-state EA with executability and maliciousness validation.

Algorithm 2 Steady-State EA for Smali-level Malware Mutation

1: Input: parent malware $m$; objective $f \in \{DR, BS, SS\}$; population size $P$; iterations $T$
2: Output: validated mutants with fitness values and per-engine signatures
3: Initialise population $P$ by mutating $m$ once; validate (rebuild/sign/align → emulator → Droidbox); keep only feasible individuals
4: for $t \leftarrow 1$ to $T$ do
5:     Select parent $p \in P$ (e.g., tournament)
6:     Sample operator from {M1, M2, M3} uniformly; apply to $p$ to obtain $x$
7:     Validate $x$: rebuild/sign/align; emulator launch; Droidbox maliciousness
8:     if $x$ infeasible then
9:         assign worst fitness and discard
10:     else
11:         evaluate $f(x)$; insert $x$ into $P$ with steady-state replacement
12:     end if
13: end for
14: return $P$

4. Experiments and Discussion

4.1. Malware Samples for Evaluation

The malware samples used in this study were drawn from the Contagio Minidump [46] and the MalGenome dataset [47]. The Contagio Minidump collection contains 237 malware instances, while the MalGenome repository provides 1260 Android samples spanning 49 distinct malware families. For this work, we selected samples from these datasets based on the specific malicious payloads they exhibit, ensuring that the resulting dataset reflects a range of well-established and representative attack behaviours. Following the categorisation outlined by the authors in [48], four groups are used: (i) privilege escalation attacks, (ii) remote control malware such as DroidKungfu [49], (iii) financially motivated malware such as GGTracker [50], and (iv) personal information stealing malware such as Dougalek [51]. This categorisation captures a broad range of malicious behaviours that are central to Android malware research, thereby providing a balanced and meaningful dataset for evaluation. While the dataset may not cover every malware family or non-Android variant, the chosen samples are sufficient to illustrate the practical applicability and effectiveness of the proposed method across multiple key categories of real-world attacks.

Family-to-category mapping—In this study, we categorise: Dougalek as personal-information stealing (spyware), DroidKungFu as remote-control/backdoor, and GGTracker as financially motivated (premium SMS billing fraud), following the cited vendor reports.

Family selection rationale—Dougalek, DroidKungFu, and GGTracker were selected because they represent distinct and well-documented payload categories (personal-information theft, remote-control/backdoor, and financial fraud), offered enough functional instances for repeated evolutionary runs, and exhibited behaviours amenable to emulator-driven tracing. This ensured representative behavioural coverage while maintaining reproducibility across runs.

4.1.1. Ethical and Security Safeguards

All experimentation for this research was conducted within a strictly controlled, isolated virtual machine (VM) environment using legally sourced malware samples. The project’s adversarial techniques are applied solely to stress-test and improve defensive AI models, not to create offensive tools. Accordingly, all findings are framed for defensive cybersecurity purposes, with any disseminated code being sanitised to prevent misuse, in full acknowledgement of the research’s dual-use potential.

4.1.2. Dataset Composition and Validation

Sources and counts—Malware samples come from Contagio Minidump and MalGenome. We use two training sets: (i) 6020combo: 60 benign (20 entertainment, 20 security, 20 communication from Google Play) and 60 malicious (20 each from Dougalek, DroidKungFu, GGTracker); (ii) 6050combo: the same 60 benign plus 157 malicious (50 Dougalek, 55 DroidKungFu, 52 GGTracker). Test sets: for 6050combo → 27 benign, 23 malicious (10/5/8), and for 6020combo → 27 benign, 16 malicious (10/3/3). Details of the dataset can be found in [52].

Validation—Each mutant must (a) recompile/sign/align and launch on an emulator; (b) show malicious behaviour under Droidbox; and (c) be cross-checked with VirusTotal for per-engine outcomes (used for DR). Benign apps were sourced from Google Play.

4.2. Evolutionary Algorithm-Method

Experiments are performed on the three selected malware families (Dougalek, DroidKungfu, and GGTracker). For each family, the EA is executed once per fitness objective, giving rise to nine experimental treatments. Owing to the stochastic nature of the EA, each treatment is repeated ten times, and the best fitness value obtained in each run is recorded.

4.2.1. Evolutionary Parameters

The configuration of the evolutionary process used to generate evasive variants is summarised in Table 1, with parameter choices based on empirical tuning. The EA employs uniform mutation with a mutation rate of one, ensuring that a mutation operator is applied at every generation, as mutation is the sole variation operator in this study. Crossover is omitted because, in preliminary trials, it frequently produced non-executable artefacts. Tournament selection with a tournament size of five is used to provide a balanced level of selection pressure. Given the substantial runtime cost of evaluating each mutant, the number of iterations is capped at 100, and the population size is set to 20.

Random Search Baseline

We benchmark evolutionary search against a Random Search (RS) that generates and evaluates 120 one-step mutants (uniformly sampling a single operator per variant), matching the number of search points visited by the EA (an initial population of 20 plus 100 iterations with steady-state replacement, i.e., $20 + 100 = 120$). RS variants undergo the same executability checks and fitness evaluation as EA variants.

For parity, RS applies exactly one uniformly sampled operator per variant, the same mutation depth and operator distribution used by the EA, so differences observed isolate the effect of evolutionary selection pressure rather than unequal mutation budgets.
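A budget-matched random search of this kind might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: `operators`, `validate`, and `fitness` are hypothetical callables, and the budget of 120 corresponds to 20 initial individuals plus 100 steady-state iterations.

```python
import random
from typing import Callable, List, Optional, Tuple

EA_EVALUATIONS = 20 + 100  # initial population plus one new candidate per iteration


def random_search(
    parent: str,
    operators: List[Callable[[str], str]],
    validate: Callable[[str], bool],
    fitness: Callable[[str], float],
    budget: int = EA_EVALUATIONS,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    rng = rng or random.Random()
    results: List[Tuple[str, float]] = []
    for _ in range(budget):
        # Exactly one uniformly sampled operator is applied to the parent,
        # so every variant has mutation depth one.
        variant = rng.choice(operators)(parent)
        # The same feasibility gate as the EA: infeasible variants are dropped.
        if validate(variant):
            results.append((variant, fitness(variant)))
    return results
```

Because every variant is derived directly from the parent, no selection pressure accumulates, which is the property the comparison relies on.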

4.2.2. Collection of Relevant Metrics

In carrying out the experimental work, the following tools and libraries were employed.

AV engine: Detection results are obtained from VirusTotal, from which the function DR(x) derives its value. VirusTotal is an online malware-scanning platform that aggregates the results of more than 70 antivirus engines, and it is used in this work to evaluate the evasiveness of the generated mutants. DR(x) is the fraction of antivirus engines that detect the variant, normalised to a value between 0 and 1, where 0 means the variant evaded every engine and 1 means it was detected by every engine; evasiveness is therefore 1 − DR(x). The per-engine information behind each mutant's evasion score is also retained. Although VirusTotal now aggregates over 70 antivirus/analysis engines, 63 engines were consistently exposed via our scanning workflow at the time our experiments were executed, and all DR computations were normalised to 63 for comparability across treatments.
Collecting the behavioural trace: Strace [53] is employed to capture the runtime behaviour of each variant by logging its system calls. The primary activity of the application is triggered using MonkeyRunner [54] to emulate user interaction. The output produced by Strace consists of a log containing the process ID together with the invoked system calls and their arguments. This log is subsequently converted into a fixed-length vector in which each element corresponds to the frequency of a specific system call. As the analysis considers 251 system calls, the resulting vector contains 251 entries. The behavioural similarity BS(x) is then computed as the cosine similarity between the system-call frequency vector of a variant and that of its corresponding original malware sample.
Libraries for structural similarity SS(x): To assess text-level similarity between the original malware samples and their mutants, the following metrics are employed:
– Cosine similarity: Computes the cosine of the angle between two non-zero vectors.
– Levenshtein distance: Measures the minimum number of deletions, insertions, or substitutions required to transform file A into file B [55].
– FuzzyWuzzy: Performs approximate string matching to identify near-similar text patterns [56].
For source-code-level similarity, we apply:
– JPlag: Tokenises both programs and identifies matching token sequences from the largest to the smallest [44].
– Sherlock: Generates signatures of both programs and computes similarity between these signatures [44].
– Normalised Compression Distance (NCD): Estimates similarity by comparing compression lengths of files. Given the original malware m and a variant v, the compression distance is defined as [57]:

$NCD_{Z}(m, v) = \frac{Z(mv) - \min\{Z(m), Z(v)\}}{\max\{Z(m), Z(v)\}},$ (1)

where $Z(m)$ is the compressed size of file m under compressor Z. In this work, 7-Zip is used as the compression engine.
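Equation (1) can be illustrated with a few lines of Python. This sketch swaps in Python's standard `lzma` module as the compressor Z, which is an assumption made for self-containment (the study itself uses 7-Zip), and it takes $Z(mv)$ to be the compressed size of the two inputs concatenated.

```python
import lzma


def compressed_size(data: bytes) -> int:
    # Z(.): length in bytes of the LZMA-compressed input (stand-in for 7-Zip).
    return len(lzma.compress(data))


def ncd(m: bytes, v: bytes) -> float:
    # Normalised Compression Distance as in Equation (1).
    z_m = compressed_size(m)
    z_v = compressed_size(v)
    z_mv = compressed_size(m + v)  # compressed size of the concatenation
    return (z_mv - min(z_m, z_v)) / max(z_m, z_v)
```

Values near 0 indicate that one input adds little new information to the other, while values near 1 indicate largely unrelated content. If a similarity score is needed, one option is `1 - ncd(m, v)`, though whether the study converts NCD in this way is not stated here.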

Each of the similarity metrics produces a value between 0 and 1, where a score of 1 means the original malware and the variant are identical, and 0 means the original malware and the variant are completely different. The average of these metrics is computed, and that represents the structural similarity. These metrics were implemented using various Python 2.7 and glibc 2.23 libraries.

For DR, we retain each variant’s 63-bit detection signature for diversity statistics and for the t-SNE projection in Section 4.3.2.
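The DR and BS computations described above reduce to simple vector operations. The sketch below assumes a detection signature is a list of 0/1 flags (1 meaning the engine detected the variant) and that behavioural traces are already summarised as system-call frequency vectors; the function names are illustrative only.

```python
import math
from typing import Sequence


def detection_rate(signature: Sequence[int]) -> float:
    # Fraction of engines that detect the variant (0 = none, 1 = all).
    return sum(signature) / len(signature)


def evasiveness(signature: Sequence[int]) -> float:
    # Fraction of engines that fail to detect the variant, i.e. 1 - DR.
    return 1.0 - detection_rate(signature)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # BS(x): cosine of the angle between two system-call frequency vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty trace is treated as dissimilar
    return dot / (norm_a * norm_b)
```

For example, a variant flagged by 21 of 63 engines has a DR of one third and an evasiveness of two thirds.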

Structural Similarity Aggregation and Sensitivity

We aggregate heterogeneous, normalised structural metrics (textual and code-level) using an unweighted mean to avoid privileging any one family of measures. This yields a single interpretable SS score per variant. Qualitatively, the conclusions reported in Section 4.3 are robust to the omission of any single component (i.e., treatment rankings remain unchanged in our checks), supporting the use of equal weighting as a conservative aggregator.
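A minimal sketch of the equal-weight aggregation and of a leave-one-out sensitivity check is given below. The metric names and values in the usage note are hypothetical, and each component is assumed to be already normalised so that 1 means identical.

```python
from statistics import mean
from typing import Dict


def structural_similarity(scores: Dict[str, float]) -> float:
    # Unweighted mean of the normalised component metrics.
    return mean(scores.values())


def leave_one_out(scores: Dict[str, float]) -> Dict[str, float]:
    # SS recomputed with each component omitted in turn.
    return {
        omitted: mean(v for name, v in scores.items() if name != omitted)
        for omitted in scores
    }
```

Calling `leave_one_out({"cosine": 0.8, "levenshtein": 0.6, "ncd": 0.7})` returns the three reduced averages, which can then be used to check whether treatment rankings change when a component is dropped.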

Executability success rates—For each treatment, we track (i) the proportion of mutants that rebuild/sign/align successfully, (ii) the proportion that launch on the emulator, and (iii) the proportion confirmed malicious by Droidbox.

VirusTotal Temporal Stability and Reproducibility

VirusTotal signatures evolve continuously as engines update their databases. To mitigate temporal drift when computing DR for a given treatment (family × fitness), all scans were conducted within the same experimental window. This ensures that each set of variants is evaluated under a consistent AV-signature state. Nonetheless, we acknowledge that repeated scans performed days or weeks apart may lead to slightly different DR values due to signature updates across the 63 engines. We therefore treat DR as a temporally bounded measurement rather than a stable or independent probability estimate and recommend re-running scans within a narrow time window when reproducing our results, in line with standard practice for dynamic VirusTotal-based evaluation.

Engine dependence and interpretation of evasion percentages—VirusTotal’s aggregated antivirus engines are correlated rather than independent, and their signatures evolve continuously. Consequently, the percentage of undetected engines for a given sample should not be interpreted as an independent probability of real-world evasion. Instead, it serves as a heuristic stress-testing signal that allows consistent comparison across evolutionary treatments, all normalised to the 63 engines available at experiment time.

Executability validation and treatment of failures - Every candidate mutant is subjected to (i) rebuild/sign/align, (ii) emulator launch, and (iii) Droidbox maliciousness verification. Only variants that pass all stages are retained for analysis; any failure is treated as infeasible and assigned worst fitness, preventing selection and propagation. This ensures that all reported detection and diversity results are based exclusively on executable, behaviourally valid artefacts, and that non-executable cases cannot inflate diversity or bias DR. We record pass/fail outcomes at each stage during all runs to enforce these feasibility constraints; per-treatment counts are not reported in this manuscript.

Practical Runtime and Scalability Considerations

The end-to-end EA evaluation requires approximately 55.5 h when executed sequentially on a single emulator. Fitness evaluation (rebuild/sign/align, emulator execution, Droidbox tracing, and VirusTotal queries) is independent for each candidate, which makes the workload embarrassingly parallel. When distributed across k emulator instances, the expected wall-clock time reduces to roughly $55.5 / k$ hours; for example, running eight emulators in parallel results in an estimated total runtime of about 7 h, aside from minor coordination overheads. Lightweight pre-filters (e.g., static sanity checks before emulation) can further reduce the validation burden by discarding clearly invalid mutants. The experiments used a fixed-cost configuration to maintain comparability across malware families, and reduced-budget EA runs were not explored; examining cost-performance trade-offs remains a promising direction for future work.
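The scaling estimate and the idea of evaluating candidates concurrently can be sketched as follows. The thread pool is an illustrative assumption, since the text does not say how emulator instances would be orchestrated, and `evaluate` is a hypothetical callable wrapping the full validation and scoring pipeline for one candidate.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

SEQUENTIAL_HOURS = 55.5  # sequential runtime reported in the text


def estimated_hours(k: int) -> float:
    # Idealised wall-clock time on k emulators, ignoring coordination overhead.
    if k < 1:
        raise ValueError("k must be at least 1")
    return SEQUENTIAL_HOURS / k


def evaluate_in_parallel(
    candidates: Sequence[str],
    evaluate: Callable[[str], float],
    workers: int = 8,
) -> List[float]:
    # Each candidate is scored independently; results keep the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, candidates))
```

Here `estimated_hours(8)` gives 6.9375, consistent with the figure of about 7 h quoted above.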

4.3. Evolutionary Algorithm—Results and Analysis

4.3.1. Influence of Fitness Function on Evasiveness of Evolved Mutants

This section examines how each of the three fitness functions affects the EA’s ability to identify new evasive variants. Using the 10-run framework described in Section 3.5, each generated mutant is subsequently checked to ensure it remains malicious, following the procedure outlined in Section 3. The number of variants that retain their malicious behaviour is summarised in Table 2.

Table 2 indicates that at least 70% of all runs yield mutants that remain malicious, irrespective of the malware family or fitness function. The EA successfully generates malicious variants in all 10 runs when optimising DR(x) for the DroidKungfu family, and likewise achieves a perfect success rate for the Dougalek family when optimising SS(x).

To analyse evasiveness, we compute the percentage of antivirus engines that fail to detect each malicious mutant across the x runs. This process is performed separately for each malware family and fitness function. In Figure 3, Figure 4 and Figure 5, the red line denotes the detection failure rate for the original malware. We also compare the EA’s effectiveness with that of a Random Search baseline.

Dougalek: Figure 3 shows that all three fitness functions produce mutants that are consistently more evasive than the original malware. While 40.3% of AV engines fail to detect the original Dougalek sample, the best EA-generated variants evade 72%, 66.7%, and 67.3% of engines under DR(x), BS(x), and SS(x), respectively. Notably, even when the EA does not explicitly optimise for evasiveness (i.e., using BS(x) or SS(x)), it still yields highly evasive variants. The median detection-failure rates for EA-generated mutants range from 62.1% to 69.4%, depending on the objective. In contrast, the Random Search baseline achieves a median of 58.5%, demonstrating that the EA provides a measurable advantage.
Droidkungfu: A similar pattern appears in Figure 4. The original sample evades 65% of engines, whereas the strongest mutants produced by the EA evade 94%, 82.1%, and 83% of engines for DR(x), BS(x), and SS(x), respectively. Median evasion rates for EA-generated mutants lie between 73.2% and 87.3%, while the Random Search baseline achieves only 63.1%, again showing that guided evolution offers a clear benefit.
GGtracker: Figure 5 demonstrates that all objectives again produce mutants that outperform the original malware in terms of evasiveness. Only 38.3% of engines fail to detect the original GGTracker sample, while the best evolved variants evade 73.3%, 62.1%, and 62.1% of engines for DR(x), BS(x), and SS(x), respectively. The similarity of the boxplots for all three fitness functions indicates that substantial evasion can arise even when the EA primarily targets behavioural or structural diversification.
From Figure 5, the median detection-failure rate for GGTracker mutants produced by the EA ranges from 61.4% to 66%, depending on the fitness function. In contrast, the Random Search baseline (RS(x)) attains a median rate of 61.4%. The similarity between RS(x) and the outcomes for BS(x) and SS(x) indicates that, for this family, these two objectives do not surpass the random baseline when used in an evolutionary setting. However, Figure 5 also shows that EA runs optimising DR(x) yield clearly higher evasiveness than RS(x).

Objective-specific effects—The patterns in Table 3 also reflect the distinct evolutionary pressures exerted by each fitness objective. DR-guided evolution consistently drives populations toward detector blind spots, whereas BS and SS encourage behavioural and structural drift that can indirectly yield evasion, particularly for families such as GGTracker. These objective-specific tendencies complement the family-level trends discussed below and help explain why different fitness functions achieve different evasion profiles across the three malware families.

Run-to-run variability—For each family×objective treatment, we report distributions across repeated runs: medians, IQRs, and 95% bootstrap CIs for evasiveness (1−DR), alongside the best-of-run values. This complements Table 3 and ensures stochastic robustness is evidenced beyond single best points.
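One common way to obtain such an interval is the percentile bootstrap, sketched below for the median. The resample count, the fixed seed, and the choice of the percentile method are assumptions, since the text does not specify which bootstrap variant was used.

```python
import random
from statistics import median
from typing import Sequence, Tuple


def bootstrap_ci_median(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Percentile method: take the alpha/2 and 1 - alpha/2 quantiles.
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With only ten runs per treatment the resulting intervals are necessarily coarse, so they are best read alongside the medians and IQRs.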

Statistical analysis across all malware families is summarised in Table 3. The data reveal consistent and significant performance advantages for the specialised fitness metrics (DR(x), BS(x), SS(x)) over Random Search (RS(x)). As expected, DR (Detection Rate), the only fitness metric that evolves directly for evasiveness, emerged as the most effective and consistent strategy: it achieved the highest evasiveness scores in every dataset, peaking at 88.0% evasiveness for the Droidkungfu family, meaning nearly 9 out of 10 detection engines failed to identify the malware. The BS(x) and SS(x) metrics also substantially outperformed the random baseline across the board. In contrast, the RS(x) method was the least effective, with its 95% confidence intervals for mean evasiveness (36.6–41.5%) showing no overlap with the ranges of the other methods, providing strong statistical evidence that its lower performance is a real effect and not due to chance. This pattern confirms that structured evasion techniques are markedly more successful at bypassing detection engines than random changes, with the Droidkungfu family showing the greatest susceptibility to these specialised methods.

4.3.2. Analysis of the Evasion Characteristics of the New Mutants

This subsection examines which antivirus engines were successfully bypassed by the evolved mutants but continued to detect the original malware. For each of the m malicious variants produced over the 10 EA runs for a particular fitness function, we count the number of times, f, that a detector d fails to identify the sample ($0 \le f \le 10$). Figure 6, Figure 7 and Figure 8 display these results: blue bars correspond to DR(x), orange bars to BS(x), and green bars to SS(x).

Figure display note (engines fooled by variants but not by the original malware)—All computations use the full engine set available at the time of experimentation (63 engines). The purpose of the analysis in Figure 6, Figure 7 and Figure 8 is to identify those detectors that were fooled by at least one evolved mutant while still detecting the original malware. Accordingly, we selectively label only engines that exhibit this mis-detection shift (parent detected → mutant undetected) in at least one of the ten evolutionary runs. Engines that never showed this shift are included in the underlying calculations (and in the 63-bit detection vectors used for DR and signature diversity) but are not individually labelled in the figure to preserve legibility.
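The per-engine counting and the selection of labelled engines can be sketched as follows. Signatures are assumed to be lists of 0/1 flags aligned with a list of engine names (1 meaning detected), and the engine names used in the usage note are placeholders.

```python
from typing import Dict, List, Sequence


def failure_counts(
    variant_signatures: Sequence[Sequence[int]],
    engines: Sequence[str],
) -> Dict[str, int]:
    # f per engine: number of variants that the engine fails to detect.
    counts = {engine: 0 for engine in engines}
    for signature in variant_signatures:
        for engine, detected in zip(engines, signature):
            if not detected:
                counts[engine] += 1
    return counts


def shifted_engines(
    parent_signature: Sequence[int],
    variant_signatures: Sequence[Sequence[int]],
    engines: Sequence[str],
) -> List[str]:
    # Engines that detect the parent but miss at least one variant.
    counts = failure_counts(variant_signatures, engines)
    return [
        engine
        for engine, detected in zip(engines, parent_signature)
        if detected == 1 and counts[engine] > 0
    ]
```

For a parent signature `[1, 1, 0]` over engines `["E1", "E2", "E3"]` and variants `[[1, 0, 0], [1, 0, 1]]`, only `E2` shows the parent detected to mutant undetected shift.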

Dougalek: Figure 6 shows the extent to which each engine fails against mutants when evolved under the three objectives. Under DR(x), fourteen engines, including AVG and Tencent, detect all mutants consistently, whereas seventeen engines, such as AVware and McAfee, fail to detect every mutant. For BS(x), nineteen engines (e.g., AVG and Fortinet) remain robust, while eleven engines (including GData and McAfee) fail on all mutants. Under SS(x), sixteen engines (e.g., Symantec Mobile and Fortinet) detect all variants, whereas seventeen engines, such as McAfee-GW and BitDefender, miss every mutant.
Droidkungfu: As shown in Figure 7, three engines, including Avast Mobile and NANO Antivirus, detect all DR(x) mutants, while twelve engines (e.g., Fortinet and Kaspersky) fail to detect any of them. For BS(x), six engines (including Avast Mobile and NANO Antivirus) succeed on all mutants, whereas seven engines (e.g., Symantec Mobile and Tencent) fail completely. Under SS(x), seven engines such as AVG and Cyren detect all variants, while eleven engines (including Kaspersky and ZoneAlarm) miss every mutant.
GGtracker: Figure 8 shows that nine engines (e.g., CAT-QuickHeal and DrWeb) detect all DR(x) mutants, whereas thirteen (such as Arcabit and BitDefender) fail consistently. Under BS(x), fifteen engines, including K7GW and Kaspersky, detect every mutant, while sixteen engines (e.g., AVware and TrendMicro) miss them all. Similarly, SS(x) yields sixteen engines (such as Avast and AVG) detecting all mutants, while eighteen (including Cyren and McAfee) fail across all samples.

Across the Dougalek and Droidkungfu families, DR(x) consistently results in the largest number of engines being fooled, which aligns with its objective of directly minimising detection. Interestingly, the GGTracker family shows a different pattern: mutants evolved under BS(x) and SS(x) are more often responsible for fooling larger numbers of engines than those produced under DR(x).

It is essential not only that mutants are evasive, but also that they exhibit diversity. Diversity may refer to behavioural variation (differences in system-call activity relative to both parent and sibling variants), structural diversity (variation in code-level or semantic characteristics), or differences in detection signatures across antivirus engines.

The preceding analysis focused on maximising one fitness metric at a time (behavioural dissimilarity, structural dissimilarity, or evasiveness) to generate variants that differ meaningfully from the parent malware. The remainder of this subsection extends that analysis by assessing whether repeated EA runs genuinely produce diverse variants and how that diversity manifests across three dimensions.

The diversity of mutants for each malware family is quantified using:

percentage of unique detection signatures,
percentage of unique behavioural signatures,
structural similarity.

(i): Detection-signature diversity: A detection signature is represented as a 63-dimensional vector d, where $d_{i} = 1$ if engine i detects the variant and $d_{i} = 0$ otherwise. For each subset $(m, f)$, denoting variants from malware family m evolved under fitness objective f, we compute the proportion of unique detection signatures. Of the 76 malicious mutants generated across all classes and objectives, removal of duplicates yields 46 unique detection vectors. Results in Table 4 show that Droidkungfu produces the greatest uniqueness for DR(x) (90%) and BS(x) (89%), while Dougalek produces the highest uniqueness under SS(x) (50%). To visualise the dispersion of these signatures, we project all 63-dimensional vectors into two dimensions using t-SNE [58]. This method prioritises local neighbourhood structure, so nearby points correspond to similar signatures; distances between well-separated clusters are less reliable and are interpreted only qualitatively. Figure 9 illustrates the resulting clusters. Droidkungfu mutants form a distinguishable cluster in the upper-left region of the plot, whereas variants from Dougalek and GGTracker are more intermixed but still exhibit sub-clustering based on the generating fitness function.
(ii): Behavioural-signature diversity: Each variant’s behavioural signature is a system-call frequency vector b of length 251. Using the same process applied to detection signatures, we count unique behavioural vectors per $(m, f)$ subset. Table 5 shows that Dougalek achieves 100% uniqueness under both DR(x) and BS(x). GGTracker is most diverse under SS(x). Notably, across all experiments, only 33 out of 251 system calls ever appear with non-zero frequency, regardless of malware family or fitness function, yet these combinations still yield substantial behavioural differentiation. Overall, behavioural signatures show greater variance than detection signatures, though both exhibit clear and meaningful diversity.
(iii): Structural diversity: Structural similarity scores between all pairwise combinations of mutants are computed, taking values between 0 (completely different) and 1 (identical). Heatmaps of these pairwise scores are shown in Figure 10, Figure 11 and Figure 12. The results indicate that for GGTracker, SS(x) produces the highest structural diversity. For Droidkungfu, BS(x) yields the most variation. For Dougalek, SS(x) again produces the most structurally distinct mutants. Across families, Droidkungfu consistently shows the greatest spread in structural differences, whereas Dougalek mutants exhibit the least structural variation overall.
Future direction—towards unified diversity metrics—In this paper, we examine diversity along three axes, detection-signature diversity, behavioural dispersion, and structural dispersion, using the quantitative measures already reported in Table 4 and Table 5, and Figure 10, Figure 11 and Figure 12. While these metrics are analysed separately to preserve methodological clarity, an interesting future direction would be to explore unified or composite diversity indices that combine these components into a single descriptive measure. Such an extension could enable more direct quantitative comparisons with evasiveness across objectives and families, but it lies outside the scope of the present analysis.
Quantitative metrics and visualisation—In addition to the t-SNE visualisations, our interpretation of diversity relies on the quantitative measures already reported above: (i) the proportion of unique 63-bit detection signatures, (ii) the mean pairwise behavioural distance based on (1−cosine) over syscall-frequency vectors, and (iii) the mean pairwise structural distance based on (1−SS). These numerical indicators form the basis of our comparative analysis across evolutionary treatments, while the t-SNE plots serve only as an illustrative projection of these underlying relationships rather than a source of statistical evidence.

4.3.3. Explicit Diversity Metrics and Evolution Benefits Matrix

We quantify diversity along three axes:

(a) Detection-signature diversity—For variant x, let $d(x) \in \{0, 1\}^{63}$ be the per-engine detection vector. We report the percentage of unique detection signatures per (family, objective) subset (Table 4) and visualise dispersion with t-SNE (Figure 9).

(b) Behavioural diversity—With system-call vectors $b(x)$, we summarise pairwise distances $1 - \cos(b(x), b(y))$ and report the percentage of unique behavioural signatures per subset (Table 5).

(c) Structural diversity—Using the structural similarity aggregate $SS$, we discuss pairwise distances $1 - SS$ (median/IQR) and provide heatmaps in Figure 10, Figure 11 and Figure 12.
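The first two diversity measures can be computed with a few helper functions, sketched below under the assumption that signatures and behavioural profiles are available as plain numeric lists. The structural case follows the same pairwise pattern with `1 - SS` in place of the cosine distance.

```python
import math
from itertools import combinations
from typing import List, Sequence


def unique_percentage(vectors: Sequence[Sequence[int]]) -> float:
    # Share of distinct vectors in a (family, objective) subset, in percent.
    return 100.0 * len({tuple(v) for v in vectors}) / len(vectors)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # 1 - cos(a, b); all-zero vectors are treated as maximally distant.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(vectors: Sequence[Sequence[float]]) -> List[float]:
    # One distance per unordered pair of variants.
    return [cosine_distance(a, b) for a, b in combinations(vectors, 2)]
```

As a point of reference, the 46 unique detection vectors among 76 mutants reported earlier correspond to roughly 60.5% uniqueness when all subsets are pooled.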

Evolution Benefits Matrix—Table 6 compiles, per family × objective, (i) evasiveness (mean $1 - DR$), (ii) % unique detection signatures, (iii) % unique behavioural signatures, and (iv) a qualitative structural-diversity summary based on the heatmaps. Notably, EA–DR attains the strongest evasiveness overall (e.g., DroidKungFu), while novelty-oriented objectives (BS/SS) achieve comparable evasiveness with higher uniqueness for GGTracker.

4.3.4. Contextualizing Our EA Method Within Existing Literature

While prior work has demonstrated the efficacy of evolutionary algorithms in generating evasive malware [33,34,35], their fitness functions were predominantly and in some cases exclusively oriented toward maximising detection evasion against specific classifiers (e.g., PDFrate/Hidost [34] or a Windows malware detector [35]). This narrow objective, while successful in creating evasive samples, overlooks a critical factor: the genetic diversity of the variant population. Consequently, the generated malware, though evasive, is often homogeneous and fails to provide the broad, feature-rich data necessary for robust machine learning generalisation. In contrast, our approach, as seen in Section 4.3.1 and Section 4.3.2, incorporates diversity as a core objective within a multi-faceted fitness function, thereby generating variants that not only achieve high evasiveness but also significantly enhance the generalisability and accuracy of the ML models they are designed to improve.

4.4. Machine Learning—Method

This section describes the machine-learning component used to classify metamorphic malware from behavioural evidence, addressing the fourth and fifth research questions. We train three complementary classifiers, Naïve Bayes (NB), Long Short-Term Memory (LSTM) networks, and a transformer (BERT), on features derived from system-call traces. The feature extraction and preprocessing pipeline is detailed in Section 4.4.1 and the model architectures and training settings are summarised in Section 4.4.2.

4.4.1. Feature Extraction and Preprocessing Pipeline

Behavioural traces—Each APK is executed on an emulator while strace logs system calls; we stimulate the main activity with MonkeyRunner to trigger application behaviour. From each run, we obtain a sequence and a frequency profile over a closed vocabulary of 251 system calls.

Non-sequential features (for Naïve Bayes)—We construct a 251-dimensional bag-of-system-calls (BoSC) vector $b(x)$ whose ith component is the frequency of the ith system call in the trace. These vectors are used for binary (benign vs. malicious) and multi-class (benign + 3 families) tasks.
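Building such a vector from a trace can be sketched as below. The regular expression assumes the usual strace line layout of an optional process ID followed by `name(args) = result`, and the four-entry vocabulary in the usage note is a hypothetical stand-in for the 251-call vocabulary used in the study.

```python
import re
from collections import Counter
from typing import Iterable, List, Sequence

# Optional leading PID, then the system-call name directly before "(".
SYSCALL_RE = re.compile(r"^\s*(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def bag_of_syscalls(log_lines: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    counts: Counter = Counter()
    for line in log_lines:
        match = SYSCALL_RE.match(line)
        if match:  # signal and exit lines do not match and are skipped
            counts[match.group(1)] += 1
    # Fixed-length vector: one frequency per vocabulary entry, in order.
    return [counts.get(name, 0) for name in vocabulary]
```

For a trace containing one `openat`, two `read` calls, and one `close`, the vocabulary `["close", "openat", "read", "write"]` yields `[1, 1, 2, 0]`.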

Sequential features (for LSTM/BERT)—We tokenise the trace into syscall mnemonics drawn from the 251-term vocabulary. Sequences are right-padded with a PAD token and truncated to model limits. For LSTM, sequences are embedded and passed to stacked LSTM layers (see Table 7); for BERT, tokens are fed to the multilingual cased BERT tokeniser with max_seq_length = 512 and then fine-tuned as below. The choice of behavioural sequences (rather than static smali sequences) is deliberate: it captures runtime characteristics that persist across code mutations.
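Right-padding and truncation can be illustrated as follows. The token strings `[PAD]` and `[UNK]` and the integer encoding are assumptions for the sketch; in practice the BERT tokeniser performs its own padding and also reserves positions for special tokens.

```python
from typing import Dict, List, Sequence

PAD_TOKEN = "[PAD]"
UNK_TOKEN = "[UNK]"


def pad_or_truncate(
    tokens: Sequence[str],
    max_len: int = 512,
    pad_token: str = PAD_TOKEN,
) -> List[str]:
    # Keep at most max_len tokens, then right-pad up to max_len.
    clipped = list(tokens[:max_len])
    return clipped + [pad_token] * (max_len - len(clipped))


def encode(tokens: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    # Map each mnemonic to its integer id; unseen names map to UNK.
    return [vocab.get(token, vocab[UNK_TOKEN]) for token in tokens]
```

Truncating from the right keeps the start of each trace, which is a design choice of this sketch rather than something the text specifies.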

4.4.2. Model Architectures and Training

The machine-learning classifiers used in this study comprise Naïve Bayes (NB), a Long Short-Term Memory network (LSTM), and a transformer-based model (BERT). NB was implemented using the Scikit-learn library and serves as a lightweight, data-efficient baseline for both binary and multi-class classification, operating on the 251-dimensional bag-of-system-calls representation described earlier.

The LSTM classifier was implemented using the Keras library [60]. Its hyperparameters were selected following empirical tuning (Table 7). The Adam optimiser [59] was chosen due to its well-documented effectiveness in sequence-classification tasks. Models with one and two LSTM layers and batch sizes ranging from 10 to 500 were explored; the final configuration used two LSTM layers of 128 units each, a batch size of 50, and three epochs, as additional epochs led to rapid overfitting. Binary classification employs a sigmoid output with binary cross-entropy, whereas multi-class classification uses softmax activation with sparse categorical cross entropy.
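The two output configurations differ only in the final activation and loss. The plain-Python sketch below shows what each loss computes for a single sample; it is meant as an illustration of the definitions, not a substitute for the Keras implementations, and the clipping constant is an assumption added for numerical safety.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # clipping constant to avoid log(0)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(label: int, prob: float) -> float:
    # label is 0 (benign) or 1 (malicious); prob is the sigmoid output.
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def softmax(logits: Sequence[float]) -> List[float]:
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_cross_entropy(label: int, logits: Sequence[float]) -> float:
    # label is an integer class index (e.g. 0 = benign, 1-3 = families).
    return -math.log(max(softmax(logits)[label], EPS))
```

With four equally likely classes the multi-class loss equals ln 4, and an undecided binary output of 0.5 gives ln 2.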

The BERT model (Table 7) was implemented using the ktrain [61] interface to Keras. Input samples were loaded using texts_from_folder and preprocessed using the built-in “bert” pipeline. We fine-tuned Google’s multilingual cased BERT base model [62], selected because its subword vocabulary robustly supports arbitrary symbolic tokens (such as system calls) without requiring custom vocabulary training, and because no pretrained domain-specific syscall models are publicly available. The model was trained using the fit_onecycle policy for three epochs with a batch size of 50, which provided a balance between stability and computational efficiency while avoiding overfitting.

On BERT for system-call tokens—The syscall vocabulary is discrete and closed; each system call is mapped to a unique token before encoding. BERT’s tokeniser produces stable embeddings for these tokens, and its positional encoding scheme preserves the original ordering of system-call events, allowing execution flow and temporal structure to be captured effectively. The self-attention mechanism models long-range dependencies (e.g., call–response relations, contextual interactions, and control-flow transitions), which support strong classification performance even though the model was pretrained on natural-language corpora.

Collectively, these models provide complementary learning paradigms: NB offers fast, interpretable baselines; LSTM captures temporal patterns in system-call sequences; and BERT provides deep contextual representations through self-attention. This diversity enables a broad evaluation of how evolved malware variants interact with classifiers of differing complexity.

Explanation of Computational Characteristics

Table 8 outlines the fundamental computational trade-offs between the models employed. Naive Bayes operates with negligible $O(d \cdot c)$ complexity, serving as a high-efficiency baseline for static feature classification. The LSTM model introduces a linear $O(L)$ dependency on sequence length $L$, offering a balance between capturing temporal order and computational cost, which is managed by a compact architecture and limited epochs. In contrast, the BERT model achieves deep contextual understanding at a high $O(L^{2})$ cost due to its self-attention mechanism, making GPU acceleration and a low epoch count essential. The Evolutionary Algorithm occupies a distinct category, where its runtime is not defined by input scaling but by a fixed, empirical cost driven by the fitness evaluation time and search parameters. Its sequential runtime of approximately 55.5 h necessitates explicit parallelisation strategies (e.g., concurrent fitness evaluation) to become feasible, contrasting sharply with the input-dependent scaling of the other models. This selection of models, from lightweight classifiers to compute-intensive neural networks and optimisation algorithms, represents a diverse methodological approach designed to evaluate performance across the efficiency-effectiveness spectrum.
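
The scaling behaviour and the parallelisation argument can be illustrated with a short sketch. The function names, the placeholder fitness function and the worker counts are assumptions for illustration; only the complexity classes and the 55.5 h sequential figure come from the text, and the ideal speed-up ignores scheduling and I/O overheads.

```python
from concurrent.futures import ThreadPoolExecutor

SEQUENTIAL_EA_HOURS = 55.5  # sequential EA runtime reported in the text


def attention_cost_ratio(length, baseline_length):
    """Relative O(L^2) self-attention cost when L grows from the baseline."""
    return (length / baseline_length) ** 2


def recurrent_cost_ratio(length, baseline_length):
    """Relative O(L) recurrent cost when L grows from the baseline."""
    return length / baseline_length


def ideal_parallel_hours(workers):
    """Lower bound on wall-clock time with perfectly parallel evaluation."""
    return SEQUENTIAL_EA_HOURS / workers


def placeholder_fitness(candidate):
    """Hypothetical stand-in for an expensive fitness evaluation."""
    return sum(candidate) / len(candidate)


def evaluate_population(population, workers=4):
    """Evaluate all candidates concurrently, preserving population order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(placeholder_fitness, population))


print(attention_cost_ratio(512, 256))  # 4.0
print(recurrent_cost_ratio(512, 256))  # 2.0
print(ideal_parallel_hours(8))         # 6.9375
print(evaluate_population([[1, 2, 3], [4, 5, 6]]))
```

Threads are used here on the assumption that each fitness evaluation mostly waits on external tools; CPU-bound evaluations would need process-based workers instead.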

4.5. Machine Learning—Results and Analysis

4.5.1. Enhancing Metamorphic Malware Classification with Evolved Mutants

In this subsection, we evaluate how effectively different machine-learning models classify metamorphic malware when their training data are augmented with the variants produced by the EA described in this work and by the Quality Diversity EA described in [19].

We compare binary classifiers (benign versus malicious) with multi-class classifiers (benign plus the three malicious families: Dougalek, Droidkungfu and GGTracker) for both a sequential model (LSTM) and a non-sequential model (Naïve Bayes), to determine which achieves higher classification accuracy when the evolved data form part of the training set. The experiments in this section use the 6020combo and 6050combo datasets described in Section 4.1.2.

The results in Table 9 show consistent patterns across both datasets. For the Naïve Bayes model, binary classification achieves higher accuracy than the multi-class setting for both 6020combo and 6050combo. In contrast, the LSTM model performs better in the multi-class configuration than in the binary one. Across both datasets and both classification modes, Naïve Bayes outperforms the LSTM model. These results provide the empirical basis for the more detailed interpretation that follows.

Interpreting Table 9. Table 9 compares Naïve Bayes (NB) and LSTM. NB performs strongly in the binary setting because its inductive bias favours frequency-based behavioural distinctions that separate benign from malicious activity. LSTM, however, benefits in the multi-class setting, where temporal ordering helps differentiate families such as Droidkungfu and GGTracker. NB remains consistently robust under class overlap, whereas LSTM provides gains when sequential patterns carry a discriminative signal. The two models, therefore, exhibit complementary strengths shaped by their respective feature assumptions.

A closer inspection of the confusion matrices for both the 6020combo and 6050combo datasets provides further insight into model behaviour. Beginning with Figure 13, which summarises the 6020combo performance, the first pair of plots corresponds to the Naïve Bayes classifier. In the binary case, 23 benign samples are correctly identified, while four are misclassified; all malicious samples are detected correctly. This yields strong binary performance for Naïve Bayes, with high precision (0.85), perfect recall (1.0), and an F1-score of 0.92. In the multi-class setting, 23 benign samples are correctly recognised, whereas one is incorrectly assigned to Droidkungfu and three to GGTracker. The Dougalek family is classified perfectly, but only one Droidkungfu sample is correctly labelled; the remaining two are misclassified as Dougalek and GGTracker. Similarly, only one GGTracker instance is correctly recognised, with two samples misclassified as Dougalek and Droidkungfu. As a result, multi-class performance is mixed: while Dougalek achieves strong scores (Precision = 0.83, Recall = 1.0, F1 = 0.91), Droidkungfu and GGTracker obtain much lower F1-scores of 0.33 and 0.25.
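
The per-class scores quoted above follow directly from the confusion matrix. The sketch below recomputes them for the multi-class Naïve Bayes result on 6020combo. The matrix is reconstructed from the counts in the text, assuming that the two misclassified Droidkungfu samples and the two misclassified GGTracker samples are split one per wrong class, and that the test set contains ten Dougalek samples; Figure 13 remains the authoritative source.

```python
LABELS = ["benign", "Dougalek", "Droidkungfu", "GGTracker"]

# Rows are true classes, columns are predicted classes (same order as LABELS).
CONFUSION = [
    [23, 0, 1, 3],  # benign
    [0, 10, 0, 0],  # Dougalek
    [0, 1, 1, 1],   # Droidkungfu (assumed one-per-class split of errors)
    [0, 1, 1, 1],   # GGTracker (assumed one-per-class split of errors)
]


def per_class_scores(matrix, index):
    """Return (precision, recall, F1) for the class at the given index."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column total
    actual = sum(matrix[index])                    # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


for i, label in enumerate(LABELS):
    p, r, f1 = per_class_scores(CONFUSION, i)
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
print(f"accuracy={accuracy(CONFUSION):.2f}")
```

Under these assumptions the script reproduces the reported F1-scores for Dougalek (0.91), Droidkungfu (0.33) and GGTracker (0.25), together with an overall accuracy of 0.81.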

Turning to the LSTM model for the 6020combo, the binary classifier incorrectly labels all malicious instances as benign, producing 27 true negatives and 16 false negatives. Although recall for benign samples remains high, the overall F1-score falls to 0.77 due to the large number of misclassified malware samples. For multi-class classification, the LSTM correctly labels 23 benign samples, while two are misassigned to Dougalek and two to Droidkungfu. It correctly identifies nine Dougalek samples, with one misclassified as Droidkungfu. However, performance on the other families is weaker: only one Droidkungfu sample is correctly labelled, and none of the GGTracker instances are classified correctly (misclassified as benign, Dougalek, or Droidkungfu). Consequently, the LSTM model struggles considerably on multi-class tasks for Droidkungfu (F1 = 0.25) and GGTracker (F1 = 0.0).

The 6050combo results in Figure 14 follow similar trends. For Naïve Bayes in the binary configuration, 23 benign samples are correctly detected, and four are misclassified, while all malicious instances are correctly labelled. The binary performance mirrors that of the 6020combo dataset (Precision = 0.85, Recall = 1.0, F1 = 0.92). In the multi-class case, 23 benign samples are correctly labelled, with one benign instance misclassified as Droidkungfu and three as GGTracker. The Dougalek family is again recognised perfectly. For Droidkungfu, two samples are correctly identified, while three are misclassified as either Dougalek or GGTracker. GGTracker performance improves relative to 6020combo, with five correct classifications and three misclassifications, leading to an increased F1-score (0.59 versus 0.25). Still, Droidkungfu remains challenging (F1 = 0.50).

For the LSTM classifier on 6050combo, all 50 instances in the binary setting are labelled benign, resulting in 27 true negatives and 23 false negatives. This produces a lower F1-score of 0.70, reflecting the same tendency to produce false negatives observed on the smaller dataset. In the multi-class setting, 23 benign samples are correctly identified, while one is misclassified as Dougalek and three as GGTracker. Among malicious classes, nine Dougalek samples are correctly classified, with one misclassified as GGTracker. For Droidkungfu, two samples are correctly labelled, while three are distributed across Dougalek and GGTracker. For GGTracker, four instances are correctly classified, and four are misassigned as Dougalek. Although the multi-class LSTM displays improved precision for some classes, its recall remains weak for Droidkungfu and GGTracker, which limits their F1-scores to 0.57 and 0.44, respectively.
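
The binary LSTM figures illustrate why accuracy and a benign-class F1-score can look acceptable even when no malware is detected. The sketch below recomputes both for a classifier that labels every sample benign, using the test-set sizes given in the text (27 benign samples plus 16 or 23 malicious ones); the helper name is hypothetical.

```python
def all_benign_metrics(n_benign, n_malicious):
    """Metrics for a classifier that predicts 'benign' for every sample."""
    total = n_benign + n_malicious
    accuracy = n_benign / total
    benign_precision = n_benign / total  # every prediction is 'benign'
    benign_recall = 1.0                  # no benign sample is missed
    benign_f1 = (
        2 * benign_precision * benign_recall / (benign_precision + benign_recall)
    )
    malicious_recall = 0.0               # no malware is ever flagged
    return accuracy, benign_f1, malicious_recall


for name, benign, malicious in [("6020combo", 27, 16), ("6050combo", 27, 23)]:
    acc, f1, rec = all_benign_metrics(benign, malicious)
    print(f"{name}: accuracy={acc:.2f} benign F1={f1:.2f} malicious recall={rec:.2f}")
```

The values match the F1-scores of 0.77 and 0.70 quoted for the binary LSTM and its accuracies of 63% and 54%, which suggests that the reported F1-score is computed on the benign class.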

Overall, Figure 13 and Figure 14 show that benign samples are generally well classified across all models and datasets. Naïve Bayes consistently performs robustly in binary classification for both datasets, while LSTM struggles to separate benign from malicious samples in the binary setting. In contrast, multi-class classification provides a performance benefit for LSTM, allowing it to distinguish more effectively between benign instances and malicious families. The remaining misclassifications primarily arise from functional overlap between families; for example, both Dougalek and GGTracker involve personal-information theft, leading to similar system-call patterns that make discrimination more difficult.

The parallel coordinates visualisation in Figure 15 provides an analytical perspective on the difficulty binary classifiers face in distinguishing benign from malicious instances within the 6020combo and 6050combo datasets when using the LSTM model. This visualisation, designed for high-dimensional data exploration [63], illustrates the relationship between system call features over time. Despite the colour distinction between benign (blue) and malicious (red) samples, the plots exhibit substantial overlap across several dimensions (system call IDs on the x-axis and their time-ordered sequences on the y-axis). This overlap indicates that the temporal patterns of system calls for benign and malicious behaviours share similar structural characteristics, thereby reducing the model’s discriminative capability. Consequently, the visualisation supports the observed performance limitations of the LSTM classifier in separating the two classes.

4.5.2. Enhancing Malware Classification via Transfer Learning with BERT and the Evolved Mutants

This subsection addresses the research question: “Can a transformer model such as BERT, which has been pretrained on large-scale NLP corpora, be applied in a transfer learning setting to enhance the classification of metamorphic malware when the newly generated mutants are incorporated into the training data?” The corresponding results are presented in Table 10. The experiments compare the performance of Naïve Bayes, LSTM, and BERT across both the 6020combo and 6050combo datasets, and show that BERT yields the highest binary classification accuracy on 6020combo, although this advantage does not carry over to every setting.

The results in Table 10 highlight distinct performance trends across models and tasks, reflecting the interaction between model architecture and data characteristics. For the 6020combo dataset, the BERT model achieves the highest binary classification accuracy (93%), surpassing Naïve Bayes (91%) and significantly outperforming LSTM (63%). This suggests that BERT’s contextual representation of system call sequences captures discriminative patterns more effectively than the statistical assumptions of Naïve Bayes or the temporal dependencies modelled by LSTM. In contrast, for multi-class classification, Naïve Bayes attains the highest accuracy (81%), with BERT and LSTM performing comparably at 77%, indicating that the simpler probabilistic model generalises better when multiple behavioural classes are introduced.

For the 6050combo dataset, this pattern holds only partially: Naïve Bayes achieves the highest binary classification accuracy (92%), slightly above BERT (90%), while LSTM underperforms substantially (54%). In the multi-class setting, BERT’s accuracy drops to 50%, compared to 80% for Naïve Bayes and 76% for LSTM, suggesting that BERT’s contextual embeddings may be less effective when class boundaries become less distinct. Overall, these results indicate that while transformer-based models like BERT perform strongly in binary discrimination tasks due to their contextual encoding capabilities, traditional models such as Naïve Bayes remain more robust in multi-class malware classification, where feature overlap and class imbalance are more pronounced.

Effect of mutants on model generalisation—The inclusion of EA-generated mutants in the training data improves the robustness of all three models by exposing them to structurally and behaviourally varied malicious instances that do not appear in the original dataset. This is reflected in the performance patterns shown in Table 9 and Table 10: models trained on datasets augmented with mutants achieve higher binary detection accuracy and exhibit greater stability across folds than those trained only on original samples. The same trend persists in the multi-class setting, where the additional diversity helps distinguish similar families (e.g., Dougalek vs. GGTracker) by widening the behavioural and structural variation observed during training. These results indicate that the mutants act as an effective regulariser, mitigating overfitting to the original dataset and improving generalisation to unseen samples without requiring changes to the underlying architectures or optimisation procedures.

4.5.3. Superior Cost-Efficiency Through Transfer Learning

While prior research has demonstrated that ML techniques can achieve high accuracy (>90%) in detecting malicious variants [18,29,30], their effectiveness is often contingent upon large, labelled training datasets—a requirement that is prohibitively expensive and impractical for highly mutable malware. This work directly addresses this limitation as seen in Section 4.5.2. By leveraging transfer learning to adapt a pre-trained BERT model for metamorphic malware detection, the dependency on vast volumes of evolved variants for training is significantly reduced. This approach yields a substantial reduction in data acquisition and curation costs while maintaining a high detection rate, thereby offering a more scalable and cost-effective solution for real-world deployment.

Positioning vs. prior work. Prior studies often pursue either deep classifiers trained on large datasets or single-objective mutant generation focused on evasiveness. Our contribution combines evolutionary diversification (producing structurally and behaviourally varied, executable mutants validated by Droidbox) with transfer learning (multilingual BERT) to achieve strong performance with constrained training data. This diversity-first augmentation complements sequence-aware models and reduces data dependence, providing a practical route to improved robustness without assuming access to large, up-to-date corpora.

4.5.4. Statistical Reporting for Diversity and Classification

We report 95% confidence intervals for evasiveness in Table 3. For diversity analyses, we present descriptive summaries aligned with the underlying data: (i) detection-signature diversity via percentage of unique 63-bit signatures (Table 4) and t-SNE visualisation (Figure 9); (ii) behavioural dispersion using cosine-based distances and uniqueness rates (Table 5); and (iii) structural dispersion via pairwise $1 - SS$ heatmaps (Figure 10, Figure 11 and Figure 12). For machine-learning performance, we report overall accuracies and confusion matrices (Table 9 and Table 10, Figure 13 and Figure 14). Given the per-treatment sample sizes and heterogeneity across families and objectives, we emphasise effect-direction consistency rather than formal hypothesis testing for diversity and classification metrics.
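
As an illustration of the descriptive summaries listed above, the following sketch computes a cosine-based distance between two behaviour vectors, the percentage of unique detection signatures, and a structural dispersion value $1 - SS$. The input vectors, bit strings and similarity score are made-up examples, the helper names are hypothetical, and $SS$ is assumed to be a similarity score normalised to [0, 1]; the exact definitions behind Tables 4 and 5 are those given earlier in the paper.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def unique_percentage(signatures):
    """Percentage of distinct detection signatures among all mutants."""
    return 100.0 * len(set(signatures)) / len(signatures)


def structural_dispersion(similarity):
    """Dispersion as 1 - SS for a structural similarity SS in [0, 1]."""
    return 1.0 - similarity


# Made-up inputs for illustration only.
print(cosine_distance([3, 0, 1], [0, 2, 0]))
print(unique_percentage(["1010", "1010", "0110", "0001"]))
print(structural_dispersion(0.8))
```

Short bit strings stand in for the per-detector signatures here; longer signatures would be handled identically because only equality between signatures matters.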

5. Conclusions and Future Work

In this paper, we have presented a complete end-to-end framework for generating new adversarial variants from existing malware and incorporating them into the training of machine-learning models. The central objective was to produce a diverse collection of malicious mutants that evade existing detection mechanisms, with diversity assessed in terms of both structural and behavioural differences from the original malware samples. Using three malware families as representative case studies, we demonstrated that the evolved variants are substantially more evasive than their parent samples and exhibit a broad spectrum of behavioural patterns. The findings indicate that incentivising the evolutionary algorithm to produce mutants that exhibit structural or behavioural dissimilarity from the original malware inadvertently leads to the generation of evasive mutants. Thus, it appears that explicit evolution for evasiveness is not a prerequisite; rather, high-quality mutants that possess evasive characteristics can be derived indirectly through this approach.

Furthermore, our results demonstrate that the mutants generated by the EA can be successfully incorporated into training datasets to enhance the performance of machine-learning classifiers. By augmenting existing datasets with these evolved samples, we observed consistent improvements in classification accuracy. A comparison between binary and multi-class models indicates that the binary classifiers generally achieve higher accuracy when identifying the evolved mutants. In addition, the use of BERT, a transformer model pretrained on large-scale NLP corpora, further improves performance in some evaluation scenarios, highlighting the benefits of transfer learning for metamorphic-malware detection.

The proposed framework is sufficiently generic to support the generation of new adversarial variants across different malware classes. There remains considerable potential to refine the EA component, for example, through the design of new mutation and crossover operators that produce runnable APKs, parameter optimisation, increased population size and the development of a multi-objective framework that jointly optimises all fitness functions for enhanced robustness.

This study represents an initial step toward automated generation of structurally and behaviourally diverse malware variants for evaluating and strengthening detection methods. Several promising research directions emerge from the limitations inherent to this first exploration.

A natural extension is the integration of co-evolutionary or quality-diversity frameworks, in which the evolutionary generator and the detector co-adapt over time. Such formulations, inspired by adversarial dynamics in generative models, would allow improvements in generated mutants to drive corresponding improvements in detection strategies, exposing the arms-race characteristics of real-world malware evolution while still preserving the executability and behavioural-integrity constraints central to this work.

Future work may also explore specialised deep learning architectures designed for code and program-structure analysis, such as Graph Neural Networks operating over control-flow or call-graph representations. These models could capture structural relationships beyond those reflected in syscall frequencies or smali-level patterns, enabling a richer characterisation of evasion behaviours and a more rigorous assessment of the limits of static or dynamic feature-based classifiers.

Although the proposed pipeline is conceptually platform-agnostic, comprising disassembly to an intermediate representation, mutation, executability and maliciousness checks, and behavioural evaluation, our empirical validation covers only three Android malware families at the smali level. Adapting the framework to other platforms (e.g., PE or ELF binaries) will require selecting an appropriate intermediate representation, establishing a platform-specific build/sign/deploy toolchain to preserve executability, and implementing a behaviour-capture mechanism comparable to syscall tracing. Demonstrating robust performance across additional families and platforms remains a priority for assessing the generality of the approach. Consistent with this, we make no claims of temporal prediction or family-level generalisation; empirically validating such generalisation forms an important direction for future investigation.

Finally, advancing the real-world applicability of the approach requires examining the long-term evolution of mutants across multiple generations, as well as measuring how detectors adapt to repeated exposure to diversity-augmented training data. Such studies would help quantify longer-horizon evasion behaviours and inform the development of more resilient detection systems capable of withstanding evolving, adaptive malware.

Author Contributions

Conceptualisation, K.O.B. and Z.T.; methodology, K.O.B. and Z.T.; software, K.O.B.; data curation, K.O.B.; investigation, K.O.B.; validation, K.O.B.; writing—original draft preparation, K.O.B.; writing—review & editing, Z.T.; supervision, Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Some of the datasets used in this work can be found at https://github.com/KehindeOloye/Improving-Classification-of-Metamorphic-Malware.

Conflicts of Interest

The authors declare no conflicts of interest.

References

CrowdStrike. CrowdStrike 2024 Global Threat Report. Available online: https://www.crowdstrike.com/en-us/resources/reports/crowdstrike-2024-global-threat-report/ (accessed on 15 September 2024).
SOPHOS. Sophos 2024 Threat Report: Cyberthreats to Small Businesses Are Expanding Beyond Ransomware. Here’s What You Need to Know. Available online: https://assets.sophos.com/X24WTUEQ/at/wwf5phjtj9bjvmpqqsbfxc/sophos-2024-threat-report.pdf (accessed on 15 September 2024).
SOPHOS. Sophos 2022 Threat Report: Gravitational Force of Ransomware Black Hole Pulls in Other Cyberthreats to Create One Massive, Interconnected Ransomware Delivery System. Available online: https://www.sophos.com/en-us/press/press-releases/2021/11/sophos-2022-threat-report#:~:text=Sophos%2C%20a%20global%20leader%20in,significant%20implications%20for%20IT%20security (accessed on 25 May 2022).
Brezinski, K.; Ferens, K. Metamorphic Malware and Obfuscation: A Survey of Techniques, Variants, and Generation Kits. Secur. Commun. Netw. 2023, 2023, 8227751. [Google Scholar] [CrossRef]
Zuo, Z.H.; Zhu, Q.X.; Zhou, M.T. On the time complexity of computer viruses. IEEE Trans. Inf. Theory 2005, 51, 2962–2966. [Google Scholar] [CrossRef]
F-Secure. 2014 Mobile Threat Report. Available online: https://www.infopoint-security.de/medien/f_secure_mobile_threat_report_q1_2014_print_version.pdf (accessed on 19 July 2019).
Maiorca, D.; Ariu, D.; Corona, I.; Aresu, M.; Giacinto, G. Stealth attacks: An extended insight into the obfuscation effects on Android malware. Comput. Secur. 2015, 51, 16–31. [Google Scholar] [CrossRef]
Hasan, R.; Biswas, B.; Samiun, M.; Saleh, M.A.; Prabha, M.; Akter, J.; Joya, F.H.; Abdullah, M. Enhancing malware detection with feature selection and scaling techniques using machine learning models. Sci. Rep. 2025, 15, 9122. [Google Scholar] [CrossRef]
Hawana, A.; Hassan, E.S.; El-Shafai, W.; El-Dolil, S.A. Enhancing malware detection with deep learning convolutional neural networks: Investigating the impact of image size variations. Secur. Priv. 2025, 8, e70000. [Google Scholar] [CrossRef]
Roy, S.; Bhanja, S.; Das, A. AndyWar: An intelligent android malware detection using machine learning. Innov. Syst. Softw. Eng. 2025, 21, 303–311. [Google Scholar] [CrossRef]
Alomar, A.; AlJarullah, A.; Abu-Ghazalah, S. Permissions-based Android malware detection using machine learning. Neural Comput. Appl. 2025, 37, 5255–5270. [Google Scholar] [CrossRef]
Wasif, M.S.; Miah, M.P.; Hossain, M.S.; Alenazi, M.J.; Atiquzzaman, M. CNN-ViT synergy: An efficient Android malware detection approach through deep learning. Comput. Electr. Eng. 2025, 123, 110039. [Google Scholar] [CrossRef]
Lunghi, D.; Simitsis, A.; Caelen, O.; Bontempi, G. Adversarial Learning in Real-World Fraud Detection: Challenges and Perspectives. In DEC ’23: Proceedings of the Second ACM Data Economy Workshop; Association for Computing Machinery: New York, NY, USA, 2023; pp. 27–33. [Google Scholar] [CrossRef]
Eiben, A.E.; Smith, J.E. What is an Evolutionary Algorithm? In Introduction to Evolutionary Computing; Springer: Berlin/Heidelberg, Germany, 2003; pp. 15–35. [Google Scholar] [CrossRef]
Babaagba, K.O.; Tan, Z.; Hart, E. Nowhere Metamorphic Malware Can Hide—A Biological Evolution Inspired Detection Scheme. In Proceedings of the Dependability in Sensor, Cloud, and Big Data Systems and Applications; Wang, G., Bhuiyan, M.Z.A., De Capitani di Vimercati, S., Ren, Y., Eds.; Springer: Singapore, 2019; pp. 369–382. [Google Scholar] [CrossRef]
Babaagba, K.O.; Tan, Z.; Hart, E. Improving Classification of Metamorphic Malware by Augmenting Training Data with a Diverse Set of Evolved Mutant Samples. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC); IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar] [CrossRef]
Babaagba, K.O. Application of Evolutionary Machine Learning in Metamorphic Malware Analysis and Detection. Ph.D. Thesis, Edinburgh Napier University, Edinburgh, UK, 2022. [Google Scholar]
Habib, F.; Shirazi, S.H.; Aurangzeb, K.; Khan, A.; Bhushan, B.; Alhussein, M. Deep Neural Networks for Enhanced Security: Detecting Metamorphic Malware in IoT Devices. IEEE Access 2024, 12, 48570–48582. [Google Scholar] [CrossRef]
Babaagba, K.O.; Tan, Z.; Hart, E. Automatic Generation of Adversarial Metamorphic Malware Using MAP-Elites. In Proceedings of the Applications of Evolutionary Computation: 23rd European Conference, EvoApplications 2020, Held as Part of EvoStar 2020, Seville, Spain, 15–17 April 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 117–132. [Google Scholar] [CrossRef]
Zhan, D.; Liu, X.; Bai, W.; Li, W.; Guo, S.; Pan, Z. GAME-RL: Generating Adversarial Malware Examples Against API Call Based Detection via Reinforcement Learning. IEEE Trans. Dependable Secur. Comput. 2025, 22, 5431–5447. [Google Scholar] [CrossRef]
Manju; Rana, C. Application of Deep Reinforcement Learning in Adversarial Malware Detection. In Deep Reinforcement Learning and Its Industrial Use Cases; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2024; Chapter 5; pp. 91–113. [Google Scholar] [CrossRef]
Lee, J.; Austin, T.H.; Stamp, M. Compression-based analysis of metamorphic malware. Int. J. Secur. Netw. 2015, 10, 124–136. [Google Scholar] [CrossRef]
Charoenthanakitkul, A.; Viboonsang, P.; Kosolsombat, S. Optimizing Malware Detection with Random Forest, XGBoost, LightGBM, and LLM-Reporting. In Proceedings of the 2025 IEEE International Conference on Cybernetics and Innovations (ICCI); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
Choudhary, S.P.; Vidyarthi, M.D. A Simple Method for Detection of Metamorphic Malware using Dynamic Analysis and Text Mining. Procedia Comput. Sci. 2015, 54, 265–270. [Google Scholar] [CrossRef]
Baysa, D.; Low, R.M.; Stamp, M. Structural entropy and metamorphic malware. J. Comput. Virol. 2013, 9, 179–192. [Google Scholar] [CrossRef]
Armoun, S.E.; Hashemi, S. A General Paradigm for Normalizing Metamorphic Malwares. In Proceedings of the 2012 10th International Conference on Frontiers of Information Technology; IEEE: Piscataway, NJ, USA, 2012; pp. 348–353. [Google Scholar] [CrossRef]
Zheng, M.; Lee, P.P.C.; Lui, J.C.S. ADAM: An Automatic and Extensible Platform to Stress Test Android Anti-virus Systems. In Proceedings of the Detection of Intrusions and Malware, and Vulnerability Assessment; Flegel, U., Markatos, E., Robertson, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 82–101. [Google Scholar] [CrossRef]
Rastogi, V.; Chen, Y.; Jiang, X. DroidChameleon: Evaluating Android Anti-malware Against Transformation Attacks. In ASIA CCS ’13: Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2013; pp. 329–334. [Google Scholar] [CrossRef]
Nawaz, M.S.; Fournier-Viger, P.; Nawaz, M.Z.; Chen, G.; Wu, Y. Metamorphic Malware Behavior Analysis Using Sequential Pattern Mining. In Proceedings of the Machine Learning and Principles and Practice of Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2021; pp. 90–103. [Google Scholar] [CrossRef]
Jha, A.K.; Vaish, A.; Patil, S. A Novel Framework for Metamorphic Malware Detection. SN Comput. Sci. 2022, 4, 10. [Google Scholar] [CrossRef]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Li, Z.; Wang, P.; Wang, Z. FlowGANAnomaly: Flow-Based Anomaly Network Intrusion Detection with Adversarial Learning. Chin. J. Electron. 2024, 33, 58–71. [Google Scholar] [CrossRef]
Aydogan, E.; Sen, S. Automatic Generation of Mobile Malwares Using Genetic Programming. In Applications of Evolutionary Computation; Springer: Cham, Switzerland, 2015; pp. 745–756. [Google Scholar] [CrossRef]
Xu, W.; Qi, Y.; Evans, D. Automatically Evading Classifiers—A Case Study on PDF Malware Classifier. NDSS 2016, 2016, 1–15. [Google Scholar] [CrossRef]
Javaheri, D.; Lalbakhsh, P.; Hosseinzadeh, M. A Novel Method for Detecting Future Generations of Targeted and Metamorphic Malware Based on Genetic Algorithm. IEEE Access 2021, 9, 69951–69970. [Google Scholar] [CrossRef]
Bala, Z.; Zambuk, F.U.; Ya’u Imam, B.; Ya’u Gital, A.; Shittu, F.; Aliyu, M.; Abdulrahman, M.L. Transfer Learning Approach for Malware Images Classification on Android Devices Using Deep Convolutional Neural Network. Procedia Comput. Sci. 2022, 212, 429–440. [Google Scholar] [CrossRef]
Raza, A.; Qaisar, Z.H.; Aslam, N.; Faheem, M.; Ashraf, M.W.; Chaudhry, M.N. TL-GNN: Android Malware Detection Using Transfer Learning. Appl. AI Lett. 2024, 5, e94. [Google Scholar] [CrossRef]
APKTOOL. APKTOOL. Available online: http://ibotpeaches.github.io/Apktool (accessed on 26 February 2019).
NTCore. CFF Explorer. Available online: https://ntcore.com/?page_id=388 (accessed on 10 June 2021).
NTCore. Structure of a Smali. Available online: https://pysmali.readthedocs.io/en/latest/api/smali/language.html (accessed on 3 June 2020).
The Honeynet Project. Droidbox. Available online: https://github.com/pjlantz/droidbox (accessed on 19 February 2019).
Bakurov, I.; Murphy, A.; Ofria, C.; Banzhaf, W. A comparison of tournament and lexicase selection paradigms in regression problems: Error-based fitness versus correlation fitness. In GECCO ’25: Proceedings of the Proceedings of the Genetic and Evolutionary Computation Conference; Association for Computing Machinery: New York, NY, USA, 2025; pp. 970–979. [Google Scholar] [CrossRef]
VTDOC. VirusTotal. Available online: https://developers.virustotal.com/reference#getting-started (accessed on 10 October 2023).
Heres, D. Source Code Plagiarism Detection using Machine Learning. Ph.D. Thesis, Utrecht University, Utrecht, The Netherlands, 2017. [Google Scholar]
Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
García-Teodoro, P.; Gómez-Hernández, J.; Abellán-Galera, A. Multi-labeling of complex, multi-behavioral malware samples. Comput. Secur. 2022, 121, 102845. [Google Scholar] [CrossRef]
Zhou, Y.; Jiang, X. Android Malware Genome Project. Available online: http://www.malgenomeproject.org/ (accessed on 19 July 2019).
Zhou, Y.; Jiang, X. Dissecting Android Malware: Characterization and Evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy; IEEE: Piscataway, NJ, USA, 2012; pp. 95–109. [Google Scholar] [CrossRef]
F-Secure. Trojan: Android/DroidKungFu.C. Available online: https://www.f-secure.com/v-descs/trojan-android-droidkungfu-c.shtml (accessed on 19 July 2019).
F-Secure. Trojan: Android/GGTracker.A. Available online: https://www.f-secure.com/v-descs/trojan_android_ggtracker.shtml (accessed on 19 July 2019).
TRENDMICRO. ANDROIDOS_DOUGALEK.A. Available online: https://www.trendmicro.com/vinfo/us/threat-encyclopedia/malware/androidos_dougalek.a (accessed on 19 July 2019).
Babaagba, K.O.; Tan, Z.; Hart, E. Improving Classification of Metamorphic Malware. Available online: https://github.com/KehindeOloye/Improving-Classification-of-Metamorphic-Malware.git (accessed on 3 February 2020).
Linux.die.net. Strace(1)-Linux Man Page. Available online: https://linux.die.net/man/1/strace (accessed on 12 April 2023).
Developers. UI/Application Exerciser Monkey. Available online: https://developer.android.com/studio/test/monkey (accessed on 12 April 2023).
Arockiya Jerson, J.; Preethi, N. An Analysis of Levenshtein Distance Using Dynamic Programming Method. In Proceedings of the 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications; Gunjan, V.K., Zurada, J.M., Eds.; Springer: Singapore, 2023; pp. 525–532. [Google Scholar] [CrossRef]
Dhakal, A.; Poudel, A.; Pandey, S.; Gaire, S.; Baral, H.P. Exploring Deep Learning in Semantic Question Matching. In Proceedings of the 2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS); IEEE: Piscataway, NJ, USA, 2018; pp. 86–91. [Google Scholar] [CrossRef]
Ragkhitwetsagul, C.; Krinke, J.; Clark, D. A comparison of code similarity analysers. Empir. Softw. Eng. 2018, 23, 2464–2519. [Google Scholar] [CrossRef]
Gove, R.; Cadalzo, L.; Leiby, N.; Singer, J.M.; Zaitzeff, A. New guidance for using t-SNE: Alternative defaults, hyperparameter selection automation, and comparative evaluation. Vis. Inform. 2022, 6, 87–97. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–13. [Google Scholar]
Keras_Team. Keras. Available online: https://github.com/keras-team/keras (accessed on 21 August 2022).
Arun S. Maiya. Ktrain. Available online: https://github.com/amaiya/ktrain (accessed on 19 June 2024).
Jacob Devlin. Google’s Multi-lingual Bert Model. Available online: https://github.com/google-research/bert/blob/master/multilingual.md (accessed on 11 March 2020).
Firat, E.E.; Swallow, B.; Laramee, R.S. PCP-Ed: Parallel coordinate plots for ensemble data. Vis. Inform. 2023, 7, 56–65.

Figure 1. Adversarial learning framework.

Figure 2. The Mutation Engine component driven by an Evolutionary Algorithm (EA).

Figure 3. The percentage of detectors that failed to identify the evolved variants across the x malicious runs for each fitness function in the Dougalek family, with the red line indicating the detection-failure rate of the original malware sample.

Figure 4. The percentage of detectors that failed to identify the evolved variants across the x malicious runs for each fitness function in the Droidkungfu family, with the red line indicating the detection-failure rate of the original malware sample.

Figure 5. The percentage of detectors that failed to identify the evolved variants across the x malicious runs for each fitness function in the GGtracker family, with the red line indicating the detection-failure rate of the original malware sample.

Figure 6. Frequency f with which each detector d fails to detect the malware ($0 < f < 10$) under fitness functions DR(x), BS(x) and SS(x) for the Dougalek family.

Figure 7. Frequency f with which each detector d fails to detect the malware ($0 < f < 10$) under fitness functions DR(x), BS(x) and SS(x) for the Droidkungfu family.

Figure 8. Frequency f with which each detector d fails to detect the malware ($0 < f < 10$) under fitness functions DR(x), BS(x) and SS(x) for the GGtracker family.

Figure 9. t-SNE visualisation of evolved mutants, coloured according to the fitness function used to generate each variant.

Figure 10. Structural-diversity analysis for the Dougalek family across DR(x), BS(x), and SS(x). (a) Structural diversity of Dougalek variants under DR(x); (b) Structural diversity of Dougalek variants under BS(x); (c) Structural diversity of Dougalek variants under SS(x).

Figure 11. Structural-diversity analysis for the Droidkungfu family across DR(x), BS(x), and SS(x). (a) Structural diversity of Droidkungfu variants under DR(x); (b) Structural diversity of Droidkungfu variants under BS(x); (c) Structural diversity of Droidkungfu variants under SS(x).

Figure 12. Structural-diversity analysis for the GGtracker family across DR(x), BS(x), and SS(x). (a) Structural diversity of GGtracker variants under DR(x); (b) Structural diversity of GGtracker variants under BS(x); (c) Structural diversity of GGtracker variants under SS(x).

Figure 13. Confusion matrices for the 6020combo dataset using Naïve Bayes and LSTM models under both binary and multi-class settings. (a) Confusion matrix for the 6020combo using Naïve Bayes under binary setting; (b) Confusion matrix for the 6020combo using Naïve Bayes under multi-class setting; (c) Confusion matrix for the 6020combo using LSTM under binary setting; (d) Confusion matrix for the 6020combo using LSTM under multi-class setting.

Figure 14. Confusion matrices for the 6050combo dataset using Naïve Bayes and LSTM models under both binary and multi-class configurations. (a) Confusion matrix for the 6050combo using Naïve Bayes under binary setting; (b) Confusion matrix for the 6050combo using Naïve Bayes under multi-class setting; (c) Confusion matrix for the 6050combo using LSTM under binary setting; (d) Confusion matrix for the 6050combo using LSTM under multi-class setting.

Figure 15. Parallel-coordinates plots for the 6020combo and 6050combo datasets in the binary setting. (a) Parallel-coordinates visualisation for the 6020combo dataset under binary classification. (b) Parallel-coordinates visualisation for the 6050combo dataset under binary classification.

Table 1. Parameter settings for the Evolutionary Algorithm.

Parameters	Values
Selection	Tournament Selection, k = 5
Population Size	20
Iterations	100
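
The tournament selection listed in Table 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the `maximise` flag, and the toy population and scores are hypothetical, and whether fitness is maximised or minimised depends on the fitness function in use.

```python
import random


def tournament_select(population, fitness, k=5, maximise=True, rng=random):
    """Draw k individuals at random (without replacement) and return the best one."""
    contenders = rng.sample(population, k)
    chooser = max if maximise else min
    return chooser(contenders, key=fitness)


# Hypothetical stand-in for a population of 20 variants with toy fitness scores.
rng = random.Random(0)
population = [f"variant_{i}" for i in range(20)]
scores = {name: rng.random() for name in population}

# One parent is chosen per call; k = 5 matches the tournament size in Table 1.
parent = tournament_select(population, scores.get, k=5, rng=rng)
```

A larger k increases selection pressure; with k equal to the population size the tournament always returns the current best individual.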

Table 2. Count of malicious variants obtained across 10 EA runs for each of the three fitness functions.

Fitness Function	Dougalek	Droidkungfu	GGtracker
DR(x)	7	10	9
BS(x)	7	9	8
SS(x)	10	9	7

Table 3. Data summary and 95% Confidence Intervals for the evasiveness data for the three fitness metrics (DR(x), BS(x) and SS(x)) across all malware families—Dougalek, Droidkungfu and GGtracker.

Family	Metric	Mean	Std Dev	95% CI Lower	95% CI Upper
Dougalek	DR(x)	0.697	0.012	0.688	0.705
	BS(x)	0.624	0.043	0.589	0.659
	SS(x)	0.663	0.011	0.655	0.670
	RS(x)	0.415	0.018	0.412	0.418
Droidkungfu	DR(x)	0.880	0.024	0.862	0.897
	BS(x)	0.729	0.065	0.680	0.778
	SS(x)	0.810	0.009	0.803	0.817
	RS(x)	0.366	0.017	0.363	0.369
GGtracker	DR(x)	0.681	0.028	0.661	0.701
	BS(x)	0.610	0.032	0.585	0.635
	SS(x)	0.616	0.034	0.593	0.639
	RS(x)	0.384	0.036	0.378	0.391
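
As a rough check on how intervals like those in Table 3 can arise, the sketch below computes a Student-t 95% confidence interval from a mean and standard deviation. It assumes n = 10 observations (one per EA run, following the caption of Table 2) and a t-based interval; neither assumption is confirmed by this section, and the function name is illustrative.

```python
import math

# Two-sided 95% Student-t critical value for 9 degrees of freedom (n = 10).
T_CRIT_95_DF9 = 2.262


def t_confidence_interval(mean, std_dev, n, t_crit):
    """Return (lower, upper) bounds of mean +/- t_crit * std_dev / sqrt(n)."""
    half_width = t_crit * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# Dougalek, DR(x) row of Table 3: mean 0.697, standard deviation 0.012.
lower, upper = t_confidence_interval(0.697, 0.012, n=10, t_crit=T_CRIT_95_DF9)
print(f"{lower:.4f} to {upper:.4f}")
```

Under these assumptions the bounds fall within rounding distance of the reported 0.688 and 0.705; small differences are expected because the table's mean and standard deviation are themselves rounded. The RS(x) rows have much narrower intervals relative to their standard deviations, so they presumably rest on a larger sample, which this sketch does not attempt to reproduce.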

Table 4. Percentage of evolved malicious variants with a unique detection signature, grouped by fitness function and malware family.

Fitness Function	Dougalek	Droidkungfu	GGtracker
Detection	43	90	78
Behavioural Similarity	71	89	50
Structural Similarity	50	33	29
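
One plausible way to compute a percentage like those in Table 4 is to count the distinct signatures among the variants. The sketch below assumes that a detection signature is the set of detectors that missed a variant and that "unique" means distinct signatures divided by the number of variants; both are assumptions, and the detector names and data are hypothetical.

```python
def percent_unique(signatures):
    """Percentage of distinct signatures among the variants, rounded to an integer."""
    if not signatures:
        raise ValueError("need at least one signature")
    return round(100 * len(set(signatures)) / len(signatures))


# Hypothetical data: each signature is the set of detectors that missed one variant.
sig_a = frozenset({"av_1", "av_2"})
sig_b = frozenset({"av_1", "av_3"})
sig_c = frozenset({"av_2", "av_3", "av_4"})
signatures = [sig_a, sig_a, sig_a, sig_b, sig_b, sig_c, sig_c]

print(percent_unique(signatures))  # 3 distinct signatures among 7 variants -> 43
```

Using `frozenset` makes each signature hashable and independent of detector order, so two variants missed by the same detectors count as the same signature.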

Table 5. Percentage of evolved malicious variants with a unique behavioural signature, grouped by fitness function and malware family.

Fitness Function	Dougalek	Droidkungfu	GGtracker
Detection	100	70	89
Behavioural Similarity	100	78	78
Structural Similarity	80	75	100

Table 6. Evolution Benefits Matrix by family and objective. Evasiveness reported as mean $(1 - DR)$ from Table 3; uniqueness from Table 4 and Table 5; structural diversity summarised qualitatively from Figure 10, Figure 11 and Figure 12.

Family–Objective	Evasiveness (Mean)	% Unique Detection	% Unique Behaviour	Structural Diversity
Dougalek–DR	0.303	43	100	Low–Moderate
Dougalek–BS	0.376	71	100	Moderate
Dougalek–SS	0.337	50	80	High
DroidKungFu–DR	0.120	90	70	Moderate
DroidKungFu–BS	0.271	89	78	High
DroidKungFu–SS	0.190	33	75	Moderate–High
GGTracker–DR	0.319	78	89	Moderate
GGTracker–BS	0.390	50	78	Moderate
GGTracker–SS	0.384	29	100	High
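
The evasiveness column of Table 6 can be re-derived from the means in Table 3 as one minus the mean. The short check below does this for all nine family and objective pairs; the dictionary layout and names are illustrative, and the values are copied from the two tables.

```python
# Mean values from Table 3, keyed by (family, objective).
table3_mean = {
    ("Dougalek", "DR"): 0.697, ("Dougalek", "BS"): 0.624, ("Dougalek", "SS"): 0.663,
    ("Droidkungfu", "DR"): 0.880, ("Droidkungfu", "BS"): 0.729, ("Droidkungfu", "SS"): 0.810,
    ("GGtracker", "DR"): 0.681, ("GGtracker", "BS"): 0.610, ("GGtracker", "SS"): 0.616,
}

# Evasiveness (mean) values from Table 6, same keys.
table6_evasiveness = {
    ("Dougalek", "DR"): 0.303, ("Dougalek", "BS"): 0.376, ("Dougalek", "SS"): 0.337,
    ("Droidkungfu", "DR"): 0.120, ("Droidkungfu", "BS"): 0.271, ("Droidkungfu", "SS"): 0.190,
    ("GGtracker", "DR"): 0.319, ("GGtracker", "BS"): 0.390, ("GGtracker", "SS"): 0.384,
}


def evasiveness(mean_value):
    """Evasiveness as defined in the Table 6 caption: one minus the Table 3 mean."""
    return 1.0 - mean_value


# Collect any pair whose recomputed value differs from the tabulated one.
mismatches = [
    key for key, mean_value in table3_mean.items()
    if abs(evasiveness(mean_value) - table6_evasiveness[key]) > 1e-9
]
print(mismatches)  # an empty list means the two tables agree
```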

Table 7. Hyper-parameter settings for the LSTM and BERT models.

Hyper-Parameter	Value
LSTM Model
Optimiser	Adam [59]
LSTM Layers	2
Neurons per LSTM Layer	128
Batch Size	50
Epochs	3
Binary Classification
Loss Function	Binary Cross Entropy
Output Layer Activation	Sigmoid
Multi-class Classification
Loss Function	Sparse Categorical Cross Entropy
Output Layer Activation	Softmax
BERT Model (via ktrain)
Framework	ktrain (interface to Keras)
Pre-trained Model	Multilingual BERT, Cased
Batch Size	50
Epochs	3
Training Method	fit_onecycle
Data Loading	texts_from_folder
Preprocessing	BERT Tokenizer
Classifier Wrapper	text_classifier

Table 8. Theoretical Runtime Complexity and Practical Implementation Summary of Models.

Model	Primary Use Case	Key Implementation Parameters	Theoretical Training Complexity	Practical Runtime Factor
Naïve Bayes	Non-sequential classification	Features: 251 (system calls). Classes: 2 or 4.	$O(d \cdot c)$	Feature count ($d = 251$). Extremely fast.
LSTM	Sequential classification	Layers: 2. Neurons/Layer: 128. Batch: 50. Seq. Length: L. Epochs: 3.	$O(L)$ per sequence	Sequence length (L) and architecture. Moderate.
BERT (Multilingual)	Deep contextual classification	Base Model: ∼110 M params. Batch: 50. Seq. Length: ≤512. Epochs: 3.	$O(L^{2})$ per layer	Sequence length (L) squared. Very high (GPU).
Evolutionary Algorithm (EA)	Hyperparameter/Feature Optimisation	Population: 20. Generations: 100. Runs: 10. Eval Time: 4 min.	∼55.5 h (sequential)	Fitness Eval. Time & Population. Massive, requires parallelisation.

Table 9. Comparison of classification accuracy for the 6020combo and 6050combo datasets using Naïve Bayes and LSTM models under both binary and multi-class configurations.

Model	6020combo Binary	6020combo Multiclass	6050combo Binary	6050combo Multiclass
NB	0.91	0.81	0.92	0.80
LSTM	0.63	0.77	0.54	0.76

Table 10. Comparison of test-set classification accuracy for the 6020combo and 6050combo datasets using Naïve Bayes, LSTM, and BERT models under both binary and multi-class settings.

Model	6020combo Binary	6020combo Multiclass	6050combo Binary	6050combo Multiclass
NB	0.91	0.81	0.92	0.80
LSTM	0.63	0.77	0.54	0.76
BERT	0.93	0.77	0.9	0.5

Share and Cite

MDPI and ACS Style

Babaagba, K.O.; Tan, Z. Mitigating Metamorphic Malware Through Adversarial Learning Techniques. Network 2026, 6, 22. https://doi.org/10.3390/network6020022

