1. Introduction
Android malware continues to outpace signature-based defenses, with families evolving through obfuscation, polymorphism, and rapid variant creation. Although machine learning (ML) and deep learning (DL) pipelines have advanced detection accuracy on benchmark datasets, their effectiveness depends heavily on timely, high-quality labels and balanced class distributions, conditions rarely achieved in operational settings. Data scarcity is particularly acute for specific malware families and behaviors, limiting classifier robustness and hindering reproducibility across studies [1,2].
Recent work has explored different ways to mitigate these issues, yet there remains no empirical evaluation of how synthetically generated Android malware, produced specifically by LLMs, influences detection performance [3]. This research addresses that gap by generating structured malware records with a fine-tuned LLM and systematically assessing their utility in training and augmenting ML classifiers. Using the KronoDroid dataset for its hybrid static–dynamic feature space and balanced malicious/benign splits, we fine-tune GPT-4.1-mini to emulate the characteristics of three families, BankBot, Locker/SLocker, and Airpush/StopSMS; resolve generation pathologies through prompt engineering and post hoc filtering; and evaluate the downstream impact across five classifiers under three scenarios: real-only training, real-plus-synthetic augmentation, and synthetic-only generalization.
This paper provides the following contributions:
Fine-tuning pipeline for structured malware feature vectors: A reproducible workflow to fine-tune an LLM to generate fixed schema Android malware records, including schema sanitization, prompt constraints, and post-processing to remove malformed outputs.
Evaluation protocol for synthetic utility in malware detection: A three-scenario benchmarking design (real-only, real and synthetic, synthetic-to-real generalization) with leakage checks, consistent pre-processing, and bootstrap confidence intervals.
Family-level evidence across three Android malware families and five classifiers: Quantitative results showing (i) augmentation preserves near baseline detection, and (ii) synthetic-only training is unreliable and family dependent.
Sensitivity evidence on fine-tuning depth and richer data: The Airpush/StopSMS configuration (where we used more fine-tuning samples and more epochs) narrows the synthetic-to-real gap compared to smaller fine-tuning regimes.
The rest of the paper is structured as follows: Section 2 contains the related works, Section 3 delivers the methodology, Section 4 presents the experimental results, and Section 5 contains the conclusions and future work.
3. Methodology
3.1. Dataset Selection
As we have discussed above, the quality of the original data is crucial. Therefore, it is essential to analyze and select the right dataset with Android malware samples to use for further synthetic data generation and accuracy evaluations.
Elnashar, White, and Schmidt examined prompt styles and output formats for GPT-4o across different file formats such as JSON, YAML, and Hybrid CSV/Prefix. While many use JSON and YAML because they are structured and readable, their research investigated Hybrid CSV/Prefix, which combines CSV with prefixed identifiers, as a less resource-intensive alternative. They found that CSV/Prefix, a fixed-schema, row-per-sample format, delivers the best results for synthetic data generation tasks with large language models and is the most efficient in terms of token usage and processing time [23]. Based on this, we prioritize datasets that offer row-per-sample organization.
Because Shaukat, Luo, and Varadharajan showed improved detection rates with a hybrid approach, we prioritize a dataset that has both static and dynamic features [16]. Furthermore, even though this research aims to be a proof of concept, we keep scalability in mind and therefore aim for a well-balanced dataset, so that the approach discussed here could be applied to the whole dataset. Finally, dataset features should be human-interpretable to provide a better understanding of the underlying data.
The Drebin dataset is often considered the baseline for Android malware datasets. However, it is not well balanced, with only 5560 of the 123,453 analyzed apps being malware, and it only uses static features [24]. MalGenome contributed early behavioral insights but is malware-only, containing 1260 pieces of malware from 49 families [25]. AndroZoo provides the largest number of analyzed APKs, making it ideal as a source repository; however, it is not a feature table ready for use out of the box, and it lacks uniform labels and dynamic traces [26].
AndroDex explicitly targets obfuscation, releasing 24,746 samples by converting DEX bytecode into images. Although it is robust for some use cases, there is no tabular feature matrix [27].
Two newer datasets explicitly target reproducible, human-readable features. MaDroid offers 50,429 labeled system call sequences collected over 14 years across 10 Android marketplaces and is maliciousness-aware, using the VirusTotal rating as an attribute. It has a healthy class balance, with 24,789 benign and 25,640 malicious samples. Features for each event span several rows; they are interpretable to human analysts but do not form a compact table [28]. Another example is KronoDroid. It combines per-sample timestamps with a fixed hybrid table of 489 columns combining static and aggregated dynamic features. Splits are well balanced: the emulator dataset contains 28,745 malware and 35,246 benign samples, while the real device dataset contains 41,382 malware and 36,755 benign samples. Many features are human-readable and systematically structured within CSV files [29]. Based on our requirements, KronoDroid is the most suitable dataset for our research. Unlike Drebin, which is primarily static and highly imbalanced, and repositories such as AndroZoo and AndroDex, which do not provide a standardized, ready-to-train tabular feature matrix out of the box, KronoDroid provides a consistent row-per-sample representation with a fixed schema. In addition, KronoDroid’s hybrid design (static and aggregated dynamic features) and well-balanced benign/malicious splits across emulator and real device subsets allow the effect of synthetic augmentation to be evaluated without confounding shifts in class priors, while keeping features human-interpretable for analysis.
3.2. Data Preparation
The dataset encompasses static and dynamic analysis features extracted from malware samples executed on virtual and physical Android devices, providing a realistic representation of behavioral artifacts such as system calls, permissions, and network interactions. Specifically, we focused on the real device malware subset, which includes detailed feature vectors from infected applications, and the benign samples subset, comprising legitimate Android applications. The benign subset totals 36,755 samples, from which random selections were drawn (via Python 3.9 scripting, without a fixed random state) to maintain experimental balance.
To ensure behavioral diversity, we selected three distinct malware families from the KronoDroid dataset: BankBot (1297 samples, active primarily from 2013 to 2020), Locker/SLocker (1846 samples, spanning 2008 to 2020, with some anomalous timestamps potentially indicating extraction errors, e.g., pre-2008 dates), and Airpush/StopSMS (7775 samples, active primarily from 2008 until 2016, with a variety of anomalous timestamps present in the dataset). BankBot, an Android banking trojan, primarily employs overlay attacks to intercept user credentials on financial applications, often trying to appear as benign software with delayed activation mechanisms [30]. In contrast, Locker/SLocker represents an early form of Android ransomware, characterized by device UI locking via full-screen overlays, activity hijacking, key disabling, and exploitation of device administrator privileges, including file encryption of media and documents on external storage [31]. Airpush, on the other hand, is an advertising library that obfuscates its ad library packages to avoid identification by package names, while SMS trojans disguise themselves as legitimate apps and secretly send SMS messages without user consent, sometimes also reading messages and making phone calls [32]. Families were treated separately to assess detection performance across different malware types.
A 1:1 ratio between malware and benign samples was maintained across configurations to mitigate class imbalance. Benign samples were selected randomly from the entire subset, without temporal filtering, as date-related features were excluded from analysis.
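As a minimal sketch of this balancing step (the function name and `label` column are illustrative, not the study's exact scripts), the benign pool can be undersampled at random to match the malware count:

```python
import pandas as pd

def balance_one_to_one(malware_df: pd.DataFrame, benign_df: pd.DataFrame) -> pd.DataFrame:
    """Undersample benign apps at random to match the malware count (1:1 ratio).

    Mirrors the study's setup: no fixed random_state is passed, so each run
    draws a fresh benign subset.
    """
    benign_subset = benign_df.sample(n=len(malware_df))
    return pd.concat(
        [malware_df.assign(label=1), benign_subset.assign(label=0)],
        ignore_index=True,
    )
```

The same helper serves all three families, since each is balanced against the benign pool independently.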
3.3. Synthetic Data Generation Using Large Language Models
In this study, “LLM” refers specifically to a transformer-based GPT-family model used as a structured record generator; we do not analyze internal attention patterns or model interpretability mechanisms. Instead, we treat the LLM as a black-box generator and evaluate its utility via downstream detection performance under controlled fine-tuning regimes (50 samples/1 epoch for BankBot and Locker/SLocker; 150 samples/3 epochs for Airpush/StopSMS). Although we do not conduct mechanistic interpretability analyses (e.g., attention-weight inspection), we also do not treat the LLM as an unexamined component of the pipeline. Rather, we model it as a conditional generator and study it through controlled interventions, including variations in fine-tuning set size and epochs, explicit schema constraints in the system prompt, sampling temperature, and post-generation validation. We then quantify how these design choices affect both synthetic data fidelity and downstream detection outcomes; for example, the Airpush/StopSMS configuration deliberately increases fine-tuning depth (150 samples, 3 epochs) to assess the sensitivity of synthetic-to-real generalization to training regime and family characteristics.
Synthetic malware samples were generated using OpenAI’s GPT-4.1-mini model (base version dated 14 April 2025) to augment the merged real malware set, addressing data scarcity in specialized families while preserving privacy and ethical constraints on real sample distribution. We selected GPT-4.1-mini because it supports fine-tuning for domain adaptation and offers a cost–throughput profile that enables iterative experimentation and larger-scale synthetic sampling compared with larger proprietary models. Importantly, our objective is not APK or code generation, but the generation of fixed schema tabular malware feature records, for which instruction following and format consistency are the primary model requirements. Direct attempts to emulate tabular structures via CSV input and output yielded highly inconsistent results, such as erroneous column counts ranging from 179 to over 7000, necessitating a conversion to JSON format for improved structural fidelity. This aligns with prior research highlighting limitations of LLMs in handling nonsequential, tabular data formats due to their text-based training paradigms [23].
Fine-tuning on JSONL data failed OpenAI’s moderation due to policy violations inferred from terms like ‘malware’, which triggered safety filters. To circumvent this, feature names were sanitized with neutral equivalents: for BankBot, ‘malware’ to ‘app’, ‘Malware’ to ‘AppType’, ‘BankBot’ to ‘FinTech’, ‘MalFamily’ to ‘AppFamily’, ‘kill’ to ‘stop’, and ‘ptrace’ to ‘trace’; for Locker/SLocker, ‘Locker/SLocker Ransomware’ to ‘HiddenTech’ in place of ‘FinTech’. This enabled passage through moderation. Cost considerations estimated at USD 74.21 for the full 1297 BankBot samples led to subsampling 50 representative records per family (total 100) for fine-tuning and using one epoch. To test whether larger fine-tuning sets and deeper training improve synthetic fidelity, we increased the fine-tuning set to 150 samples and ran 3 epochs for the Airpush/StopSMS family, keeping other settings constant. This creates an intentional, controlled departure from the BankBot and Locker setup (50 samples, 1 epoch) to study method sensitivity (family size, sample count, epochs) on downstream generalization.
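A minimal sketch of this sanitization step follows. The substitution table mirrors the BankBot replacements listed above; the function and its application to raw JSONL text are our illustrative reconstruction, not the exact study code.

```python
import json

# Neutral-term substitutions for the BankBot subset. Ordering matters:
# case-sensitive and longer keys are applied before shorter ones so that,
# e.g., 'MalFamily' is not partially rewritten by a later rule.
BANKBOT_SANITIZE = {
    "MalFamily": "AppFamily",
    "BankBot": "FinTech",
    "Malware": "AppType",
    "malware": "app",
    "ptrace": "trace",
    "kill": "stop",
}

def sanitize_line(jsonl_line: str, table: dict) -> str:
    """Apply neutral-term substitutions to one JSONL training line."""
    for needle, replacement in table.items():
        jsonl_line = jsonl_line.replace(needle, replacement)
    return jsonl_line
```

The sanitized lines parse back to the same schema, so the mapping can be inverted after generation if the original feature names are needed for evaluation.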
Despite successful training, the fine-tuned model exhibited instability, generating extraneous ‘SYS_n’ columns. These were mitigated by converting the data into the JSONL format. While that solved the original problem, the model returned repetitive outputs with identical data. This was then resolved by incorporating an exemplar record in the prompt and iterative refinement using OpenAI’s prompt optimization tool, which provided feedback on clarity and adherence without CLI invocation. Fine-tuning examples followed a structured message format, as shown in Figure 1 below. Each training example uses a standard chat fine-tuning format: a system message constraining the schema and valid ranges, a user message requesting a single record in JSON, and an assistant message containing the target record. While some related work emphasizes autonomous agentic pipelines, our approach is a scripted and reproducible workflow for structured record synthesis and evaluation rather than autonomous task execution. In our implementation, the individual stages (generation, schema validation, deduplication, preprocessing, and fidelity checks) are executed via separate scripts that are manually orchestrated, but each stage is deterministic given the same inputs and configuration. This design prioritizes transparency and repeatability, and can be integrated into a single automated pipeline in future work.
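This chat-format structure can be sketched as follows; the prompt wording and record fields here are placeholders, not the exact prompts used in the study.

```python
import json

def make_finetune_example(record: dict, schema_constraints: str) -> str:
    """Build one chat-format fine-tuning example as a JSONL line:
    system message (schema + valid ranges), user request, assistant target."""
    example = {
        "messages": [
            {"role": "system", "content": schema_constraints},
            {"role": "user", "content": "Generate one record as a single JSON object."},
            {"role": "assistant", "content": json.dumps(record)},
        ]
    }
    return json.dumps(example)
```

One such line per training record, concatenated, yields the JSONL file uploaded for fine-tuning.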
Generation prompts were configured with temperature = 0.7 for variety and max_tokens = 16,384. The system prompt enforced structure and realism, as shown in Figure 2 and Figure 3.
This yielded 392 synthetic records for BankBot and 301 for Locker/SLocker after removal of 12 duplicates. Data quality issues included repeated sha256 hashes (non-unique, with unrealistic patterns like multiple zeros) and occasional omissions in the AppType label (expected as 1 for all synthetic samples). Duplicates were minimal, and as labels are excluded from training, these did not impact detection evaluations.
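Duplicate removal of this kind can be sketched by hashing each record's canonical JSON form (a hypothetical reconstruction; the study's scripts may key on different fields):

```python
import hashlib
import json

def dedupe_records(records: list) -> list:
    """Keep the first occurrence of each record, compared by canonical JSON
    (sort_keys makes the comparison independent of field order)."""
    seen, unique = set(), []
    for rec in records:
        key = hashlib.sha256(json.dumps(rec, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```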
To address potential LLM hallucinations (e.g., unrealistic syscall patterns), we performed an automated fidelity evaluation between the real and synthetic datasets. Each JSON record was transformed into a fixed-length representation by flattening nested fields and summarizing variable-length lists using simple statistics, followed by consistent preprocessing across both datasets. We then quantified multivariate similarity using a K-Nearest-Neighbor (KNN) distance analysis, comparing the distribution of real-to-real KNN distances (baseline intra-real variability) with synthetic-to-real KNN distances (proximity of synthetic samples to the real data manifold). In addition, we computed a kernel-based two-sample distance (maximum mean discrepancy, MMD) and estimated its significance via a permutation test. Finally, we visualized the joint structure of the two datasets using a two-dimensional PCA projection, as shown in Figure 4, Figure 5 and Figure 6 below, plotting real and synthetic samples together to provide an interpretable overview of their overlap in feature space. This pipeline produces the metrics and figures automatically, ensuring that the fidelity assessment is objective, repeatable, and suitable for deployment under real operating conditions. Prompts explicitly instructed uniqueness, labeling, and absence of nulls, but adherence varied, underscoring LLM limitations for tabular synthesis.
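The KNN-distance and MMD components of this fidelity check can be sketched as follows; the kernel bandwidth `gamma`, neighbor count `k`, and the biased MMD estimator are illustrative choices, and the permutation test is omitted for brevity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_profiles(real: np.ndarray, synth: np.ndarray, k: int = 5):
    """Mean real-to-real vs. synthetic-to-real KNN distances.

    If synthetic samples lie on the real data manifold, the two means
    should be of comparable magnitude.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(real)
    real_to_real = nn.kneighbors(real)[0][:, 1:]        # drop the self-distance column
    synth_to_real = nn.kneighbors(synth, n_neighbors=k)[0]
    return real_to_real.mean(), synth_to_real.mean()

def rbf_mmd2(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased squared-MMD estimate with an RBF kernel."""
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
```

In the permutation test, the pooled samples are repeatedly relabeled at random and `rbf_mmd2` recomputed, giving a null distribution against which the observed value is compared.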
3.4. Experimental Setup and Evaluation
After the synthetic data generation and model training, the dataset underwent processing to ensure consistency and suitability for machine learning. From the original 484 features, we excluded identifiers and metadata that could induce overfitting or biases: Malware (binary label), Detection_Ratio (antivirus detection percentage), MalFamily (family identifier), Scanners (detecting antivirus list), TimesSubmitted (submission count), NrContactedIps (network contacts), Package (application package name), sha256 (hash identifier), EarliestModDate, and HighestModDate (modification timestamps). This reduced the feature set to 474 numeric columns.
Non-numeric artifacts were addressed: ‘None’ values in columns such as Activities, NrIntServices, NrIntServicesActions, NrIntActivities, NrIntActivitiesActions, NrIntReceivers, NrIntReceiversActions, TotalIntentFilters, and NrServices, intended as counts of Android component invocations, were imputed to 0. We then filtered out all columns in which more than 70% of the values were zeros. We selected 70% after inspecting the distribution of per-feature zero ratios and choosing a conservative cutoff that removes the sparsest tail while retaining sufficient coverage across syscall and permission groups. We tested performance with those columns intact and with them removed, confirming that there was no major difference in classifier performance. This dropped an additional 87 columns, reducing the feature set to 387 feature columns. The dataset details are shown in Table 1 below.
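A condensed sketch of this preprocessing stage is shown below; the excluded column names follow the KronoDroid schema as listed above, while the helper itself is our illustrative reconstruction.

```python
import pandas as pd

# Identifiers and metadata excluded to avoid label leakage and overfitting.
METADATA_COLS = [
    "Malware", "Detection_Ratio", "MalFamily", "Scanners", "TimesSubmitted",
    "NrContactedIps", "Package", "sha256", "EarliestModDate", "HighestModDate",
]

def preprocess_features(df: pd.DataFrame, zero_cutoff: float = 0.70) -> pd.DataFrame:
    """Drop identifiers/metadata, impute 'None' to 0, and remove columns
    in which more than `zero_cutoff` of the values are zeros."""
    df = df.drop(columns=[c for c in METADATA_COLS if c in df.columns])
    df = df.replace("None", 0).apply(pd.to_numeric, errors="coerce").fillna(0)
    zero_ratio = (df == 0).mean()
    return df.loc[:, zero_ratio <= zero_cutoff]
```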
Outliers were retained, as elevated counts in features such as system calls likely reflect legitimate behavioral variation in Android applications rather than errors. Feature scaling was applied using StandardScaler from scikit-learn to normalize numeric values, addressing the wide variance in scales. We then evaluated 3 scenarios per family (BankBot, Locker/SLocker, and Airpush/StopSMS) across 5 classifiers, i.e., 3 scenarios × 5 algorithms × 3 families = 45 total combinations, to assess synthetic data’s impact on detection accuracy.
Real malware vs. benign: Real malware samples (label 1) paired with an equal number of random benign samples (label 0) for a 1:1 balance; stratified 80/20 train/test split on the combined balanced set.
Real + synthetic malware vs. benign: Concatenation of real and synthetic malware (both label 1) balanced 1:1 against benign (label 0); stratified 80/20 train/test split on the combined balanced set.
Train on synthetic, validate/test on real (synthetic → real generalization): Training on synthetic malware; real malware split 50/50 into validation and test sets (disjoint). Benign split 40/30/30 into train/validation/test (disjoint). Each split undersampled benign to match malware count for a 1:1 balance.
Comparability notes: Because Airpush/StopSMS used 150 fine-tuning samples and 3 epochs, synthetic-to-real results for Airpush are not directly comparable to BankBot/Locker synthetic-to-real results under the 50-sample, 1-epoch regime; this contrast is by design, to assess sensitivity to fine-tuning depth and data size.
Finally, five classifiers were tested: K-Nearest Neighbors (KNNs), Decision Trees (DTs), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Random Forest (RF), all from scikit-learn. Models were wrapped in a pipeline: StandardScaler → Classifier. We focus on a set of widely used baseline classifiers to isolate the effect of synthetic data augmentation on detection performance while keeping the learning pipeline simple and interpretable. Because these models already achieved strong performance on the real and augmented settings in our preliminary experiments, we did not incorporate more complex deep learning architectures in this study. Hyperparameter tuning used grid search with 5-fold StratifiedKFold cross-validation (shuffle enabled) on the training set only, selecting by accuracy. The best estimator was re-fit on the full training set. Data leakage prevention included hashing feature rows (pandas.util.hash_pandas_object) to verify no train/test intersection in the 80/20 scenarios, and explicit disjoint subsets in the synthetic-to-real scenario. On the held-out test set, we reported accuracy, ROC AUC (from predict_proba), precision, recall, F1, and false-positive rate (from the confusion matrix). Nonparametric 95% confidence intervals (CIs) for test accuracy were computed via bootstrap resampling (B = 1000). Confusion matrices were saved for qualitative analysis. For synthetic-to-real, metrics were reported for validation and test, with test as the final basis. Splits were stratified to preserve a 50/50 balance; randomization reflected script defaults, with no random state used. Design choices such as balanced evaluation, uniform preprocessing, isolation, cross-validation, and CIs ensured robust assessment.
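The training and evaluation loop for a single classifier can be sketched as follows, shown for KNN with a toy parameter grid; the study ran the analogous procedure for all five classifiers, and the bootstrap helper here is a simplified reconstruction of the CI computation.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def run_scenario(X, y, param_grid, n_boot=1000):
    """Stratified 80/20 split, grid search on the training set only,
    then a bootstrap 95% CI for test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
    cv = StratifiedKFold(n_splits=5, shuffle=True)
    search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=cv)
    search.fit(X_tr, y_tr)                     # refit=True re-fits the best estimator
    y_pred = search.best_estimator_.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    # Nonparametric bootstrap over the held-out test set
    rng = np.random.default_rng()
    y_te = np.asarray(y_te)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_te), size=len(y_te))
        boot.append(accuracy_score(y_te[idx], y_pred[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return acc, (lo, hi)
```

Swapping `KNeighborsClassifier` for the other four estimators (with their own grids) reproduces the remaining cells of the 45-combination design.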
4. Experimental Evaluation
Experiments were run on an Apple M3 Pro processor with 18 GB RAM using macOS Sonoma 14.3. Python was used with the pandas, NumPy, scikit-learn, matplotlib, and seaborn libraries. The public version of the KronoDroid dataset was used for the experimentation.
The following sections demonstrate the detailed breakdown of experimental evaluation across five classifiers. Each of the individual classifier sections describes the settings used and outcomes of the evaluations. Later in the section, we discuss the results.
4.1. Evaluation Metrics
For this study, we used the Accuracy, ROC AUC, Precision, Recall, F1 Score, False Positive Rate, and Confidence Intervals metrics as shown by Equations (1)–(7) below. In the equations, TP stands for True Positives, TN for True Negatives, FP for False Positives, and FN for False Negatives.
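These metrics follow their standard definitions; for reference, our transcription of the standard formulas is given below, with the bootstrap percentile CI written generically (the order of Equations (1)–(7) in the original may differ).

```latex
\begin{align}
\mathrm{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \\
\mathrm{Recall}    &= \frac{TP}{TP + FN} \\
F_1                &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \\
\mathrm{FPR}       &= \frac{FP}{FP + TN} \\
\mathrm{ROC\ AUC}  &= \int_{0}^{1} \mathrm{TPR}\, d(\mathrm{FPR}) \\
\mathrm{CI}_{95\%} &= \left[\hat{\theta}^{*}_{(2.5)},\ \hat{\theta}^{*}_{(97.5)}\right]
\end{align}
```

Here $\hat{\theta}^{*}_{(q)}$ denotes the $q$-th percentile of the bootstrap distribution of the test-accuracy estimate.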
4.2. K-Nearest Neighbors (KNNs)
The K-Nearest Neighbor (KNN) algorithm is a supervised learning approach that classifies a new instance by comparing it to already labeled examples and assigning the class most common among its k-closest neighbors. In practice, an application is converted into a feature vector, such as permissions, API calls, or intents, and its similarity to other applications is calculated, often with Euclidean distance. The decision is then based on the majority label of those nearest points [33]. Its strength lies in the fact that it does not rely on prior assumptions about data distribution, allowing it to capture subtle patterns that might emerge from evolving malware families [34].
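A minimal illustration of this decision rule follows (not the scikit-learn implementation used in our experiments):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k training points nearest to x (Euclidean)."""
    dists = np.linalg.norm(
        np.asarray(X_train, dtype=float) - np.asarray(x, dtype=float), axis=1
    )
    nearest = np.argsort(dists)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]
```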
Experimental results confirm the usefulness of KNN in malware detection, outperforming several other classifiers [33]. This combination of simplicity, interpretability, and accuracy highlights why KNN is a valuable tool in malware analysis pipelines. The detailed performance metrics for KNN are provided in Table 2, Table 3 and Table 4 below, highlighting key indicators across scenarios. Each table reports results for one malware family. The Real vs. Benign column represents the baseline scenario, where both malware and benign samples were taken from the KronoDroid dataset without synthetic augmentation. The Real + Synthetic vs. Benign column represents results where the malware samples from the KronoDroid dataset were enriched with the synthetic data we generated. The Trained on Synthetic, Tested on Real vs. Benign column represents results where the model was trained exclusively on synthetic malware samples and real benign samples, and then tested on real malware and benign samples from the KronoDroid dataset. A 95% confidence interval (CI) is a range of values, calculated from sample data, that in the long run would contain the true population parameter in about 95% of repeated samples.
In the pure synthesis training scenario, KNN retains very high precision (0.9811), but recall drops sharply (0.4807), resulting in only 0.7357 accuracy, which suggests the synthetic BankBot records capture some strong signatures but miss substantial real family variability.
Under pure synthesis training, Locker/SLocker performance is close to random (0.5119 accuracy; 0.0542 recall), indicating a major synthetic-to-real mismatch where generated samples fail to represent the real family’s behavior distribution.
Airpush/StopSMS generalizes best in the pure synthesis setting, achieving 0.8925 accuracy and 0.7852 recall while maintaining near-perfect precision (0.9997), implying the synthetic samples capture more of the real family’s stable patterns than for the other families.
4.3. Decision Trees
Decision Trees (DTs) are a popular supervised learning approach in malware detection, known for their clear structure and interpretability. They build a hierarchical model by splitting data into branches based on selected features, leading to a final decision such as classifying a record as benign or malicious. This rule-based process is straightforward to follow, which makes DTs particularly useful in security contexts where analysts benefit from understanding why a classification was made [35]. DTs are applied across a variety of malware detection studies, using features that can be static, dynamic, or hybrid in nature [34].
Despite these advantages, DTs can be less robust against advanced malware techniques such as obfuscation or polymorphism. This makes them valuable when combined with other models, for example, in ensemble methods, to strengthen resilience against evasion strategies [35]. The detailed performance metrics for DTs are provided in Table 5, Table 6 and Table 7 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the column definitions and CI interpretation are the same as in Section 4.2.
In the pure synthesis setting, DT achieves moderate accuracy (0.6834) with recall still low (0.3960) despite strong precision (0.9312), again suggesting the synthetic BankBot records are “conservative” and miss many real positives.
For Locker/SLocker, pure synthesis training remains weak (0.5672 accuracy; 0.1842 recall), reinforcing that synthetic Locker/SLocker samples do not adequately cover the real family’s feature space diversity.
Airpush/StopSMS remains the strongest under pure synthesis training (0.8813 accuracy; 0.7670 recall) with very high precision (0.9943), showing substantially better synthetic-to-real transfer than BankBot and Locker/SLocker.
4.4. Logistic Regression
Logistic Regression (LR) has been widely studied as a machine learning approach for malware detection because of its suitability for large datasets and its efficiency in binary classification tasks. It models the likelihood of an application being malicious or benign and has been applied to both software- and hardware-based detection scenarios [36].
From a broader perspective, machine learning techniques like LR are valuable in modern malware detection because traditional methods, such as signature-based detection, are often ineffective against polymorphic and evolving threats. ML enables the automated identification of malicious behavior and reduces detection latency in complex and dynamic environments, which is critical given the growth of malware across PCs, mobile, IoT, and cloud platforms [35]. The detailed performance metrics for LR are provided in Table 8, Table 9 and Table 10 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the column definitions and CI interpretation are the same as in Section 4.2.
With pure synthesis training, LR shows a pronounced generalization gap (0.5824 accuracy; 0.1787 recall) despite high precision (0.9280), indicating the synthetic BankBot data is not sufficiently representative for a linear boundary to recover real positives.
Locker/SLocker is effectively non-functional under pure synthesis training (0.5054 accuracy; 0.0314 recall), which suggests the generated records do not match the real Locker/SLocker feature distribution in a way LR can exploit.
Airpush/StopSMS again transfers noticeably better in the pure synthesis scenario (0.8495 accuracy; 0.7058 recall) while maintaining high precision (0.9906), reinforcing that this family’s synthetic records align more closely with real data than the other families.
4.5. Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron (MLP) is a supervised feedforward artificial neural network and is often described as the base architecture of deep learning models. It is structured as a fully connected network that contains an input layer, an output layer, and one or more hidden layers that perform the main computational tasks. The network generates outputs through activation functions such as ReLU, Tanh, Sigmoid, or Softmax, and is typically trained using backpropagation with optimization methods including stochastic gradient descent and Adam. Adjusting hyperparameters like the number of hidden layers or neurons can be computationally demanding, but MLPs provide the important advantage of modeling non-linear relationships effectively [37].
In malware detection, deep neural networks such as MLPs are valuable because their layered structure can learn and extract abstract features from complex, high-dimensional data. This reduces the dependence on manual feature engineering, which is often necessary in traditional machine learning methods. By processing raw input through successive layers, they can identify hidden patterns in behaviors or structures that may indicate malicious activity [38].
Research shows that neural networks have achieved high levels of accuracy in malware detection across multiple platforms. Detection rates with such models frequently exceed 95%, demonstrating their adaptability and strong performance against evolving and polymorphic malware threats [35]. The detailed performance metrics for MLP are provided in Table 11, Table 12 and Table 13 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the column definitions and CI interpretation are the same as in Section 4.2.
In the pure synthesis setting, the MLP achieves only 0.6410 accuracy with low recall (0.2881), even though precision remains high (0.9791), indicating limited synthetic coverage of real BankBot positives.
For Locker/SLocker, pure synthesis training collapses to near-random performance (0.5005 accuracy) with extremely low recall (0.0130), confirming that synthetic Locker/SLocker records are not realistic enough to support generalization.
Airpush/StopSMS is substantially stronger under pure synthesis training (0.8036 accuracy; 0.6101 recall) with very high precision (0.9954), suggesting the synthetic Airpush/StopSMS records are more faithful (and/or less diverse in a learnable way) than for BankBot and Locker/SLocker.
4.6. Random Forest
Random Forest (RF) is an ensemble learning method that constructs multiple decision trees and aggregates their outcomes through majority voting or averaging. This reduces classification errors and helps prevent overfitting, making it more robust than many traditional single classifiers. Each tree is built on random subsets of features, which increases diversity among the trees and strengthens the model’s generalization capacity [
39].
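A minimal Random Forest sketch with scikit-learn illustrates the mechanics described above: each tree is fit on a bootstrap sample with a random subset of candidate features per split, and the class label is decided by aggregating the trees' votes. The hyperparameters are assumed demonstration values, not this study's exact settings:

```python
# Random Forest sketch (illustrative; n_estimators and max_features
# are assumed demonstration values, not the study's exact settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for a tabular malware/benign feature matrix.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 200 trees, each grown on a bootstrap sample and restricted to
# sqrt(n_features) candidate features per split; predictions are
# aggregated across trees. oob_score reuses the out-of-bag samples
# as a built-in validation estimate.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
```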
In intrusion and malware detection contexts, Random Forest has been applied successfully due to its ability to handle high-dimensional data and its resilience against noise. Studies show that it can achieve both high detection rates and low false alarm rates [
39]. Furthermore, ensemble methods like Random Forest offer substantial accuracy and robustness by combining multiple models, complementing deep learning techniques while remaining computationally efficient [
35,
38]. Additionally, hybrid approaches have integrated RF with feature selection or deep learning frameworks to improve detection of complex threats, such as Android malware and PowerShell-based attacks [
35,
38]. The detailed performance metrics for RF are provided in
Table 14,
Table 15 and
Table 16 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the three scenario columns (Real vs. Benign; Real + Synthetic vs. Benign; Trained on Synthetic, Tested on Real vs. Benign) follow the same definitions given above for the MLP tables, including the 95% confidence intervals.
Under pure synthesis training, RF attains 0.6926 accuracy with perfect precision (1.0000) but low recall (0.3852), implying it learns a narrow set of synthetic patterns that do not cover a large portion of real BankBot variants.
Locker/SLocker remains the weakest in pure synthesis training (0.5005 accuracy; 0.0076 recall), indicating that synthetic Locker/SLocker samples are strongly mismatched to the real family (the model rarely flags real malware).
Airpush/StopSMS again yields the best pure synthesis transfer (0.8589 accuracy; 0.7191 recall) while keeping precision near-perfect (0.9982), consistent with this family benefiting most from the synthetic generation regime and producing more usable synthetic variety than the other families.
4.7. Discussion
Before training, we applied an identical pre-processing pipeline to both real and synthetic records, including removal of identifier and metadata fields, coercion to numeric types with None values imputed as zero for component-count features, sparsity-based feature pruning (zero-ratio > 70%), and feature standardization using StandardScaler within a unified training pipeline. For synthetic records, we additionally performed post-processing to enforce schema conformity (fixed feature set), remove duplicates, and discard malformed outputs prior to model fitting. These controls help ensure that the performance differences reported below primarily reflect distributional alignment between synthetic and real feature vectors, rather than artifacts of formatting or feature availability.
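The shared pre-processing steps above can be sketched as follows. The column names are hypothetical placeholders, and the exact KronoDroid schema is not reproduced here:

```python
# Sketch of the shared pre-processing pipeline (hypothetical column
# names; the real KronoDroid schema differs).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "sha256": ["a1", "b2", "c3", "d4"],      # identifier field -> dropped
    "nr_permissions": [5, None, 12, 7],      # component count; None -> 0
    "rare_syscall": [0, 0, 0, 1],            # zero-ratio 75% -> pruned (>70%)
    "total_syscalls": [310, 120, 980, 455],
})

# 1) Remove identifier/metadata fields.
df = df.drop(columns=["sha256"])
# 2) Coerce to numeric; impute None as zero for count features.
df = df.apply(pd.to_numeric, errors="coerce").fillna(0)
# 3) Sparsity-based pruning: drop features whose zero-ratio exceeds 70%.
zero_ratio = (df == 0).mean()
df = df.loc[:, zero_ratio <= 0.70]
# 4) Standardize the surviving features.
X = StandardScaler().fit_transform(df)
```

Applying the identical steps to real and synthetic records keeps the feature spaces aligned, so downstream score differences reflect distributional differences rather than formatting artifacts.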
The findings highlight both the promise and limitations of using LLM-generated synthetic data for malware detection. Models trained exclusively on real malware and benign data achieved exceptionally high detection performance across the BankBot, Locker, and Airpush families. For BankBot, multiple classifiers, including Logistic Regression and Random Forest, achieved perfect performance (Accuracy = 1.000, ROC AUC = 1.000, Precision = 1.000, Recall = 1.000, F1 = 1.000). Other classifiers, such as KNN and MLP, also produced near-perfect results, with accuracies above 0.99 and ROC AUC consistently >0.99. For Locker, performance was similarly strong, though slightly lower than for BankBot: across classifiers, test accuracy ranged between 0.9715 and 0.9857, with ROC AUC values of 0.986–0.998. For Airpush/StopSMS, real-only models were likewise strong across classifiers (Accuracy ≈ 0.976–0.979; ROC AUC ≈ 0.983–0.998). These results demonstrate that when sufficient real malware samples are available, classifiers achieve near-perfect discrimination between malicious and benign applications.
Augmenting real malware data with LLM-generated synthetic samples produced results that were also strong, but generally slightly lower than using real data alone. For BankBot, test accuracy for the best models ranged from 0.9911 to 0.9985, with ROC AUC between 0.9910 and 0.9998. For Locker, accuracies ranged from 0.9601 to 0.9751, with ROC AUC between 0.9606 and 0.9869. For Airpush/StopSMS, the performance remained excellent and very close to that of real-only (Accuracy ≈ 0.975–0.982; ROC AUC ≈ 0.982–0.997), again slightly below the real-only baseline. While these results remain strong, they consistently fall short of the near-perfect performance observed with real-only training. Importantly, the false-positive rate remained very low (typically <0.04), indicating that synthetic augmentation does not compromise specificity. However, the slight decline in accuracy and recall suggests that the synthetic data introduces some noise or distributional mismatch relative to the real malware, thereby diluting the effectiveness of the models compared to training exclusively on real samples.
When models were trained exclusively on synthetic data and then evaluated on real malware, performance degraded substantially. For BankBot, classifiers achieved moderate test accuracy (≈0.64–0.74) and ROC AUC values up to 0.98, but recall was consistently poor (≈0.29–0.49), indicating that the models missed a large proportion of real malware instances despite maintaining high precision (≈0.93–1.00). For Locker/SLocker, performance was near-random (Accuracy ≈ 0.50–0.57; ROC AUC often < 0.60) with extremely low recall, rendering these models impractical. In contrast, Airpush/StopSMS synthetic-only models generalized substantially better (Accuracy ≈ 0.80–0.89; ROC AUC ≈ 0.86–0.90; Recall ≈ 0.61–0.79), while maintaining very high precision (≈0.99). This improvement coincides with the larger family size and deeper LLM fine-tuning (150 samples, 3 epochs), indicating that synthetic-only generalization is sensitive to family characteristics and the fine-tuning regime.
Our results align with Chalé & Bastian, who report that synthetic augmentation can preserve performance, but purely synthetic training underperforms unless real data is retained, supporting the interpretation that synthetic data often reinforces existing statistical structure rather than introducing new predictive information [
19]. Conversely, our synthetic-only results contradict Rahman et al., who report strong synthetic-only intrusion detection; this discrepancy may reflect differences in domain (network flows vs. Android behavioral features), generator type (GAN vs. LLM), and the difficulty of capturing rare family-specific behaviors in a fixed hybrid feature space [
21].
Real-only training remains the benchmark, with near-perfect results across all three families. Real with synthetic augmentation preserves high performance but is consistently slightly below real-only. Synthetic-only training is family- and method-dependent: weak for Locker/SLocker, moderate for BankBot, and stronger for Airpush/StopSMS under a larger/deeper fine-tuning setup. These results suggest that improving synthetic fidelity via more fine-tuning data/epochs for larger families can materially narrow the synthetic-to-real generalization gap.
5. Conclusions and Future Work
This research set out to evaluate the feasibility of using LLM-generated tabular malware feature records to support Android threat detection. By positioning real-only detection accuracy as a benchmark, the study contextualized the effectiveness of both augmentation and synthetic-only training. The results show that while LLMs can generate structurally consistent malware records and provide meaningful augmentation, they do not yet achieve the realism or diversity needed to serve as a standalone data source. Notably, for Airpush/StopSMS, a larger family that we fine-tuned with 150 samples and 3 epochs, synthetic-only models achieved substantially higher generalization than for the other families, indicating that synthetic utility improves with family size and fine-tuning depth.
The novelty of this work is twofold. First, we cast LLM-based synthesis as a fixed-schema, structured record generation task for Android malware detection and evaluate its utility under a three-scenario protocol that cleanly separates augmentation benefits from synthetic-only transfer. Second, we quantify downstream detection effects across multiple malware families and classifiers, showing that augmentation can preserve strong performance while synthetic-only generalization remains family-dependent and sensitive to the fine-tuning regime, thereby delineating when LLM-generated tabular records are useful and when they are not yet sufficient as a standalone training source.
The contribution of this work lies less in raw performance metrics and more in the insights it provides for practice. First, it demonstrates that synthetic augmentation can be applied without undermining specificity, offering a practical way to enrich scarce datasets. Second, it reveals the fragility of synthetic-only training, underscoring the gap between generated records and operationally valid malware. Third, it highlights methodological considerations such as prompt design, post-processing, and validation, which are crucial when applying LLMs to structured cybersecurity data. These findings establish a foundation for future exploration of synthetic data as both a supplement and a research tool in security analytics.
The findings of the study suggest that LLM-generated tabular malware feature records can be a useful augmentation tool but should not yet be relied upon as a primary data source. Synthetic augmentation can increase training volume while largely preserving detection specificity, but its ability to improve performance in truly low-data regimes remains to be evaluated. However, deploying classifiers trained exclusively on synthetic data would be premature, as current LLM outputs do not generalize reliably to real-world threats. In practice, synthetic malware is best applied as a complement to, rather than a substitute for, real data, supporting tasks such as adversarial training, red teaming, and benchmarking in environments where access to sensitive datasets is limited.
Future work will systematically vary fine-tuning factors such as sample count, epochs, and prompt design across malware families to measure their impact on synthetic fidelity and real-world generalization. By scaling dataset size and prompt diversity, the goal is to reduce distribution drift, improve recall, and avoid increased false positives. Planned experiments will assess fidelity using statistical tests and evaluate downstream model performance, especially on rare families. The work will also explore more capable, open-weight LLMs with efficient adapters to generate larger datasets and consider adversarial or sandbox-based pipelines for richer synthetic behavior.