1. Introduction
Android malware continues to outpace signature-based defenses, with families evolving through obfuscation, polymorphism, and rapid variant creation. Although machine learning (ML) and deep learning (DL) pipelines have advanced detection accuracy on benchmark datasets, their effectiveness depends heavily on timely, high-quality labels and balanced class distributions, conditions rarely achieved in operational settings. Data scarcity is particularly acute for specific malware families and behaviors, limiting classifier robustness and hindering reproducibility across studies [1,2].
Recent work has explored different ways to mitigate these issues, yet there remains no empirical evaluation of how synthetically generated Android malware, produced specifically by LLMs, influences detection performance [3]. This research addresses that gap by generating structured malware records with a fine-tuned LLM and systematically assessing their utility in training and augmenting ML classifiers. Using the KronoDroid dataset for its hybrid static–dynamic feature space and balanced malicious/benign splits, we fine-tune GPT-4.1-mini to emulate the characteristics of three families, BankBot, Locker/SLocker, and Airpush/StopSMS; resolve generation pathologies through prompt engineering and post hoc filtering; and evaluate the downstream impact across five classifiers under three scenarios: real-only training, real-plus-synthetic augmentation, and synthetic-only generalization.
This paper provides the following contributions:
Fine-tuning pipeline for structured malware feature vectors: A reproducible workflow to fine-tune an LLM to generate fixed schema Android malware records, including schema sanitization, prompt constraints, and post-processing to remove malformed outputs.
Evaluation protocol for synthetic utility in malware detection: A three-scenario benchmarking design (real-only, real and synthetic, synthetic-to-real generalization) with leakage checks, consistent pre-processing, and bootstrap confidence intervals.
Family-level evidence across three Android malware families and five classifiers: Quantitative results showing (i) augmentation preserves near baseline detection, and (ii) synthetic-only training is unreliable and family dependent.
Sensitivity evidence on fine-tuning depth and richer data: The Airpush/StopSMS configuration (where we used more fine-tuning samples and more epochs) narrows the synthetic-to-real gap compared to smaller fine-tuning regimes.
The rest of the paper is structured as follows: Section 2 contains the related works, Section 3 delivers the methodology, Section 4 presents the experimental results, and Section 5 contains the conclusions and future work.
3. Methodology
3.1. Dataset Selection
As we have discussed above, the quality of the original data is crucial. Therefore, it is essential to analyze and select the right dataset with Android malware samples to use for further synthetic data generation and accuracy evaluations.
Elnashar, White, and Schmidt examined prompt styles and output formats for GPT-4o across different file formats such as JSON, YAML, and Hybrid CSV/Prefix. While many use JSON and YAML because they are structured and readable, their research investigated Hybrid CSV/Prefix, which combines CSV with prefixed identifiers, as a less resource-intensive alternative. They found that CSV/Prefix, a fixed-schema, row-per-sample format, delivers the best results for synthetic data generation tasks with large language models and is the most efficient in terms of token usage and processing time [23]. Based on this, we prioritize datasets that offer row-per-sample organization.
Because Shaukat, Luo, and Varadharajan showed improved detection rates with a hybrid approach, we prioritize a dataset that has both static and dynamic features [16]. Furthermore, even though this research aims to be a proof of concept, we keep scalability in mind and therefore aim for a well-balanced dataset, so that the approach discussed here could be applied to the whole dataset. Finally, dataset features should be human-interpretable to provide a better understanding of the underlying data.
The Drebin dataset is often considered the baseline for Android malware datasets. However, it is not well balanced, with only 5560 of the 123,453 analyzed apps being malware, and it only uses static features [24]. MalGenome contributed early behavioral insights but is malware-only, containing 1260 pieces of malware from 49 families [25]. AndroZoo provides the largest number of analyzed APKs, making it ideal as a source repository; however, it is not a feature table ready for use out of the box, and it lacks uniform labels and dynamic traces [26].
AndroDex explicitly targets obfuscation, releasing 24,746 samples by converting DEX bytecode into images. Although it is robust for some use cases, there is no tabular feature matrix [27].
Two newer datasets explicitly target reproducible, human-readable features. MaDroid offers 50,429 labeled system call sequences collected over 14 years across 10 Android marketplaces and is maliciousness-aware, using the VirusTotal rating as an attribute. It has a healthy class balance, with 24,789 benign and 25,640 malicious samples. Features for each event span several rows; they are interpretable to human analysts but do not form a compact table [28]. Another example is KronoDroid. It combines per-sample timestamps with a fixed hybrid table of 489 columns combining static and aggregated dynamic features. Splits are well balanced: the emulator dataset contains 28,745 malware and 35,246 benign samples, while the real device dataset contains 41,382 malware and 36,755 benign samples. Many features are human-readable and systematically structured within CSV files [29]. Based on our requirements, KronoDroid is the most suitable dataset for our research. Unlike Drebin, which is primarily static and highly imbalanced, and repositories such as AndroZoo and AndroDex, which do not provide a standardized, ready-to-train tabular feature matrix out of the box, KronoDroid provides a consistent row-per-sample representation with a fixed schema. In addition, KronoDroid’s hybrid design (static and aggregated dynamic features) and well-balanced benign/malicious splits across emulator and real device subsets allow the effect of synthetic augmentation to be evaluated without confounding shifts in class priors, while keeping features human-interpretable for analysis.
3.2. Data Preparation
The dataset encompasses static and dynamic analysis features extracted from malware samples executed on virtual and physical Android devices, providing a realistic representation of behavioral artifacts such as system calls, permissions, and network interactions. Specifically, we focused on the real device malware subset, which includes detailed feature vectors from infected applications, and the benign samples subset, comprising legitimate Android applications. The benign subset totals 36,755 samples, from which random selections were drawn (via Python 3.9 scripting, without a fixed random state) to maintain experimental balance.
To ensure behavioral diversity, we selected three distinct malware families from the KronoDroid dataset: BankBot (1297 samples, active primarily from 2013 to 2020), Locker/SLocker (1846 samples, spanning 2008 to 2020, with some anomalous timestamps potentially indicating extraction errors, e.g., pre-2008 dates), and Airpush/StopSMS (7775 samples, active primarily from 2008 until 2016, with a variety of anomalous timestamps present in the dataset). BankBot, an Android banking trojan, primarily employs overlay attacks to intercept user credentials on financial applications, often trying to appear as benign software with delayed activation mechanisms [30]. In contrast, Locker/SLocker represents an early form of Android ransomware, characterized by device UI locking via full-screen overlays, activity hijacking, key disabling, and exploitation of device administrator privileges, including file encryption of media and documents on external storage [31]. Airpush, on the other hand, is an advertising library that obfuscates its ad library packages to avoid identification by package names, while SMS trojans disguise themselves as legitimate apps and secretly send SMS messages without user consent, sometimes also reading messages and making phone calls [32]. Families were treated separately to assess detection performance across different malware types.
A 1:1 ratio between malware and benign samples was maintained across configurations to mitigate class imbalance. Benign samples were selected randomly from the entire subset, without temporal filtering, as date-related features were excluded from analysis.
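As a minimal sketch of this balancing step (the function name and `label` column are illustrative, not the study's exact scripts), the benign pool can be undersampled at random to match the malware count:

```python
import pandas as pd

def balance_one_to_one(malware_df: pd.DataFrame, benign_df: pd.DataFrame) -> pd.DataFrame:
    """Undersample benign apps at random to match the malware count (1:1 ratio).

    Mirrors the study's setup: no fixed random_state is passed, so each run
    draws a fresh benign subset.
    """
    benign_subset = benign_df.sample(n=len(malware_df))
    return pd.concat(
        [malware_df.assign(label=1), benign_subset.assign(label=0)],
        ignore_index=True,
    )
```

The same helper serves all three families, since each is balanced against the benign pool independently.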
3.3. Synthetic Data Generation Using Large Language Models
In this study, “LLM” refers specifically to a transformer-based GPT-family model used as a structured record generator; we do not analyze internal attention patterns or model interpretability mechanisms. Instead, we treat the LLM as a black-box generator and evaluate its utility via downstream detection performance under controlled fine-tuning regimes (50 samples/1 epoch for BankBot and Locker/SLocker; 150 samples/3 epochs for Airpush/StopSMS). Although we do not conduct mechanistic interpretability analyses (e.g., attention-weight inspection), we also do not treat the LLM as an unexamined component of the pipeline. Rather, we model it as a conditional generator and study it through controlled interventions, including variations in fine-tuning set size and epochs, explicit schema constraints in the system prompt, sampling temperature, and post-generation validation. We then quantify how these design choices affect both synthetic data fidelity and downstream detection outcomes; for example, the Airpush/StopSMS configuration deliberately increases fine-tuning depth (150 samples, 3 epochs) to assess the sensitivity of synthetic-to-real generalization to training regime and family characteristics.
Synthetic malware samples were generated using OpenAI’s GPT-4.1-mini model (base version dated 14 April 2025) to augment the merged real malware set, addressing data scarcity in specialized families while preserving privacy and ethical constraints on real sample distribution. We selected GPT-4.1-mini because it supports fine-tuning for domain adaptation and offers a cost–throughput profile that enables iterative experimentation and larger-scale synthetic sampling compared with larger proprietary models. Importantly, our objective is not APK or code generation, but the generation of fixed schema tabular malware feature records, for which instruction following and format consistency are the primary model requirements. Direct attempts to emulate tabular structures via CSV input and output yielded highly inconsistent results, such as erroneous column counts ranging from 179 to over 7000, necessitating a conversion to JSON format for improved structural fidelity. This aligns with prior research highlighting limitations of LLMs in handling nonsequential, tabular data formats due to their text-based training paradigms [23].
Fine-tuning on JSONL data failed OpenAI’s moderation due to policy violations inferred from terms like ‘malware’, which triggered safety filters. To circumvent this, feature names were sanitized with neutral equivalents: for BankBot, ‘malware’ to ‘app’, ‘Malware’ to ‘AppType’, ‘BankBot’ to ‘FinTech’, ‘MalFamily’ to ‘AppFamily’, ‘kill’ to ‘stop’, and ‘ptrace’ to ‘trace’; for Locker/SLocker, ‘Locker/SLocker Ransomware’ to ‘HiddenTech’ in place of ‘FinTech’. This enabled passage through moderation. Cost considerations estimated at USD 74.21 for the full 1297 BankBot samples led to subsampling 50 representative records per family (total 100) for fine-tuning and using one epoch. To test whether larger fine-tuning sets and deeper training improve synthetic fidelity, we increased the fine-tuning set to 150 samples and ran 3 epochs for the Airpush/StopSMS family, keeping other settings constant. This creates an intentional, controlled departure from the BankBot and Locker setup (50 samples, 1 epoch) to study method sensitivity (family size, sample count, epochs) on downstream generalization.
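A minimal sketch of this sanitization step follows. The substitution table mirrors the BankBot replacements listed above; the function and its application to raw JSONL text are our illustrative reconstruction, not the exact study code.

```python
import json

# Neutral-term substitutions for the BankBot subset. Ordering matters:
# case-sensitive and longer keys are applied before shorter ones so that,
# e.g., 'MalFamily' is not partially rewritten by a later rule.
BANKBOT_SANITIZE = {
    "MalFamily": "AppFamily",
    "BankBot": "FinTech",
    "Malware": "AppType",
    "malware": "app",
    "ptrace": "trace",
    "kill": "stop",
}

def sanitize_line(jsonl_line: str, table: dict) -> str:
    """Apply neutral-term substitutions to one JSONL training line."""
    for needle, replacement in table.items():
        jsonl_line = jsonl_line.replace(needle, replacement)
    return jsonl_line
```

The sanitized lines parse back to the same schema, so the mapping can be inverted after generation if the original feature names are needed for evaluation.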
Despite successful training, the fine-tuned model exhibited instability, generating extraneous ‘SYS_n’ columns. These were mitigated by converting the data into the JSONL format. While that solved the original problem, the model returned repetitive outputs with identical data. This was then resolved by incorporating an exemplar record in the prompt and iterative refinement using OpenAI’s prompt optimization tool, which provided feedback on clarity and adherence without CLI invocation. Fine-tuning examples followed a structured message format, as shown in Figure 1 below. Each training example uses a standard chat fine-tuning format: a system message constraining the schema and valid ranges, a user message requesting a single record in JSON, and an assistant message containing the target record. While some related work emphasizes autonomous agentic pipelines, our approach is a scripted and reproducible workflow for structured record synthesis and evaluation rather than autonomous task execution. In our implementation, the individual stages (generation, schema validation, deduplication, preprocessing, and fidelity checks) are executed via separate scripts that are manually orchestrated, but each stage is deterministic given the same inputs and configuration. This design prioritizes transparency and repeatability, and can be integrated into a single automated pipeline in future work.
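This chat-format structure can be sketched as follows; the prompt wording and record fields here are placeholders, not the exact prompts used in the study.

```python
import json

def make_finetune_example(record: dict, schema_constraints: str) -> str:
    """Build one chat-format fine-tuning example as a JSONL line:
    system message (schema + valid ranges), user request, assistant target."""
    example = {
        "messages": [
            {"role": "system", "content": schema_constraints},
            {"role": "user", "content": "Generate one record as a single JSON object."},
            {"role": "assistant", "content": json.dumps(record)},
        ]
    }
    return json.dumps(example)
```

One such line per training record, concatenated, yields the JSONL file uploaded for fine-tuning.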
Generation prompts were configured with temperature = 0.7 for variety and max_tokens = 16,384. The system prompt enforced structure and realism, as shown in Figure 2 and Figure 3.
This yielded 392 synthetic records for BankBot and 301 for Locker/SLocker after removal of 12 duplicates. Data quality issues included repeated sha256 hashes (non-unique, with unrealistic patterns like multiple zeros) and occasional omissions in the AppType label (expected as 1 for all synthetic samples). Duplicates were minimal, and as labels are excluded from training, these did not impact detection evaluations.
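Duplicate removal of this kind can be sketched by hashing each record's canonical JSON form (a hypothetical reconstruction; the study's scripts may key on different fields):

```python
import hashlib
import json

def dedupe_records(records: list) -> list:
    """Keep the first occurrence of each record, compared by canonical JSON
    (sort_keys makes the comparison independent of field order)."""
    seen, unique = set(), []
    for rec in records:
        key = hashlib.sha256(json.dumps(rec, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```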
To address potential LLM hallucinations (e.g., unrealistic syscall patterns), we performed an automated fidelity evaluation between the real and synthetic datasets. Each JSON record was transformed into a fixed-length representation by flattening nested fields and summarizing variable-length lists using simple statistics, followed by consistent preprocessing across both datasets. We then quantified multivariate similarity using a K-Nearest-Neighbor (KNN) distance analysis, comparing the distribution of real-to-real KNN distances (baseline intra-real variability) with synthetic-to-real KNN distances (proximity of synthetic samples to the real data manifold). In addition, we computed a kernel-based two-sample distance (maximum mean discrepancy, MMD) and estimated its significance via a permutation test. Finally, we visualized the joint structure of the two datasets using a two-dimensional PCA projection, as shown in Figure 4, Figure 5 and Figure 6 below, plotting real and synthetic samples together to provide an interpretable overview of their overlap in feature space. This pipeline produces the metrics and figures automatically, ensuring that the fidelity assessment is objective, repeatable, and suitable for deployment under real operating conditions. Prompts explicitly instructed uniqueness, labeling, and absence of nulls, but adherence varied, underscoring LLM limitations for tabular synthesis.
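The KNN-distance and MMD components of this fidelity check can be sketched as follows; the kernel bandwidth `gamma`, neighbor count `k`, and the biased MMD estimator are illustrative choices, and the permutation test is omitted for brevity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_profiles(real: np.ndarray, synth: np.ndarray, k: int = 5):
    """Mean real-to-real vs. synthetic-to-real KNN distances.

    If synthetic samples lie on the real data manifold, the two means
    should be of comparable magnitude.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(real)
    real_to_real = nn.kneighbors(real)[0][:, 1:]        # drop the self-distance column
    synth_to_real = nn.kneighbors(synth, n_neighbors=k)[0]
    return real_to_real.mean(), synth_to_real.mean()

def rbf_mmd2(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased squared-MMD estimate with an RBF kernel."""
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
```

In the permutation test, the pooled samples are repeatedly relabeled at random and `rbf_mmd2` recomputed, giving a null distribution against which the observed value is compared.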
3.4. Experimental Setup and Evaluation
After the synthetic data generation and model training, the dataset underwent processing to ensure consistency and suitability for machine learning. From the original 484 features, we excluded identifiers and metadata that could induce overfitting or biases: Malware (binary label), Detection_Ratio (antivirus detection percentage), MalFamily (family identifier), Scanners (detecting antivirus list), TimesSubmitted (submission count), NrContactedIps (network contacts), Package (application package name), sha256 (hash identifier), EarliestModDate, and HighestModDate (modification timestamps). This reduced the feature set to 474 numeric columns.
Non-numeric artifacts were addressed: ‘None’ values in columns such as Activities, NrIntServices, NrIntServicesActions, NrIntActivities, NrIntActivitiesActions, NrIntReceivers, NrIntReceiversActions, TotalIntentFilters, and NrServices, intended as counts of Android component invocations, were imputed to 0. We then filtered out all columns in which more than 70% of the values were zeros. We selected 70% after inspecting the distribution of per-feature zero ratios and choosing a conservative cutoff that removes the sparsest tail while retaining sufficient coverage across syscall and permission groups. We tested performance with those columns intact and with them removed, confirming that there was no major difference in classifier performance. This dropped an additional 87 columns, reducing the feature set to 387 feature columns. The dataset details are shown in Table 1 below.
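A condensed sketch of this preprocessing stage is shown below; the excluded column names follow the KronoDroid schema as listed above, while the helper itself is our illustrative reconstruction.

```python
import pandas as pd

# Identifiers and metadata excluded to avoid label leakage and overfitting.
METADATA_COLS = [
    "Malware", "Detection_Ratio", "MalFamily", "Scanners", "TimesSubmitted",
    "NrContactedIps", "Package", "sha256", "EarliestModDate", "HighestModDate",
]

def preprocess_features(df: pd.DataFrame, zero_cutoff: float = 0.70) -> pd.DataFrame:
    """Drop identifiers/metadata, impute 'None' to 0, and remove columns
    in which more than `zero_cutoff` of the values are zeros."""
    df = df.drop(columns=[c for c in METADATA_COLS if c in df.columns])
    df = df.replace("None", 0).apply(pd.to_numeric, errors="coerce").fillna(0)
    zero_ratio = (df == 0).mean()
    return df.loc[:, zero_ratio <= zero_cutoff]
```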
Outliers were retained, as elevated counts in features such as system calls likely reflect legitimate behavioral variation in Android applications rather than errors. Feature scaling was applied using StandardScaler from scikit-learn to normalize numeric values, addressing the wide variance in scales. We then evaluated 3 scenarios per family (BankBot, Locker/SLocker, and Airpush/StopSMS) across 5 classifiers, i.e., 3 scenarios × 5 algorithms × 3 families = 45 total combinations, to assess synthetic data’s impact on detection accuracy.
Real malware vs. benign: Real malware samples (label 1) paired with an equal number of random benign samples (label 0) for a 1:1 balance; stratified 80/20 train/test split on the combined balanced set.
Real + synthetic malware vs. benign: Concatenation of real and synthetic malware (both label 1) balanced 1:1 against benign (label 0); stratified 80/20 train/test split on the combined balanced set.
Train on synthetic, validate/test on real (synthetic → real generalization): Training on synthetic malware; real malware split 50/50 into validation and test sets (disjoint). Benign split 40/30/30 into train/validation/test (disjoint). Each split undersampled benign to match malware count for a 1:1 balance.
Comparability notes: Because Airpush/StopSMS used 150 fine-tuning samples and 3 epochs, synthetic-to-real results for Airpush are not directly comparable to BankBot/Locker synthetic-to-real results under the 50-sample, 1-epoch regime; this contrast is by design, to assess sensitivity to fine-tuning depth and data size.
Finally, five classifiers were tested: K-Nearest Neighbors (KNNs), Decision Trees (DTs), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Random Forest (RF), all from scikit-learn. Models were wrapped in a pipeline: StandardScaler → Classifier. We focus on a set of widely used baseline classifiers to isolate the effect of synthetic data augmentation on detection performance while keeping the learning pipeline simple and interpretable. Because these models already achieved strong performance on the real and augmented settings in our preliminary experiments, we did not incorporate more complex deep learning architectures in this study. Hyperparameter tuning used grid search with 5-fold StratifiedKFold cross-validation (shuffle enabled) on the training set only, selecting by accuracy. The best estimator was re-fit on the full training set. Data leakage prevention included hashing feature rows (pandas.util.hash_pandas_object) to verify no train/test intersection in the 80/20 scenarios, and explicit disjoint subsets in the synthetic-to-real scenario. On the held-out test set, we reported accuracy, ROC AUC (from predict_proba), precision, recall, F1, and false-positive rate (from the confusion matrix). Nonparametric 95% confidence intervals (CIs) for test accuracy were computed via bootstrap resampling (B = 1000). Confusion matrices were saved for qualitative analysis. For synthetic-to-real, metrics were reported for validation and test, with test as the final basis. Splits were stratified to preserve a 50/50 balance; randomization reflected script defaults, with no random state used. Design choices such as balanced evaluation, uniform preprocessing, isolation, cross-validation, and CIs ensured robust assessment.
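The training and evaluation loop for a single classifier can be sketched as follows, shown for KNN with a toy parameter grid; the study ran the analogous procedure for all five classifiers, and the bootstrap helper here is a simplified reconstruction of the CI computation.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def run_scenario(X, y, param_grid, n_boot=1000):
    """Stratified 80/20 split, grid search on the training set only,
    then a bootstrap 95% CI for test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
    cv = StratifiedKFold(n_splits=5, shuffle=True)
    search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=cv)
    search.fit(X_tr, y_tr)                     # refit=True re-fits the best estimator
    y_pred = search.best_estimator_.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    # Nonparametric bootstrap over the held-out test set
    rng = np.random.default_rng()
    y_te = np.asarray(y_te)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_te), size=len(y_te))
        boot.append(accuracy_score(y_te[idx], y_pred[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return acc, (lo, hi)
```

Swapping `KNeighborsClassifier` for the other four estimators (with their own grids) reproduces the remaining cells of the 45-combination design.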
4. Experimental Evaluation
Experiments were run on an Apple M3 Pro processor with 18 GB RAM using macOS Sonoma 14.3. Python was used with the pandas, NumPy, scikit-learn, matplotlib, and seaborn libraries. The public version of the KronoDroid dataset was used for the experimentation.
The following sections demonstrate the detailed breakdown of experimental evaluation across five classifiers. Each of the individual classifier sections describes the settings used and outcomes of the evaluations. Later in the section, we discuss the results.
4.1. Evaluation Metrics
For this study, we used the Accuracy, ROC AUC, Precision, Recall, F1 Score, False Positive Rate, and Confidence Intervals metrics as shown by Equations (1)–(7) below. In the equations, TP stands for True Positives, TN for True Negatives, FP for False Positives, and FN for False Negatives.
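These metrics follow their standard definitions; for reference, our transcription of the standard formulas is given below, with the bootstrap percentile CI written generically (the order of Equations (1)–(7) in the original may differ).

```latex
\begin{align}
\mathrm{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \\
\mathrm{Recall}    &= \frac{TP}{TP + FN} \\
F_1                &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \\
\mathrm{FPR}       &= \frac{FP}{FP + TN} \\
\mathrm{ROC\ AUC}  &= \int_{0}^{1} \mathrm{TPR}\, d(\mathrm{FPR}) \\
\mathrm{CI}_{95\%} &= \left[\hat{\theta}^{*}_{(2.5)},\ \hat{\theta}^{*}_{(97.5)}\right]
\end{align}
```

Here $\hat{\theta}^{*}_{(q)}$ denotes the $q$-th percentile of the bootstrap distribution of the test-accuracy estimate.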
4.2. K-Nearest Neighbors (KNNs)
The K-Nearest Neighbor (KNN) algorithm is a supervised learning approach that classifies a new instance by comparing it to already labeled examples and assigning the class most common among its k-closest neighbors. In practice, an application is converted into a feature vector, such as permissions, API calls, or intents, and its similarity to other applications is calculated, often with Euclidean distance. The decision is then based on the majority label of those nearest points [33]. Its strength lies in the fact that it does not rely on prior assumptions about data distribution, allowing it to capture subtle patterns that might emerge from evolving malware families [34].
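A minimal illustration of this decision rule follows (not the scikit-learn implementation used in our experiments):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k training points nearest to x (Euclidean)."""
    dists = np.linalg.norm(
        np.asarray(X_train, dtype=float) - np.asarray(x, dtype=float), axis=1
    )
    nearest = np.argsort(dists)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]
```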
Experimental results confirm the usefulness of KNN in malware detection, outperforming several other classifiers [33]. This combination of simplicity, interpretability, and accuracy highlights why KNN is a valuable tool in malware analysis pipelines. The detailed performance metrics for KNN are provided in Table 2, Table 3 and Table 4 below, highlighting key indicators across scenarios. Each table reports results for one malware family. The Real vs. Benign column represents the baseline scenario, where both malware and benign samples were taken from the KronoDroid dataset without synthetic augmentation. The Real + Synthetic vs. Benign column represents results where the malware samples from the KronoDroid dataset were enriched with the synthetic data we generated. The Trained on Synthetic, Tested on Real vs. Benign column represents results where the model was trained exclusively on synthetic malware samples and real benign samples, and then tested on real malware and benign samples from the KronoDroid dataset. A 95% confidence interval (CI) is a range of values, calculated from sample data, that in the long run would contain the true population parameter in about 95% of repeated samples.
In the pure synthesis training scenario, KNN retains very high precision (0.9811), but recall drops sharply (0.4807), resulting in only 0.7357 accuracy, which suggests the synthetic BankBot records capture some strong signatures but miss substantial real family variability.
Under pure synthesis training, Locker/SLocker performance is close to random (0.5119 accuracy; 0.0542 recall), indicating a major synthetic-to-real mismatch where generated samples fail to represent the real family’s behavior distribution.
Airpush/StopSMS generalizes best in the pure synthesis setting, achieving 0.8925 accuracy and 0.7852 recall while maintaining near-perfect precision (0.9997), implying the synthetic samples capture more of the real family’s stable patterns than for the other families.
4.3. Decision Trees
Decision Trees (DTs) are a popular supervised learning approach in malware detection, known for their clear structure and interpretability. They build a hierarchical model by splitting data into branches based on selected features, leading to a final decision such as classifying a record as benign or malicious. This rule-based process is straightforward to follow, which makes DTs particularly useful in security contexts where analysts benefit from understanding why a classification was made [35]. DTs are applied across a variety of malware detection studies, using features that can be static, dynamic, or hybrid in nature [34].
Despite these advantages, DTs can be less robust against advanced malware techniques such as obfuscation or polymorphism. This makes them valuable when combined with other models, for example, in ensemble methods, to strengthen resilience against evasion strategies [35]. The detailed performance metrics for DTs are provided in Table 5, Table 6 and Table 7 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the column definitions and CI interpretation are the same as in Section 4.2.
In the pure synthesis setting, DT achieves moderate accuracy (0.6834) with recall still low (0.3960) despite strong precision (0.9312), again suggesting the synthetic BankBot records are “conservative” and miss many real positives.
For Locker/SLocker, pure synthesis training remains weak (0.5672 accuracy; 0.1842 recall), reinforcing that synthetic Locker/SLocker samples do not adequately cover the real family’s feature space diversity.
Airpush/StopSMS remains the strongest under pure synthesis training (0.8813 accuracy; 0.7670 recall) with very high precision (0.9943), showing substantially better synthetic-to-real transfer than BankBot and Locker/SLocker.
4.4. Logistic Regression
Logistic Regression (LR) has been widely studied as a machine learning approach for malware detection because of its suitability for large datasets and its efficiency in binary classification tasks. It models the likelihood of an application being malicious or benign and has been applied to both software- and hardware-based detection scenarios [36].
From a broader perspective, machine learning techniques like LR are valuable in modern malware detection because traditional methods, such as signature-based detection, are often ineffective against polymorphic and evolving threats. ML enables the automated identification of malicious behavior and reduces detection latency in complex and dynamic environments, which is critical given the growth of malware across PCs, mobile, IoT, and cloud platforms [35]. The detailed performance metrics for LR are provided in Table 8, Table 9 and Table 10 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the column definitions and CI interpretation are the same as in Section 4.2.
With pure synthesis training, LR shows a pronounced generalization gap (0.5824 accuracy; 0.1787 recall) despite high precision (0.9280), indicating the synthetic BankBot data is not sufficiently representative for a linear boundary to recover real positives.
Locker/SLocker is effectively non-functional under pure synthesis training (0.5054 accuracy; 0.0314 recall), which suggests the generated records do not match the real Locker/SLocker feature distribution in a way LR can exploit.
Airpush/StopSMS again transfers noticeably better in the pure synthesis scenario (0.8495 accuracy; 0.7058 recall) while maintaining high precision (0.9906), reinforcing that this family’s synthetic records align more closely with real data than the other families.
4.5. Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron (MLP) is a supervised feedforward artificial neural network and is often described as the base architecture of deep learning models. It is structured as a fully connected network that contains an input layer, an output layer, and one or more hidden layers that perform the main computational tasks. The network generates outputs through activation functions such as ReLU, Tanh, Sigmoid, or Softmax, and is typically trained using backpropagation with optimization methods including stochastic gradient descent and Adam. Adjusting hyperparameters like the number of hidden layers or neurons can be computationally demanding, but MLPs provide the important advantage of modeling non-linear relationships effectively [37].
In malware detection, deep neural networks such as MLPs are valuable because their layered structure can learn and extract abstract features from complex, high-dimensional data. This reduces the dependence on manual feature engineering, which is often necessary in traditional machine learning methods. By processing raw input through successive layers, they can identify hidden patterns in behaviors or structures that may indicate malicious activity [38].
Research shows that neural networks have achieved high levels of accuracy in malware detection across multiple platforms. Detection rates with such models frequently exceed 95%, demonstrating their adaptability and strong performance against evolving and polymorphic malware threats [35]. The detailed performance metrics for MLP are provided in Table 11, Table 12 and Table 13 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the column definitions and CI interpretation are the same as in Section 4.2.
In the pure synthesis setting, the MLP achieves only 0.6410 accuracy with low recall (0.2881), even though precision remains high (0.9791), indicating limited synthetic coverage of real BankBot positives.
For Locker/SLocker, pure synthesis training collapses to near-random performance (0.5005 accuracy) with extremely low recall (0.0130), confirming that synthetic Locker/SLocker records are not realistic enough to support generalization.
Airpush/StopSMS is substantially stronger under pure synthesis training (0.8036 accuracy; 0.6101 recall) with very high precision (0.9954), suggesting the synthetic Airpush/StopSMS records are more faithful (and/or less diverse in a learnable way) than for BankBot and Locker/SLocker.
4.6. Random Forest
Random Forest (RF) is an ensemble learning method that constructs multiple decision trees and aggregates their outcomes through majority voting or averaging. This reduces classification errors and helps prevent overfitting, making it more robust than many traditional single classifiers. Each tree is built on random subsets of features, which increases diversity among the trees and strengthens the model’s generalization capacity [
39].
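A minimal Random Forest sketch with scikit-learn illustrates the mechanics described above: each tree is fit on a bootstrap sample with a random subset of candidate features per split, and the class label is decided by aggregating the trees' votes. The hyperparameters are assumed demonstration values, not this study's exact settings:

```python
# Random Forest sketch (illustrative; n_estimators and max_features
# are assumed demonstration values, not the study's exact settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for a tabular malware/benign feature matrix.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 200 trees, each grown on a bootstrap sample and restricted to
# sqrt(n_features) candidate features per split; predictions are
# aggregated across trees. oob_score reuses the out-of-bag samples
# as a built-in validation estimate.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
```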
In intrusion and malware detection contexts, Random Forest has been applied successfully due to its ability to handle high-dimensional data and its resilience against noise. Studies show that it can achieve both high detection rates and low false alarm rates [
39]. Furthermore, ensemble methods like Random Forest offer substantial accuracy and robustness by combining multiple models, complementing deep learning techniques while remaining computationally efficient [
35,
38]. Additionally, hybrid approaches have integrated RF with feature selection or deep learning frameworks to improve detection of complex threats, such as Android malware and PowerShell-based attacks [
35,
38]. The detailed performance metrics for RF are provided in
Table 14,
Table 15 and
Table 16 below, highlighting key indicators across scenarios; each table reports results for one malware family, and the three scenario columns (Real vs. Benign; Real + Synthetic vs. Benign; Trained on Synthetic, Tested on Real vs. Benign) follow the same definitions given above for the MLP tables, including the 95% confidence intervals.
Under pure synthesis training, RF attains 0.6926 accuracy with perfect precision (1.0000) but low recall (0.3852), implying it learns a narrow set of synthetic patterns that do not cover a large portion of real BankBot variants.
Locker/SLocker remains the weakest in pure synthesis training (0.5005 accuracy; 0.0076 recall), indicating that synthetic Locker/SLocker samples are strongly mismatched to the real family (the model rarely flags real malware).
Airpush/StopSMS again yields the best pure synthesis transfer (0.8589 accuracy; 0.7191 recall) while keeping precision near-perfect (0.9982), consistent with this family benefiting most from the synthetic generation regime and producing more usable synthetic variety than the other families.
4.7. Discussion
Before training, we applied an identical pre-processing pipeline to both real and synthetic records, including removal of identifier and metadata fields, coercion to numeric types with None values imputed as zero for component-count features, sparsity-based feature pruning (zero-ratio > 70%), and feature standardization using StandardScaler within a unified training pipeline. For synthetic records, we additionally performed post-processing to enforce schema conformity (fixed feature set), remove duplicates, and discard malformed outputs prior to model fitting. These controls help ensure that the performance differences reported below primarily reflect distributional alignment between synthetic and real feature vectors, rather than artifacts of formatting or feature availability.
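The shared pre-processing steps above can be sketched as follows. The column names are hypothetical placeholders, and the exact KronoDroid schema is not reproduced here:

```python
# Sketch of the shared pre-processing pipeline (hypothetical column
# names; the real KronoDroid schema differs).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "sha256": ["a1", "b2", "c3", "d4"],      # identifier field -> dropped
    "nr_permissions": [5, None, 12, 7],      # component count; None -> 0
    "rare_syscall": [0, 0, 0, 1],            # zero-ratio 75% -> pruned (>70%)
    "total_syscalls": [310, 120, 980, 455],
})

# 1) Remove identifier/metadata fields.
df = df.drop(columns=["sha256"])
# 2) Coerce to numeric; impute None as zero for count features.
df = df.apply(pd.to_numeric, errors="coerce").fillna(0)
# 3) Sparsity-based pruning: drop features whose zero-ratio exceeds 70%.
zero_ratio = (df == 0).mean()
df = df.loc[:, zero_ratio <= 0.70]
# 4) Standardize the surviving features.
X = StandardScaler().fit_transform(df)
```

Applying the identical steps to real and synthetic records keeps the feature spaces aligned, so downstream score differences reflect distributional differences rather than formatting artifacts.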
The findings highlight both the promise and limitations of using LLM-generated synthetic data for malware detection. Models trained exclusively on real malware and benign data achieved exceptionally high detection performance across the BankBot, Locker, and Airpush families. For BankBot, multiple classifiers, including Logistic Regression and Random Forest, achieved perfect performance (Accuracy = 1.000, ROC AUC = 1.000, Precision = 1.000, Recall = 1.000, F1 = 1.000). Other classifiers, such as KNN and MLP, also produced near-perfect results, with accuracies above 0.99 and ROC AUC consistently >0.99. For Locker, performance was similarly strong, though slightly lower than for BankBot: across classifiers, test accuracy ranged between 0.9715 and 0.9857, with ROC AUC values of 0.986–0.998. For Airpush/StopSMS, real-only models were likewise strong across classifiers (Accuracy ≈ 0.976–0.979; ROC AUC ≈ 0.983–0.998). These results demonstrate that when sufficient real malware samples are available, classifiers achieve near-perfect discrimination between malicious and benign applications.
Augmenting real malware data with LLM-generated synthetic samples produced results that were also strong, but generally slightly lower than using real data alone. For BankBot, test accuracy for the best models ranged from 0.9911 to 0.9985, with ROC AUC between 0.9910 and 0.9998. For Locker, accuracies ranged from 0.9601 to 0.9751, with ROC AUC between 0.9606 and 0.9869. For Airpush/StopSMS, the performance remained excellent and very close to that of real-only (Accuracy ≈ 0.975–0.982; ROC AUC ≈ 0.982–0.997), again slightly below the real-only baseline. While these results remain strong, they consistently fall short of the near-perfect performance observed with real-only training. Importantly, the false-positive rate remained very low (typically <0.04), indicating that synthetic augmentation does not compromise specificity. However, the slight decline in accuracy and recall suggests that the synthetic data introduces some noise or distributional mismatch relative to the real malware, thereby diluting the effectiveness of the models compared to training exclusively on real samples.
When models were trained exclusively on synthetic data and then evaluated on real malware, performance degraded substantially. For BankBot, classifiers achieved moderate test accuracy (≈0.64–0.74) and ROC AUC values up to 0.98, but recall was consistently poor (≈0.29–0.49), indicating that the models missed a large proportion of real malware instances despite maintaining high precision (≈0.93–1.00). For Locker/SLocker, performance was near-random (Accuracy ≈ 0.50–0.57; ROC AUC often < 0.60) with extremely low recall, rendering these models impractical. In contrast, Airpush/StopSMS synthetic-only models generalized substantially better (Accuracy ≈ 0.80–0.89; ROC AUC ≈ 0.86–0.90; Recall ≈ 0.61–0.79), while maintaining very high precision (≈0.99). This improvement coincides with the larger family size and deeper LLM fine-tuning (150 samples, 3 epochs), indicating that synthetic-only generalization is sensitive to family characteristics and the fine-tuning regime.
Our results align with Chalé & Bastian, who report that synthetic augmentation can preserve performance, but purely synthetic training underperforms unless real data is retained, supporting the interpretation that synthetic data often reinforces existing statistical structure rather than introducing new predictive information [
19]. Conversely, our synthetic-only results contradict Rahman et al., who report strong synthetic-only intrusion detection; this discrepancy may reflect differences in domain (network flows vs. Android behavioral features), generator type (GAN vs. LLM), and the difficulty of capturing rare family-specific behaviors in a fixed hybrid feature space [
21].
Real-only training remains the benchmark, with near-perfect results across all three families. Real with synthetic augmentation preserves high performance but is consistently slightly below real-only. Synthetic-only training is family- and method-dependent: weak for Locker/SLocker, moderate for BankBot, and stronger for Airpush/StopSMS under a larger/deeper fine-tuning setup. These results suggest that improving synthetic fidelity via more fine-tuning data/epochs for larger families can materially narrow the synthetic-to-real generalization gap.
5. Conclusions and Future Work
This research set out to evaluate the feasibility of using LLM-generated tabular malware feature records to support Android threat detection. By positioning real-only detection accuracy as a benchmark, the study contextualized the effectiveness of both augmentation and synthetic-only training. The results show that while LLMs can generate structurally consistent malware records and provide meaningful augmentation, they do not yet achieve the realism or diversity needed to serve as a standalone data source. Notably, for Airpush/StopSMS, a larger family that we fine-tuned with 150 samples and 3 epochs, synthetic-only models achieved substantially higher generalization than for the other families, indicating that synthetic utility improves with family size and fine-tuning depth.
The novelty of this work is twofold. First, we cast LLM-based synthesis as a fixed-schema, structured record generation task for Android malware detection and evaluate its utility under a three-scenario protocol that cleanly separates augmentation benefits from synthetic-only transfer. Second, we quantify downstream detection effects across multiple malware families and classifiers, showing that augmentation can preserve strong performance while synthetic-only generalization remains family-dependent and sensitive to the fine-tuning regime, thereby delineating when LLM-generated tabular records are useful and when they are not yet sufficient as a standalone training source.
The contribution of this work lies less in raw performance metrics and more in the insights it provides for practice. First, it demonstrates that synthetic augmentation can be applied without undermining specificity, offering a practical way to enrich scarce datasets. Second, it reveals the fragility of synthetic-only training, underscoring the gap between generated records and operationally valid malware. Third, it highlights methodological considerations such as prompt design, post-processing, and validation, which are crucial when applying LLMs to structured cybersecurity data. These findings establish a foundation for future exploration of synthetic data as both a supplement and a research tool in security analytics.
The findings of the study suggest that LLM-generated tabular malware feature records can be a useful augmentation tool but should not yet be relied upon as a primary data source. Synthetic augmentation can increase training volume while largely preserving detection specificity, but its ability to improve performance in truly low-data regimes remains to be evaluated. However, deploying classifiers trained exclusively on synthetic data would be premature, as current LLM outputs do not generalize reliably to real-world threats. In practice, synthetic malware is best applied as a complement to, rather than a substitute for, real data, supporting tasks such as adversarial training, red teaming, and benchmarking in environments where access to sensitive datasets is limited.
Future work will systematically vary fine-tuning factors such as sample count, epochs, and prompt design across malware families to measure their impact on synthetic fidelity and real-world generalization. By scaling dataset size and prompt diversity, the goal is to reduce distribution drift, improve recall, and avoid increased false positives. Planned experiments will assess fidelity using statistical tests and evaluate downstream model performance, especially on rare families. The work will also explore more capable, open-weight LLMs with efficient adapters to generate larger datasets and consider adversarial or sandbox-based pipelines for richer synthetic behavior.