Article

Addressing Data Scarcity in Malware Classification via Pixel-Level Synthetic Image Generation

by Mounika Krishna Teja Karumudi and Fabio Di Troia *
Department of Computer Science, San Jose State University, San Jose, CA 95192, USA
* Author to whom correspondence should be addressed.
Electronics 2026, 15(13), 2848; https://doi.org/10.3390/electronics15132848
Submission received: 24 May 2026 / Revised: 24 June 2026 / Accepted: 27 June 2026 / Published: 30 June 2026
(This article belongs to the Special Issue AI in Cybersecurity, 3rd Edition)

Abstract

Deep learning-based malware classification using image representations has emerged as a highly effective paradigm for threat detection. However, training robust neural networks is frequently bottlenecked by data scarcity and severe class imbalances in real-world repositories. This study investigates the viability of using an autoregressive PixelCNN framework to synthesize high-fidelity, class-specific malware images to augment limited training distributions. Using the benchmark Malimg dataset, we systematically evaluate a Convolutional Neural Network (CNN) classifier across varying ratios of synthetic-to-authentic data under strict data scarcity constraints (ranging from 10 to 80 authentic samples per family). Our experimental results reveal that while PixelCNN replicates intricate, byte-level micro-textures, classifiers trained exclusively on synthetic data suffer catastrophic performance degradation, with accuracy falling as low as 3%. Crucially, however, the introduction of a minimal authentic data anchor (15% to 20%) restores functional decision boundaries, raising classification accuracy to as much as 72%. Furthermore, performance saturates rapidly once the training set reaches a 50/50 synthetic-to-authentic split, achieving up to 82% classification accuracy, which is competitive with the 89% accuracy upper bound of a fully authentic baseline. These findings demonstrate a high degree of data efficiency, indicating that generative autoregressive augmentation can roughly halve the authentic data collection burden in cybersecurity workflows, provided a small real-world anchor is preserved.

1. Introduction

Malware remains a pervasive threat to digital ecosystems, with the volume, velocity, and sophistication of attacks continuing to rise. Recent cybersecurity threat indices describe sharp increases in corporate ransomware activity and zero-day exploitation, alongside an influx of millions of previously unrecorded, highly evasive polymorphic variants targeting enterprise infrastructures annually [1,2]. As these threats evolve across multi-architecture environments, manual reverse engineering and traditional analysis of malicious binaries become impractical at scale. This creates an urgent demand for automated, highly scalable classification techniques that remain resilient against rapid structural mutations and intentional code obfuscation [3,4].
Historically, malware detection has relied on two primary paradigms: signature-based detection and machine learning (ML) approaches [5], including deep learning (DL) architectures [6]. Inspired by the landmark breakthroughs of Convolutional Neural Networks (CNNs) in computer vision, researchers have successfully adapted these models for cybersecurity by transforming raw executable binary byte sequences into 2D grayscale images. These image-based approaches frequently outperform traditional static and dynamic analysis in both macroscopic classification accuracy and computational throughput, as they effectively leverage the spatial correlation and structural topology inherent to compiled code arrays without requiring code execution or unpacking [7,8].
The underlying efficacy of these visual classification frameworks is rooted in the fact that malware families often exhibit highly distinct texture patterns and structural symmetries due to code reuse, shared libraries, and modular development paradigms [9]. As illustrated in Figure 1, instances within a single family (e.g., FakeRean) maintain an exceptionally high degree of visual consistency, whereas samples originating from distinct families exhibit marked macroscopic structural deviations (see Figure 2). However, while standard deep CNNs excel at capturing global, macro-level structural features, they often fail to perceive localized, pixel-level cross-dependencies. This limitation becomes highly problematic when analyzing sophisticated, modern variants that utilize packing or localized encryption techniques specifically designed to smooth out global visual feature representations [10].
To address these structural limitations, and to mitigate the pervasive, real-world challenge of severe class imbalances and limited labeled samples in threat repositories, this work explores the utility of the Pixel Convolutional Neural Network (PixelCNN). Unlike adversarial generative architectures or standard feedforward CNNs, PixelCNN operates as a strict autoregressive model that analyzes and synthesizes images one pixel at a time, conditioning each step explicitly on all previously generated pixel values [11]. This architectural property allows the model to learn the fine-grained joint probability distribution and replicate the intricate, byte-level “micro-textures” of functional malware code. Consequently, this research systematically evaluates PixelCNN’s capacity to synthesize high-fidelity, class-specific malware samples that augment training data distributions, and measures the effect of that augmentation on a separate downstream classifier under strict data constraints.
Utilizing the foundational, highly benchmarked Malimg dataset, which comprises 9339 samples across 25 distinct threat families [9], this paper comprehensively demonstrates how generative autoregressive augmentation can structurally reinforce downstream classification resilience. We specifically investigate a critical frontier in generative cybersecurity workflows: whether synthetic samples can successfully replicate authentic distribution boundaries to defend classifiers against catastrophic performance degradation, thereby determining the minimum real-world anchor required to train highly accurate networks under severe data scarcity.
The remainder of this paper is organized as follows: Section 2 reviews the state of the art in generative modeling and image-based malware analysis. Section 3 details the mathematical mechanics of the PixelCNN architecture and our proposed augmentation framework. Section 4 presents our experimental setup, evaluation metrics, and empirical results, followed by concluding remarks and future research vectors in Section 5.

2. Related Work

In response to the growing challenge of malware identification and classification, researchers have increasingly explored image analysis combined with machine learning to design more effective detection systems [12]. This section reviews key studies that have shaped the development of image-based malware classification techniques and identifies current research gaps that this work aims to address.
One of the earliest contributions in this area was presented by Nataraj et al. [9], who introduced a framework for converting malware binaries into grayscale images. Rather than analyzing raw binary sequences, the authors extracted GIST texture features from these images and classified them using the k-nearest neighbors (k-NN) algorithm. Their model successfully distinguished among 25 malware families, establishing an early foundation for image-based malware analysis. Building on this work, Yajamanam et al. [13] applied convolutional neural networks (CNNs) to the same dataset, achieving even higher performance in malware detection. Similarly, in a more recent study [14], the authors employed a comparable deep learning approach, further reinforcing the view that automated feature learning substantially improves accuracy over handcrafted representations. Furthermore, researchers have expanded these mapping techniques, transforming executable binaries into grayscale, plasma colormaps, and structural Portable Executable (PE) file entropy layouts, to optimize how neural networks perceive varying structural threat patterns across diverse hardware platforms [15].
The field has recently shifted toward generative modeling to overcome the scarcity of labeled malware data. Generative Adversarial Networks (GANs) have been widely adopted for this purpose. For instance, MalGAN was proposed to generate adversarial malware examples that can bypass detection systems while maintaining functional integrity [16]. Similarly, researchers have utilized GAN-based architectures to augment training sets, improving the resilience of classifiers against previously unseen variants [17]. To combat severe dataset imbalances across rare malware families, conditional variants and Deep Convolutional GANs (DCGANs) have been deployed to equalize class distributions prior to classifier training. However, these systems still struggle with complex structural transformations in evolving threat environments, prompting hybrid attempts that model grayscale image distributions in rigid, sequential, pixel-by-pixel patterns to preserve critical functional and structural boundaries [18].
More recently, Denoising Diffusion Probabilistic Models (DDPMs), or Diffusion models, have emerged as a powerful alternative for synthetic data generation. Unlike GANs, which can suffer from training instability and mode collapse, Diffusion models learn to reverse a gradual noise process to generate high-fidelity images [19]. Recent studies have begun exploring these models for cybersecurity applications, noting their ability to produce more diverse and structurally complex samples compared to traditional generative frameworks [20].
However, despite the success of GANs and Diffusion models, a specific challenge persists: capturing the sequential, pixel-level dependencies that define the “micro-textures” of malware code. While GANs focus on global distribution and Diffusion models on iterative refinement, they can sometimes overlook the fine-grained structural nuances essential for distinguishing closely related malware families. This limitation highlights the need for autoregressive models capable of high-fidelity, pixel-by-pixel synthesis.
PixelCNN provides a promising direction in this regard. By modeling the distribution of image pixels sequentially, it can capture intricate local dependencies and reproduce detailed visual textures [11]. Crucially, the mathematical stability of maximizing log-likelihood via PixelCNN has already demonstrated immense viability in other highly skewed, unbalanced classification domains, proving highly effective at capturing complex high-dimensional textures where traditional GANs undergo training instability or mode collapse [21].
Beyond traditional convolutional networks, the state of the art in generative modeling has increasingly shifted toward Transformer-based architectures that operate on discrete visual tokens. Pioneered by frameworks like Vector Quantized Variational Autoencoders (VQ-VAE) [22], Image Transformers [23], and subsequent autoregressive synthesis pipelines like VQ-GAN [24], these models treat image generation similarly to sequence-to-sequence language modeling. Instead of predicting raw pixel intensities directly, these modern paradigms compress high-dimensional visuals into a discrete quantized codebook space, leveraging a Transformer decoder to handle long-range structural dependencies. While these token-based vision transformers offer superior global semantic coherence over complex, large-scale datasets, their multi-billion parameter footprints require massive source distributions to avoid severe manifold instability. Consequently, under conditions of extreme data starvation, the localized, lean autoregressive modeling properties of PixelCNN remain uniquely advantageous for capturing the rigid, translation-invariant byte blocks characteristic of binary malware imagery.
However, a persistent vulnerability in generative data augmentation frameworks is the potential divergence between visual fidelity and downstream classification utility. Prior benchmarks in computer vision have demonstrated that synthetic samples minimizing standard distribution distance metrics can still induce catastrophic performance degradation when used to train deep classifiers in complete isolation [25]. This vulnerability underscores a critical, under-explored threshold in cybersecurity workflows: the minimum anchor of authentic data required to stabilize decision boundaries when relying heavily on autoregressive synthesis. This work aims to bridge the gap between generative modeling and classification by evaluating PixelCNN’s ability to produce realistic synthetic data that mirrors genuine malware distributions, thereby identifying a workable balance between synthetic expansion and real-world anchoring even under severe data scarcity constraints.

3. Methodology

This section details the Malimg dataset and the architectural components of the PixelCNN framework, including masked convolutional layers, residual blocks, and the 1 × 1 convolutional classifier. It further describes the autoregressive training strategy employed to generate synthetic malware samples.

3.1. Dataset

This research utilizes the Malimg dataset, a benchmark collection comprising 9339 malware samples across 25 distinct families [9,26]. The images in this dataset are generated by reading the raw malware binary files byte-by-byte and mapping each individual byte directly to a grayscale pixel intensity value ranging from 0 (black) to 255 (white). The resulting visual representations organize these sequential byte values into fixed-width matrix structures to highlight the spatial correlations of the code. The dataset is characterized by a significant class imbalance, providing a realistic scenario for evaluating the generative model’s ability to augment minority classes. The distribution of samples per family is detailed in Table 1.
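The byte-to-pixel mapping described above can be sketched in a few lines. This is a minimal illustration rather than the dataset authors' conversion tool: the function name `bytes_to_image`, the `width` parameter, and the decision to drop a trailing partial row are all assumptions made for the example.

```python
def bytes_to_image(data: bytes, width: int) -> list[list[int]]:
    """Arrange raw bytes into rows of `width` grayscale pixels (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Assumption: drop a trailing partial row so every row has `width` pixels.
    usable = len(data) - (len(data) % width)
    return [list(data[i:i + width]) for i in range(0, usable, width)]


# Hypothetical 10-byte "binary"; a real input would be a whole executable file.
sample = bytes([0, 255, 77, 90, 144, 0, 3, 0, 0, 0])
image = bytes_to_image(sample, width=4)
print(image)  # [[0, 255, 77, 90], [144, 0, 3, 0]]
```

Each byte already lies in the 0 to 255 range, so no rescaling is needed to interpret it as a grayscale intensity; only the row width has to be chosen.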
To mitigate the severe class imbalance inherent to the Malimg dataset (where class support ranges from 80 to 2949 samples), a strict class-balanced sampling strategy was enforced during the training phase. Rather than using naive random sampling, mini-batch generation utilized a stratified, weighted sampling mechanism. This approach dynamically adjusted selection probabilities to ensure that every malware family possessed an equal statistical probability of being represented in any given training batch, preventing the downstream CNN from developing an operational bias toward majority classes like Allaple.A. Furthermore, the synthetic augmentation pipeline itself acted as an algorithmic balancer; by generating uniform sample counts across all classes within our targeted scarcity baselines (80, 40, 20, and 10 samples), the framework regularized the uneven empirical distribution, stabilized macro metrics, and prevented minority classes from suffering feature suppression.
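One common way to realize class-balanced sampling of this kind is inverse-frequency weighting, sketched below. The paper does not specify its exact mechanism, so this is an assumption: the helper `balanced_weights` and the placeholder label `minority_family` are hypothetical, and only the class sizes 2949 and 80 come from the text.

```python
import random
from collections import Counter


def balanced_weights(labels: list[str]) -> list[float]:
    """Per-sample weights giving every class the same total selection probability."""
    counts = Counter(labels)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[label]) for label in labels]


# Toy label list built from the two extreme class sizes quoted above.
labels = ["Allaple.A"] * 2949 + ["minority_family"] * 80
weights = balanced_weights(labels)

rng = random.Random(0)
batch = rng.choices(labels, weights=weights, k=32)  # sampling with replacement
print(Counter(batch))
```

With these weights, each family contributes the same expected number of samples per batch regardless of how many images it has, at the cost of repeating minority-class images more often.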

3.2. PixelCNN Architecture

To address the challenge of dataset scarcity, this study utilizes PixelCNN, a powerful generative model belonging to the family of autoregressive networks [27]. To understand the mechanics of PixelCNN, it is helpful to contrast autoregression with standard linear regression. In traditional multiple regression, a dependent variable $y$ is predicted using a set of distinct, independent variables (e.g., $A$, $B$, $D$) via a linear combination:
$$y = m_1 A + m_2 B + m_3 D + \cdots + C$$
where $A$, $B$, and $D$ are independent predictor variables, $m_1$, $m_2$, and $m_3$ are their respective learned coefficients weighting the contribution of each predictor, and $C$ is a constant intercept term.
In contrast, an autoregressive model assumes that the present value of a variable, $y_t$, is directly dependent upon its own historical sequence:
$$y_t = m_1 y_{t-1} + m_2 y_{t-2} + m_3 y_{t-3} + \cdots + C$$
where $y_t$ denotes the value of the variable at the current time step $t$, $y_{t-1}$, $y_{t-2}$, and $y_{t-3}$ denote its values at the preceding time steps, $m_1$, $m_2$, and $m_3$ are the learned coefficients assigned to each lagged term, and $C$ is a constant intercept.
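As a small worked instance of the autoregressive form, the sketch below predicts the next value from the three most recent ones. The function `ar_predict` and all numeric values are illustrative assumptions, not quantities from this study.

```python
def ar_predict(history: list[float], coeffs: list[float], intercept: float) -> float:
    """Predict y_t from the most recent len(coeffs) values of `history`.

    coeffs[0] weights y_{t-1}, coeffs[1] weights y_{t-2}, and so on.
    """
    if len(history) < len(coeffs):
        raise ValueError("not enough history for the requested lag order")
    lags = history[::-1][:len(coeffs)]  # y_{t-1}, y_{t-2}, ...
    return sum(m * y for m, y in zip(coeffs, lags)) + intercept


# Illustrative numbers only: 0.5*4 + 0.25*2 + 0.125*1 + 1 = 3.625
print(ar_predict([1.0, 2.0, 4.0], coeffs=[0.5, 0.25, 0.125], intercept=1.0))  # 3.625
```

The only inputs are earlier values of the same series, which is the property PixelCNN carries over to pixels.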
PixelCNN adapts this sequential forecasting logic to the spatial domain of image generation. By framing an image as a flattened sequence of pixels processed in a row-by-row, pixel-by-pixel fashion, PixelCNN predicts the probability distribution of each new pixel conditioned strictly on the known values of all previously generated pixels. Conceptually, this spatial conditioning mirrors how recurrent architectures process temporal sequences [27]. By explicitly learning the conditional distribution of each pixel relative to its predecessors, PixelCNN captures fine-grained, byte-level micro-textures that standard convolutional architectures often overlook. This complete system framework is illustrated in Figure 3.
To maintain the autoregressive property, that is, ensuring that a pixel’s prediction depends only on previously observed data, PixelCNN utilizes masked convolutional filters. These filters restrict the receptive field to pixels in the rows above the target pixel and to pixels to its left in the same row [11,28]. The model employs two mask types:
  • Type A Mask: Applied exclusively to the initial layer to exclude the center pixel, ensuring the network does not “see” the value it is tasked with predicting.
  • Type B Mask: Applied to all subsequent layers to allow the center pixel (which now contains intermediate feature information) to contribute to deeper computations.
To facilitate the training of deeper networks, the architecture incorporates residual blocks. These blocks utilize shortcut connections to mitigate vanishing gradient issues and stabilize the learning process [28]. Within each block, a bottleneck structure is used: the channel dimension is first reduced, a masked convolution is applied, and the original dimensionality is restored (see Figure 4).
Following the feature extraction layers, 1 × 1 convolutions (pointwise convolutions) are employed. These layers act as per-pixel fully connected networks, allowing the model to learn complex cross-channel relationships without violating spatial constraints. The architecture concludes with a softmax classifier, which outputs a discrete probability distribution over the 256 possible intensity values for each pixel.

3.3. Autoregressive Training and Generation

The training objective is to model the joint distribution of the malware image $x$ as a sequence of conditional probabilities, following a raster-scan order (row-by-row, pixel-by-pixel). The probability of an $n \times n$ image is defined as:
$$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$
where $x$ denotes the full malware image, $n \times n$ is the spatial resolution of the image yielding a total of $n^2$ pixels, $x_i$ is the intensity value of the $i$-th pixel in raster-scan order, and $x_1, \ldots, x_{i-1}$ represents the set of all previously observed pixels that condition the prediction of $x_i$.
During the training phase, input pixel values are normalized to the $[0, 1]$ range via $\hat{x}_i = x_i / 255$, where $x_i \in \{0, 1, \ldots, 255\}$ is the original integer pixel intensity. This scaling ensures that input features lie within a consistent numerical range, promoting stable gradient updates. The network outputs a discrete 256-channel softmax distribution over the possible pixel intensities, and the model is optimized using categorical cross-entropy loss against the original integer targets. Once trained, the generation process is iterative: the model predicts a probability distribution for the first pixel, samples a discrete value, and then uses that value as context for predicting the next pixel. This sequential dependency allows the model to synthesize high-fidelity malware images that maintain the structural logic of the original malware families.
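The factorization and the sampling loop can be illustrated without a neural network. In the sketch below, `toy_conditional` is a hypothetical stand-in for the trained PixelCNN (it simply favours repeating the previous pixel); the point is only to show how the log-likelihood sums over conditionals and how each sampled pixel becomes context for the next.

```python
import math
import random

LEVELS = 256  # number of discrete intensities, as in the softmax described above


def toy_conditional(previous: list[int]) -> list[float]:
    """Hypothetical stand-in for the network: p(x_i | x_1, ..., x_{i-1}).

    Puts 90% of the mass on repeating the last pixel and spreads 10% uniformly.
    """
    probs = [0.1 / LEVELS] * LEVELS
    anchor = previous[-1] if previous else 0
    probs[anchor] += 0.9
    return probs


def log_likelihood(pixels: list[int]) -> float:
    """log p(x) = sum over i of log p(x_i | x_1, ..., x_{i-1}) in raster-scan order."""
    return sum(math.log(toy_conditional(pixels[:i])[x]) for i, x in enumerate(pixels))


def sample_image(n: int, rng: random.Random) -> list[list[int]]:
    """Draw an n x n image one pixel at a time, feeding each draw back as context."""
    flat: list[int] = []
    for _ in range(n * n):
        probs = toy_conditional(flat)
        flat.append(rng.choices(range(LEVELS), weights=probs, k=1)[0])
    return [flat[r * n:(r + 1) * n] for r in range(n)]


image = sample_image(4, random.Random(0))
flat = [p for row in image for p in row]
print(image)
print("mean negative log-likelihood per pixel:", -log_likelihood(flat) / len(flat))
```

The mean negative log-likelihood per pixel printed at the end is the same quantity that the categorical cross-entropy loss measures during training.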
While autoregressive models like PixelCNN possess a theoretical dual capability, namely, generating synthetic imagery and performing direct classification via class conditional likelihood estimation, evaluating PixelCNN as a standalone classifier falls outside the scope of this study. In this framework, PixelCNN is utilized strictly as an upstream generative model tasked with alleviating extreme data scarcity. The downstream classification task is entirely delegated to a dedicated, separate CNN classifier optimized specifically for discriminative feature extraction.

4. Experiments and Results

This section elaborates on the hyperparameter optimization and training phases of the PixelCNN generative model, followed by a comprehensive evaluation of the downstream classification performance when utilizing varying ratios of synthetic and authentic malware images.

4.1. PixelCNN Training and Optimization

The training procedure for the PixelCNN network was conducted in sequential phases to optimize image fidelity and stabilize the loss function. The initial baseline architecture yielded a high training loss of approximately 3.5. To improve convergence, an initial hyperparameter exploration was implemented: the learning rate was reduced from 0.01 to 0.001 to ensure smoother gradient descent, and the batch size was lowered from 128 to 32 to provide more frequent, stable weight updates.
Additionally, adjusting the target color depth had a significant effect on the loss. Lowering the number of pixel intensity levels from 32 to 8 brought the loss down to approximately 2.0. Further reductions to 4 and 2 levels lowered the loss to 1.28 and 0.637, respectively, but resulted in severe visual degradation and black artifacting. Consequently, a pixel intensity quantization level of 8 was retained to balance visual structural fidelity against loss minimization.
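When comparing these loss values, it may help to keep in mind that the cross-entropy of a uniform guess over K levels is ln K, so the achievable range itself shrinks as levels are removed. The sketch below computes that reference point for each setting. It assumes the reported losses are in nats (natural logarithm), which the text does not state.

```python
import math


def uniform_cross_entropy(levels: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over `levels` intensities."""
    return math.log(levels)


# Loss values quoted in the text for each number of intensity levels.
reported_loss = {8: 2.0, 4: 1.28, 2: 0.637}

for levels, loss in reported_loss.items():
    baseline = uniform_cross_entropy(levels)
    print(f"{levels} levels: reported {loss:.3f}, uniform baseline {baseline:.3f}, "
          f"ratio {loss / baseline:.2f}")
```

Under that assumption, a lower raw loss at 4 or 2 levels does not by itself indicate a better generator, which is consistent with the visual degradation reported above.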
Following this initial tuning phase, the final optimized generation quality was achieved by scaling the network capacity to the following configuration: an image size of 64 × 64 pixels, 8 pixel quantization levels, 256 filter channels, 7 residual blocks, and a micro-batch size of 2. Under this specialized high-capacity regime, the cross-entropy loss started at 1.845 in Epoch 1 and successfully converged to a stable floor of 1.422 by Epoch 366 (see Figure 5).
To ground these selection choices and ensure reproducibility, the broader hyperparameter exploration bounds and structural grid search parameters examined prior to final model locking are documented in Table 2. This search space was systematically evaluated against a dedicated 15% validation split of the training data.
During the classification phase, convergence trajectories for the downstream CNN classifier were monitored via validation loss curves, with training terminating once the categorical cross-entropy loss hit a stable asymptotic floor (typically by Epoch 60) to protect the model from overfitting to the scarce training assets.

4.2. Class-Specific Image Generation

To resolve early issues where generalized generation failed to capture specific structural families, the PixelCNN model was transitioned to class-specific training. Utilizing an NVIDIA A100 GPU, separate models were trained for each of the 25 malware families. The input resolution was also increased to 128 × 128, resulting in highly distinct, class-specific synthetic malware samples (see Figure 6).

4.3. Impact of Source Data Volume on Generator Quality

Before evaluating mixed datasets, we first assessed how the volume of initial authentic data impacts PixelCNN’s generation quality. Separate PixelCNN models were trained using 80, 40, 20, and 10 authentic samples per family. In all conditions, the held-out real images used for final evaluation were drawn from the same fixed test partition, defined prior to any training, ensuring a consistent and comparable evaluation benchmark across all source data volumes. Each model was then used to generate a uniform baseline of 100 synthetic samples per family. Subsequent classification models were trained entirely on these synthetic samples and evaluated against the remaining held-out real images. The classification accuracy decayed severely as the PixelCNN source data was reduced: models trained on the 80-sample PixelCNN generation achieved 28% accuracy, the 40-sample generation achieved 4%, and both the 20- and 10-sample generations collapsed to 3%. Figure 7 visually confirms that PixelCNN’s ability to generate useful, generalizing data is highly dependent on the size of its initial authentic training set.

4.4. Classification Performance with Synthetic Augmentation

With high-fidelity, class-specific synthetic data available, a rigorous experimental framework was designed to test the viability of supplementing authentic training data with synthetic data.
Four primary training baselines were established based on the number of available images per family: 80, 40, 20, and 10. Across each baseline, a series of experiments progressively shifted the ratio of the training data from 100% Synthetic to 100% Authentic. All models were subsequently evaluated against the designated test split of real malware images.
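A blended training set for one family at a given ratio could be assembled as sketched below. The helper `mix_training_set` is hypothetical, and holding the per-family total fixed while only the real fraction changes is an assumption about the protocol, not a detail confirmed by the text.

```python
import random


def mix_training_set(real: list, synthetic: list, total: int,
                     real_fraction: float, rng: random.Random) -> list:
    """Sample `total` items for one family with the requested share of real images."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    n_real = round(total * real_fraction)
    n_synth = total - n_real
    if n_real > len(real) or n_synth > len(synthetic):
        raise ValueError("not enough images to satisfy the requested split")
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)  # avoid ordering the set by source
    return mixed


# Placeholder identifiers stand in for image arrays.
real = [("real", i) for i in range(80)]
synthetic = [("synthetic", i) for i in range(100)]
subset = mix_training_set(real, synthetic, total=80, real_fraction=0.2,
                          rng=random.Random(0))
print(sum(1 for kind, _ in subset if kind == "real"), "real of", len(subset))  # 16 real of 80
```

Sweeping `real_fraction` from 0.0 to 1.0 for each baseline size would reproduce the shape of the experimental grid described above.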

4.5. Computational Efficiency and Hardware Environment

To evaluate the practical deployment feasibility of the framework, the training and generation durations were logged under a high-performance cloud infrastructure profile. All computational pipelines were executed within a Google Colab environment backed by an NVIDIA A100 Tensor Core GPU (available with 40 GB or 80 GB of high-bandwidth memory) and a multi-core cloud CPU allocation.
The computational overhead for both processing phases is detailed as follows:
  • Model Training Duration: Capitalizing on the tensor acceleration of the A100 architecture, the upstream generative PixelCNN requires approximately 1.5 to 2.5 s per epoch when optimizing on the restricted data baseline under the final high-capacity configuration (256 channels, 7 residual blocks). Consequently, the complete 366-epoch optimization run finishes execution in roughly 10 to 15 min. The downstream CNN classifier converges almost instantly, completing its training phase in under 60 s.
  • Sample Generation Throughput: Due to the sequential, pixel-by-pixel sampling inherent to autoregressive architectures, generation time scales linearly with the number of pixels, and therefore quadratically with image side length. For a target resolution of 64 × 64 pixels under 8 quantization levels, the runtime generation cost is approximately 0.05 to 0.10 s per fully synthesized malware sample on the A100. Producing a complete synthetic reinforcement batch of 150 variant images requires at most about 15 s of continuous execution.
These benchmarks indicate that, with modern cloud hardware acceleration, the computational cost of autoregressive generation is not a practical bottleneck at this resolution, supporting the framework’s viability for automated data augmentation workflows.
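The totals quoted in this subsection follow directly from the per-step figures, as the short check below shows. It uses only the numbers given above; the helper `duration_range` is introduced for illustration.

```python
def duration_range(count: int, seconds_low: float, seconds_high: float) -> tuple[float, float]:
    """Total duration bounds, in seconds, for `count` sequential steps."""
    return count * seconds_low, count * seconds_high


# 366 epochs at 1.5 to 2.5 s each; 150 samples at 0.05 to 0.10 s each.
train_low, train_high = duration_range(366, 1.5, 2.5)
gen_low, gen_high = duration_range(150, 0.05, 0.10)

print(f"training: {train_low:.0f} to {train_high:.0f} s "
      f"(about {train_low / 60:.1f} to {train_high / 60:.1f} min)")
print(f"generation: {gen_low:.1f} to {gen_high:.1f} s")
```

The training bounds of 549 s and 915 s correspond to roughly 9 to 15 minutes, and the generation upper bound sits right at 15 s, so both quoted figures are consistent with the per-step measurements.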

4.6. Experimental Results Summary

The global classification accuracies across all baselines and split ratios are consolidated in Table 3.
The most notable observation across all four baselines is the failure of the models when trained exclusively or nearly exclusively on synthetic data (0% to 10% real data). Despite PixelCNN’s ability to replicate the visual “texture” of malware, the classifiers require a minimum of 15% to 20% real data to achieve a functional baseline (ranging from 61% to 72%). This suggests that while synthetic images provide excellent spatial coverage, authentic samples act as the critical “ground truth” anchors required for the CNN to learn specific decision boundaries between similar families.
Minor non-monotonic fluctuations are observable across specific evaluation vectors in Table 3 (for instance, the isolated performance drop at the 30% Gen/70% Real split ratio within the 80-sample baseline). These localized variances are attributed to a combination of mini-batch stochastic optimization noise and data subset selection bias, which are naturally pronounced under extreme data starvation constraints. Because the authentic baseline sets are restricted to tiny seed counts (e.g., 10 to 80 samples), minor differences in the underlying structural diversity of the randomized cross-validation folds can cause small shifts in the downstream classifier’s gradient paths during the final training epochs. Rather than indicating a breakdown in the overarching generative scaling trends, these subtle micro-fluctuations reflect standard convergence variances inherent to deep network optimization over highly restricted data distributions.
As shown in Table 3 and visualized in Figure 8, performance begins to saturate as the ratio approaches 50% real data. For example, in the 80-sample baseline, the accuracy gain between 50% real data (82%) and 100% real data (89%) is only 7 percentage points. This indicates a high degree of data efficiency: with PixelCNN augmentation, comparable accuracy can be achieved while significantly reducing the burden of authentic data collection.
The detailed classification reports for all experiments (see Appendix A) reveal that certain families, such as Allaple.A and Allaple.L, were successfully identified even with high synthetic ratios. This is likely due to their distinct, repetitive structural patterns which PixelCNN models with high fidelity. Conversely, families like Swizzor.gen!E and Swizzor.gen!I remained difficult to classify until the real-world data ratio increased, indicating that these families possess subtle, non-repetitive features that are much harder to synthesize accurately.
Figure 9 illustrates the model’s prediction confidence evolution at three key stages of the synthetic-to-real ratio.
Detailed classification reports for the splits from 30% to 90% real data, alongside their corresponding visualization figures, are included in Appendix A to provide a granular view of performance fluctuations across minority classes.

4.7. Comparative Analysis Against Generative Baseline Modalities

To fully evaluate the comparative utility of the proposed PixelCNN data augmentation matrix under extreme data constraints, we contextualize our performance against both a representative GAN-based framework [17] and a state-of-the-art Denoising Diffusion pipeline [20] found in recent malware literature.
In the adversarial domain, Trehan and Di Troia [17] utilized Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) to synthesize 1D sequential opcode embeddings, reporting multi-class family classification accuracy ranging from 70.0% to 82.0% across major families like Zbot using Random Forest engines. In the diffusion domain, Bao et al. [20] leveraged Natural Language Processing (NLP) tokenization combined with a modified generative Diffusion model to augment minority threat classes, ultimately achieving a downstream classification accuracy of 96.0%. Both baseline frameworks operate on 1D sequential feature spaces (tokenized opcode sequences and structural embeddings), whereas our architecture synthesizes 2D spatial byte matrices evaluated directly via a deep Convolutional Neural Network (CNN). As a result, a direct algorithmic replication on identical data splits is unfeasible. Instead, our work isolates downstream evaluation metrics across progressive synthetic-to-authentic blending ratios under identical class-scarcity constraints (10, 20, 40, and 80 samples per class).
The downstream evaluation engine used to validate our generated data matrices is a deep convolutional architecture consisting of progressive visual feature extraction blocks, Conv2D(64, 3x3) → MaxPooling2D → Conv2D(128, 3x3) → MaxPooling2D → Conv2D(256, 3x3) → GlobalAveragePooling2D, terminating in a Dense softmax layer configured to the 25 target threat families.
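As a rough check on model size, the layer stack described above can be tallied by hand. The sketch below assumes a single-channel grayscale input, bias terms on every layer, and a 25-way dense output fed by the 256 pooled feature maps. None of these details is stated explicitly in the text, and the helper names are illustrative.

```python
# Illustrative parameter tally for the classifier described above.
# Assumptions (not stated in the paper): 1-channel grayscale input,
# biases enabled, pooling layers carry no trainable weights.

def conv2d_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter for a square-kernel Conv2D."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(in_features, units):
    """Weights plus one bias per output unit."""
    return (in_features + 1) * units

def classifier_param_count(input_channels=1, num_classes=25):
    filters = [64, 128, 256]  # Conv2D widths named in the text
    total, channels = 0, input_channels
    for f in filters:
        total += conv2d_params(channels, f)
        channels = f
    # GlobalAveragePooling2D reduces each of the 256 maps to one value.
    total += dense_params(channels, num_classes)
    return total

print(classifier_param_count())  # 376089 under the assumptions above
```

Under these assumptions the classifier is small (well under half a million trainable parameters), which is consistent with its being trained on only tens of samples per class.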
When evaluating the data metrics in Table 3 against these broader generative trends, several critical structural insights emerge. Traditional adversarial pipelines often undergo severe mode collapse or require extensive text corpora to achieve stability. Meanwhile, diffusion-based frameworks (e.g., Bao et al. [20]) provide exceptional global distribution alignment but rely heavily on complex NLP pre-processing pipelines to extract semantic context before generation.
Conversely, our pure synthetic autoregressive pipeline (100% Synthetic/0% Real) exhibits a complete failure of downstream task utility, collapsing to an accuracy of 3.0% under extreme scarcity constraints (10 to 20 samples). This result points to a key structural distinction: while autoregressive models are trained to minimize negative log-likelihood and therefore reproduce realistic, localized visual micro-textures, purely synthetic generations lack the global structural constraints required to establish independent downstream decision boundaries.
Crucially, however, the introduction of a minor authentic data anchor largely reverses this degradation. With a small fraction of authentic samples (the 80% Synthetic/20% Real split), classification accuracy climbs to 69% at the extreme floor of just 10 authentic samples per class and reaches 71% to 72% as baseline data availability scales. Once a balanced 50/50 synthetic-to-authentic split is reached, our framework achieves an accuracy of 80% to 82% across all scarcity levels (Table 3). This is comparable to the 70.0% to 82.0% range reported for the GAN-based baseline [17], though below the 96.0% reported for the diffusion pipeline [20], and it is obtained without intensive NLP code parsing or feature extraction pipelines. These findings indicate that while pure autoregressive synthesis cannot replace genuine datasets in isolation, a blended data matrix is a data-efficient alternative that roughly halves the real-world sample collection burden in cybersecurity workflows.
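One way to read the blending ratios is as a split of a fixed per-class sample budget. The sketch below takes that reading as an assumption; the function name and the whole-sample rule are illustrative and are not taken from the paper.

```python
from fractions import Fraction

def split_budget(samples_per_class, real_percent):
    """Return (synthetic, real) counts, or None if the split is not whole.

    Assumes the ratio divides a fixed per-class budget (hypothetical reading).
    """
    real = Fraction(samples_per_class * real_percent, 100)
    if real.denominator != 1:
        return None  # the ratio would need a fractional number of images
    real = int(real)
    return samples_per_class - real, real

for budget in (80, 40, 20, 10):
    print(budget, split_budget(budget, 20), split_budget(budget, 15))
```

Under this reading, a 15% real share of a 10-sample budget would call for 1.5 authentic images. That is at least consistent with the empty 85%/15% cell in the 10 Samples/Class column of Table 3, although the text does not state the reason for the gap.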

4.8. Summary and Discussion of Limitations

This study investigated the viability of utilizing PixelCNN-generated synthetic malware images to mitigate data scarcity in deep learning-based malware classification. By systematically evaluating a Convolutional Neural Network (CNN) trained on varying ratios of synthetic and authentic data across severe scarcity baselines (80, 40, 20, and 10 samples per class), we established clear boundaries for the efficacy of synthetic augmentation.
Our empirical findings demonstrate that while PixelCNN excels at replicating the spatial and textural patterns of malware, particularly for highly repetitive families like Allaple, synthetic data cannot operate in isolation. Models trained exclusively on synthetic samples experienced catastrophic performance degradation, with accuracy falling as low as 3%. However, the introduction of a minimal authentic dataset (15% to 20%) acting as “ground truth anchors” enabled the classifier to learn accurate decision boundaries, immediately elevating accuracy to functional levels (up to 72%). Most significantly, the results indicate a high degree of data efficiency: with synthetic augmentation, we achieved up to 82% accuracy using only half of the target authentic samples, compared to an 89% accuracy ceiling with a fully authentic dataset. This suggests that generative augmentation can reduce the real-world collection burden by up to 50% while remaining within several percentage points of the fully authentic baseline.
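The size of the trade-off can be read directly from Table 3. The short sketch below copies the 50%/50% and 0%/100% rows from that table and reports the shortfall for each scarcity baseline; the variable and function names are illustrative.

```python
# Accuracy (%) copied from Table 3; keys are samples per class.
balanced_50_50 = {80: 82, 40: 82, 20: 80, 10: 81}
fully_authentic = {80: 89, 40: 89, 20: 88, 10: 86}

def gap_to_baseline(blended, baseline):
    """Percentage-point shortfall and fraction of baseline accuracy retained."""
    return {
        n: (baseline[n] - blended[n], round(blended[n] / baseline[n], 3))
        for n in baseline
    }

for n, (gap, retained) in gap_to_baseline(balanced_50_50, fully_authentic).items():
    print(f"{n:>2} samples/class: -{gap} points, {retained:.1%} of baseline")
```

Across the four baselines, the balanced split trails the fully authentic configuration by 5 to 8 percentage points, retaining roughly 91% to 94% of its accuracy.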
Despite these clear data efficiencies, several core algorithmic and structural limitations merit acknowledgment:
  • Class Overlap and Structural Convergence: Pure image-based feature extraction faces an upper bound when distinguishing between structurally convergent variants. As observed in the Autorun.K family, extensive modular code reuse and common packing techniques create near-identical visual micro-textures. As a result, PixelCNN faithfully replicates these overlaps, compounding pre-existing distribution ambiguities.
  • Computational Scalability and Model Footprint: Training isolated, family-specific models yields an $O(M)$ scaling footprint that presents a deployment bottleneck for enterprise repositories with thousands of classes. To mitigate this, future iterations can transition to a unified Conditional PixelCNN using discrete class conditioning vectors, or a vector-quantized latent space system (VQ-VAE/VQ-GAN) to compress textures into a shared codebook, reducing the required footprint to $O(1)$.
  • Generator Overfitting under Data Starvation: Training individual models on extremely small seed sets (10 to 20 samples) introduces severe overfitting risks. Rather than capturing a generalizable family distribution, the generator memorizes and amplifies idiosyncratic noise or artifacts present in the scarce source data.
  • Architectural and Evaluation Trade-offs: While modern Diffusion models (DDPMs) offer high semantic fidelity, they suffer from extreme manifold instability and blurred outputs under severe data scarcity, making PixelCNN a more stable baseline for rigid pixel layouts. Additionally, standard generative evaluation metrics like FID are blind to unique binary byte layouts, validating our reliance on downstream classification accuracy (TSTR) as a functional fidelity metric.
  • Feature-Space Representations and Domain Adaptation: The collapse of purely synthetic training stems from severe covariate shift, as the downstream classifier optimizes entirely around autoregressive generation artifacts rather than actual threat dynamics. Through the lens of semi-supervised domain adaptation, introducing a 15–20% authentic anchor provides the vital supervising signals necessary to execute manifold alignment, constraining the hidden layers to map both domains into a shared, invariant embedding subspace.

5. Conclusions and Future Work

This work demonstrates that while fully synthetic image distributions fail to sustain deep learning malware classifiers independently, combining generative data with a minimal anchor of real-world samples significantly mitigates data scarcity. Our evaluation confirms that generative augmentation can reduce the collection burden of authentic samples by up to 50% while maintaining highly functional classification rates, yielding a practically significant reduction in labeling and acquisition costs for security researchers.
Several avenues remain open for future exploration to enhance this framework. As autoregressive pixel-level generation struggles with complex, non-repetitive structural variations, future research should evaluate modern architectures like Denoising Diffusion Probabilistic Models (DDPMs) or advanced GANs. Additionally, implementing an adaptive blending framework could optimize ratios dynamically based on structural class complexity. Finally, future studies should investigate combining visual structural data with sequential features, such as opcode n-grams, and evaluate if classifiers trained on heavily augmented datasets exhibit heightened vulnerability to adversarial evasion techniques.

Author Contributions

Conceptualization, F.D.T.; methodology, F.D.T. and M.K.T.K.; software, M.K.T.K.; validation, F.D.T.; formal analysis, M.K.T.K.; investigation, M.K.T.K.; resources, M.K.T.K.; data curation, M.K.T.K.; writing—original draft preparation, M.K.T.K.; writing—review and editing, F.D.T.; visualization, F.D.T. and M.K.T.K.; supervision, F.D.T.; project administration, F.D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this article can be found at https://drive.google.com/drive/folders/1R9t5LrFjp7dU8MocnkCCPzUq7fyscFFX?usp=sharing (accessed on 26 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Extended Experimental Data

The following subsections provide the full classification reports and model prediction visualizations for the intermediate data configurations, specifically spanning ratios from 70% Synthetic/30% Real to 10% Synthetic/90% Real. These supplemental details are included to demonstrate the incremental performance fluctuations and class-specific optimization trends across all 25 malware families as the proportion of authentic training data progressively increases. Note that these experiments were conducted under conditions of extreme data scarcity, wherein a baseline constraint of only 10 total samples per class was utilized for model training. A comprehensive, in-depth analysis of these class-specific breakthroughs and systemic bottlenecks is discussed in Appendix A.8.

Appendix A.1. Results: 70% Synthetic/30% Real

In this experiment, the model accuracy reached 76%. While many families showed high precision, the model began to struggle with minority classes such as Obfuscator.AD and Rbot!gen.
Figure A1. Visualizing Model Predictions: 70% Synthetic/30% Real.

Table A1. Classification Report: 70% Synthetic/30% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 0.93 | 1.00 | 0.96 | 122 |
| Agent.FYI | 0.90 | 0.98 | 0.94 | 116 |
| Allaple.A | 0.99 | 0.98 | 0.98 | 2949 |
| Allaple.L | 1.00 | 0.99 | 0.99 | 1591 |
| Alueron.gen!J | 0.61 | 0.91 | 0.73 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.30 | 0.83 | 0.44 | 146 |
| C2LOP.gen!g | 0.28 | 0.83 | 0.42 | 200 |
| Dialplatform.B | 0.95 | 1.00 | 0.97 | 177 |
| Dontovo.A | 0.93 | 1.00 | 0.96 | 162 |
| Fakerean | 0.80 | 0.99 | 0.89 | 381 |
| Instantaccess | 0.74 | 1.00 | 0.85 | 431 |
| Lolyda.AA1 | 0.96 | 0.94 | 0.95 | 213 |
| Lolyda.AA2 | 0.90 | 0.70 | 0.79 | 184 |
| Lolyda.AA3 | 0.72 | 0.98 | 0.83 | 123 |
| Lolyda.AT | 0.88 | 0.99 | 0.93 | 159 |
| Malex.gen!J | 0.80 | 0.64 | 0.71 | 136 |
| Obfuscator.AD | 0.00 | 0.00 | 0.00 | 142 |
| Rbot!gen | 1.00 | 0.02 | 0.04 | 158 |
| Skintrim.N | 0.00 | 0.00 | 0.00 | 80 |
| Swizzor.gen!E | 0.00 | 0.00 | 0.00 | 128 |
| Swizzor.gen!I | 0.00 | 0.00 | 0.00 | 132 |
| VB.AT | 0.00 | 0.00 | 0.00 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.76 | |
| Macro Avg | 0.55 | 0.63 | 0.54 | 9339 |
| Weighted Avg | 0.72 | 0.76 | 0.72 | 9339 |

Appendix A.2. Results: 60% Synthetic/40% Real

In this experiment, the model accuracy remained at 76%. However, we observe an improvement in the F1-scores for several families, including Alueron.gen!J and Obfuscator.AD.
Figure A2. Visualizing Model Predictions: 60% Synthetic/40% Real.

Table A2. Classification Report: 60% Synthetic/40% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 0.95 | 1.00 | 0.97 | 122 |
| Agent.FYI | 0.99 | 1.00 | 1.00 | 116 |
| Allaple.A | 1.00 | 0.84 | 0.91 | 2949 |
| Allaple.L | 0.99 | 1.00 | 1.00 | 1591 |
| Alueron.gen!J | 0.97 | 0.98 | 0.98 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.30 | 0.90 | 0.45 | 146 |
| C2LOP.gen!g | 0.44 | 0.76 | 0.56 | 200 |
| Dialplatform.B | 0.53 | 1.00 | 0.69 | 177 |
| Dontovo.A | 0.96 | 0.81 | 0.88 | 162 |
| Fakerean | 0.99 | 0.99 | 0.99 | 381 |
| Instantaccess | 0.99 | 1.00 | 0.99 | 431 |
| Lolyda.AA1 | 0.97 | 1.00 | 0.98 | 213 |
| Lolyda.AA2 | 0.71 | 0.97 | 0.82 | 184 |
| Lolyda.AA3 | 0.98 | 0.99 | 0.98 | 123 |
| Lolyda.AT | 0.54 | 0.99 | 0.70 | 159 |
| Malex.gen!J | 0.22 | 0.93 | 0.36 | 136 |
| Obfuscator.AD | 0.65 | 1.00 | 0.79 | 142 |
| Rbot!gen | 0.93 | 0.99 | 0.96 | 158 |
| Skintrim.N | 0.00 | 0.00 | 0.00 | 80 |
| Swizzor.gen!E | 0.00 | 0.00 | 0.00 | 128 |
| Swizzor.gen!I | 0.00 | 0.00 | 0.00 | 132 |
| VB.AT | 0.00 | 0.00 | 0.00 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.76 | |
| Macro Avg | 0.57 | 0.73 | 0.61 | 9339 |
| Weighted Avg | 0.75 | 0.76 | 0.74 | 9339 |

Appendix A.3. Results: 50% Synthetic/50% Real

This experiment marks a significant improvement to 81% accuracy, with the model beginning to correctly identify families like Skintrim.N.
Table A3. Classification Report: 50% Synthetic/50% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 0.99 | 1.00 | 1.00 | 122 |
| Agent.FYI | 0.88 | 1.00 | 0.94 | 116 |
| Allaple.A | 1.00 | 0.98 | 0.99 | 2949 |
| Allaple.L | 0.99 | 0.99 | 0.99 | 1591 |
| Alueron.gen!J | 0.98 | 0.93 | 0.95 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.36 | 0.91 | 0.52 | 146 |
| C2LOP.gen!g | 0.42 | 0.84 | 0.56 | 200 |
| Dialplatform.B | 0.73 | 1.00 | 0.84 | 177 |
| Dontovo.A | 0.69 | 1.00 | 0.81 | 162 |
| Fakerean | 1.00 | 0.99 | 1.00 | 381 |
| Instantaccess | 0.99 | 0.98 | 0.98 | 431 |
| Lolyda.AA1 | 0.97 | 0.94 | 0.95 | 213 |
| Lolyda.AA2 | 0.98 | 0.97 | 0.98 | 184 |
| Lolyda.AA3 | 0.98 | 0.99 | 0.99 | 123 |
| Lolyda.AT | 0.42 | 0.99 | 0.59 | 159 |
| Malex.gen!J | 0.87 | 0.93 | 0.90 | 136 |
| Obfuscator.AD | 0.93 | 1.00 | 0.96 | 142 |
| Rbot!gen | 0.85 | 1.00 | 0.92 | 158 |
| Skintrim.N | 0.99 | 1.00 | 0.99 | 80 |
| Swizzor.gen!E | 0.00 | 0.00 | 0.00 | 128 |
| Swizzor.gen!I | 0.00 | 0.00 | 0.00 | 132 |
| VB.AT | 0.00 | 0.00 | 0.00 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.81 | |
| Macro Avg | 0.65 | 0.78 | 0.68 | 9339 |
| Weighted Avg | 0.77 | 0.81 | 0.78 | 9339 |

Figure A3. Visualizing Model Predictions: 50% Synthetic/50% Real.

Appendix A.4. Results: 40% Synthetic/60% Real

In this experiment, the accuracy reached 82%, and we see the first instances of correct classification for the Swizzor.gen!E family.
Table A4. Classification Report: 40% Synthetic/60% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 0.97 | 1.00 | 0.98 | 122 |
| Agent.FYI | 0.99 | 0.97 | 0.98 | 116 |
| Allaple.A | 1.00 | 0.97 | 0.98 | 2949 |
| Allaple.L | 0.99 | 1.00 | 1.00 | 1591 |
| Alueron.gen!J | 0.99 | 0.96 | 0.98 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.53 | 0.86 | 0.66 | 146 |
| C2LOP.gen!g | 0.62 | 0.85 | 0.72 | 200 |
| Dialplatform.B | 0.76 | 1.00 | 0.86 | 177 |
| Dontovo.A | 0.83 | 1.00 | 0.91 | 162 |
| Fakerean | 1.00 | 0.99 | 1.00 | 381 |
| Instantaccess | 1.00 | 1.00 | 1.00 | 431 |
| Lolyda.AA1 | 0.96 | 1.00 | 0.98 | 213 |
| Lolyda.AA2 | 0.88 | 0.97 | 0.92 | 184 |
| Lolyda.AA3 | 0.98 | 0.99 | 0.98 | 123 |
| Lolyda.AT | 0.83 | 0.99 | 0.90 | 159 |
| Malex.gen!J | 0.71 | 0.93 | 0.80 | 136 |
| Obfuscator.AD | 1.00 | 1.00 | 1.00 | 142 |
| Rbot!gen | 0.97 | 0.99 | 0.98 | 158 |
| Skintrim.N | 1.00 | 1.00 | 1.00 | 80 |
| Swizzor.gen!E | 0.20 | 0.76 | 0.32 | 128 |
| Swizzor.gen!I | 0.00 | 0.00 | 0.00 | 132 |
| VB.AT | 0.00 | 0.00 | 0.00 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.82 | |
| Macro Avg | 0.69 | 0.81 | 0.73 | 9339 |
| Weighted Avg | 0.79 | 0.82 | 0.80 | 9339 |

Figure A4. Visualizing Model Predictions: 40% Synthetic/60% Real.

Appendix A.5. Results: 30% Synthetic/70% Real

Accuracy remained at 82% in this experiment, but the model showed the first signs of correctly identifying the Swizzor.gen!I family.
Table A5. Classification Report: 30% Synthetic/70% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 0.94 | 1.00 | 0.97 | 122 |
| Agent.FYI | 0.98 | 1.00 | 0.99 | 116 |
| Allaple.A | 1.00 | 0.98 | 0.99 | 2949 |
| Allaple.L | 1.00 | 1.00 | 1.00 | 1591 |
| Alueron.gen!J | 0.96 | 0.97 | 0.96 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.54 | 0.79 | 0.64 | 146 |
| C2LOP.gen!g | 0.49 | 0.68 | 0.57 | 200 |
| Dialplatform.B | 0.98 | 0.99 | 0.99 | 177 |
| Dontovo.A | 0.59 | 1.00 | 0.74 | 162 |
| Fakerean | 1.00 | 0.99 | 1.00 | 381 |
| Instantaccess | 0.99 | 1.00 | 1.00 | 431 |
| Lolyda.AA1 | 0.95 | 1.00 | 0.97 | 213 |
| Lolyda.AA2 | 0.74 | 0.97 | 0.84 | 184 |
| Lolyda.AA3 | 0.99 | 0.98 | 0.99 | 123 |
| Lolyda.AT | 0.99 | 0.99 | 0.99 | 159 |
| Malex.gen!J | 0.77 | 0.93 | 0.84 | 136 |
| Obfuscator.AD | 0.99 | 1.00 | 1.00 | 142 |
| Rbot!gen | 0.99 | 1.00 | 0.99 | 158 |
| Skintrim.N | 1.00 | 1.00 | 1.00 | 80 |
| Swizzor.gen!E | 0.20 | 0.67 | 0.31 | 128 |
| Swizzor.gen!I | 0.41 | 0.08 | 0.14 | 132 |
| VB.AT | 0.00 | 0.00 | 0.00 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.82 | |
| Macro Avg | 0.70 | 0.80 | 0.73 | 9339 |
| Weighted Avg | 0.79 | 0.82 | 0.80 | 9339 |

Figure A5. Visualizing Model Predictions: 30% Synthetic/70% Real.

Appendix A.6. Results: 20% Synthetic/80% Real

In this experiment, accuracy improved to 83%. Recall for Swizzor.gen!I rose from 0.08 to 0.34, whereas recall for Swizzor.gen!E fell from 0.67 to 0.49, and overall accuracy was still inhibited by the VB.AT and Yuner.A families.
Table A6. Classification Report: 20% Synthetic/80% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 0.98 | 1.00 | 0.99 | 122 |
| Agent.FYI | 0.85 | 1.00 | 0.92 | 116 |
| Allaple.A | 1.00 | 0.98 | 0.99 | 2949 |
| Allaple.L | 0.99 | 1.00 | 1.00 | 1591 |
| Alueron.gen!J | 1.00 | 0.99 | 1.00 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.57 | 0.85 | 0.68 | 146 |
| C2LOP.gen!g | 0.57 | 0.79 | 0.67 | 200 |
| Dialplatform.B | 0.95 | 1.00 | 0.98 | 177 |
| Dontovo.A | 0.51 | 1.00 | 0.68 | 162 |
| Fakerean | 1.00 | 1.00 | 1.00 | 381 |
| Instantaccess | 1.00 | 1.00 | 1.00 | 431 |
| Lolyda.AA1 | 0.96 | 1.00 | 0.98 | 213 |
| Lolyda.AA2 | 0.83 | 0.97 | 0.90 | 184 |
| Lolyda.AA3 | 0.94 | 0.99 | 0.96 | 123 |
| Lolyda.AT | 0.95 | 0.99 | 0.97 | 159 |
| Malex.gen!J | 0.91 | 0.93 | 0.92 | 136 |
| Obfuscator.AD | 0.99 | 1.00 | 1.00 | 142 |
| Rbot!gen | 0.98 | 0.99 | 0.99 | 158 |
| Skintrim.N | 1.00 | 1.00 | 1.00 | 80 |
| Swizzor.gen!E | 0.20 | 0.49 | 0.29 | 128 |
| Swizzor.gen!I | 0.42 | 0.34 | 0.38 | 132 |
| VB.AT | 0.00 | 0.00 | 0.00 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.83 | |
| Macro Avg | 0.71 | 0.81 | 0.74 | 9339 |
| Weighted Avg | 0.80 | 0.83 | 0.80 | 9339 |

Figure A6. Visualizing Model Predictions: 20% Synthetic/80% Real.

Appendix A.7. Results: 10% Synthetic/90% Real

In this experiment, the accuracy increased to 85%. Crucially, the model finally began to correctly classify members of the VB.AT family as the training set became 90% authentic.
Figure A7. Visualizing Model Predictions: 10% Synthetic/90% Real.

Table A7. Classification Report: 10% Synthetic/90% Real.

| Family | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Adialer.C | 1.00 | 1.00 | 1.00 | 122 |
| Agent.FYI | 0.99 | 0.97 | 0.98 | 116 |
| Allaple.A | 1.00 | 0.98 | 0.99 | 2949 |
| Allaple.L | 0.99 | 0.99 | 0.99 | 1591 |
| Alueron.gen!J | 1.00 | 0.92 | 0.96 | 198 |
| Autorun.K | 0.12 | 1.00 | 0.21 | 106 |
| C2LOP.P | 0.52 | 0.85 | 0.65 | 146 |
| C2LOP.gen!g | 0.49 | 0.71 | 0.58 | 200 |
| Dialplatform.B | 0.96 | 1.00 | 0.98 | 177 |
| Dontovo.A | 0.87 | 1.00 | 0.93 | 162 |
| Fakerean | 1.00 | 0.99 | 1.00 | 381 |
| Instantaccess | 0.98 | 1.00 | 0.99 | 431 |
| Lolyda.AA1 | 0.92 | 1.00 | 0.96 | 213 |
| Lolyda.AA2 | 1.00 | 0.97 | 0.99 | 184 |
| Lolyda.AA3 | 0.98 | 0.98 | 0.98 | 123 |
| Lolyda.AT | 0.76 | 1.00 | 0.86 | 159 |
| Malex.gen!J | 0.95 | 0.93 | 0.94 | 136 |
| Obfuscator.AD | 0.99 | 1.00 | 0.99 | 142 |
| Rbot!gen | 0.96 | 1.00 | 0.98 | 158 |
| Skintrim.N | 1.00 | 1.00 | 1.00 | 80 |
| Swizzor.gen!E | 0.34 | 0.54 | 0.42 | 128 |
| Swizzor.gen!I | 0.41 | 0.32 | 0.36 | 132 |
| VB.AT | 1.00 | 0.62 | 0.76 | 408 |
| Wintrim.BX | 0.00 | 0.00 | 0.00 | 97 |
| Yuner.A | 0.00 | 0.00 | 0.00 | 800 |
| Accuracy | | | 0.85 | |
| Macro Avg | 0.77 | 0.83 | 0.78 | 9339 |
| Weighted Avg | 0.85 | 0.85 | 0.84 | 9339 |

Appendix A.8. Class-Specific Attenuation and Boundary Breaking Points

A granular analysis of the sequential classification performance reports across varying synthetic-to-real ratios (Table A1 through Table A7) reveals structural behavioral trends that macro-level accuracy metrics obscure. While the global accuracy scales monotonically from 76.0% to 85.0% as the training configuration shifts from 70% Synthetic/30% Real to 10% Synthetic/90% Real, this performance growth is driven by sudden, non-linear optimization breakthroughs in specific malware families rather than uniform improvement across the dataset.
The most prominent architectural bottleneck observed is the total systemic blindness to the Yuner.A family, which maintains an F1-score of exactly 0.00 across all evaluated ratios. Despite comprising nearly 9% of the testing distribution (Support = 800), the features learned from the synthetic generation pipeline fail to align with the empirical reality of the Yuner.A target domain, which persistently lowers the attainable upper bound on global accuracy. A similar phenomenon is present in Autorun.K (Support = 106), which exhibits a static recall of 1.00 alongside a severely depressed precision of 0.12 across all experiments. This reveals a highly warped feature space in which the model routinely uses Autorun.K as a sink class, misclassifying unmapped features from zero-F1 families into this specific boundary zone.
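Both effects can be approximated from the reported figures alone. The sketch below is a back-of-the-envelope calculation using the rounded values in Table A7; the function names are illustrative, and because precision is reported to only two decimals the false-positive figure is an estimate.

```python
TEST_SET_SIZE = 9339  # total support in the classification reports

def accuracy_ceiling(zero_recall_supports, total=TEST_SET_SIZE):
    """Best possible accuracy if every other sample were classified correctly."""
    return (total - sum(zero_recall_supports)) / total

def implied_false_positives(support, recall, precision):
    """Estimate false positives from rounded report values."""
    true_positives = support * recall
    predicted_positives = true_positives / precision
    return predicted_positives - true_positives

# Table A7 (10% Synthetic/90% Real): Wintrim.BX and Yuner.A have zero recall.
print(round(accuracy_ceiling([97, 800]), 3))            # 0.904
# Autorun.K: support 106, recall 1.00, precision 0.12 in every report.
print(round(implied_false_positives(106, 1.00, 0.12)))  # 777
```

With Yuner.A and Wintrim.BX entirely missed, accuracy in the 10% Synthetic/90% Real configuration cannot exceed about 90%. A precision of 0.12 at full recall implies that several hundred samples from other families (roughly 740 to 820, allowing for rounding) are being assigned to Autorun.K.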
Conversely, the transition to high-proportion authentic training regimes exposes clear baseline threshold dependencies for complex classes. The VB.AT family (Support = 408) remains completely unclassified (F1-score of 0.00) from the 70% synthetic configuration down through the 20% synthetic configuration. However, upon reaching the 10% Synthetic/90% Real training split, VB.AT undergoes an abrupt optimization breakthrough, achieving an F1-score of 0.76 (Precision = 1.00, Recall = 0.62). This indicates that certain malware families exhibit a strict floor of authenticity requirements; that is, their internal structural variance cannot be adequately bridged by the generative distribution, requiring a dominant presence of real-world data to converge during backpropagation.

References

  1. McAfee Labs. Global Threat Landscape Report: The Proliferation of Polymorphic Malware; Technical Report; McAfee Enterprise: San Jose, CA, USA, 2025.
  2. SonicWall Cyber Threat Research. 2026 SonicWall Cyber Threat Report: Persistent Evasions and Evolving Attack Vectors; Technical Report; SonicWall, Inc.: Milpitas, CA, USA, 2026.
  3. Vi, B.N.; Noi Nguyen, H.; Nguyen, N.T.; Truong Tran, C. Adversarial Examples Against Image-based Malware Classification Systems. In Proceedings of the 2019 11th International Conference on Knowledge and Systems Engineering (KSE), Da Nang, Vietnam, 24–26 October 2019; pp. 1–5.
  4. Alshoulie, M.; Mehmood, A. Deep Learning Approaches for Malware Detection: A Comprehensive Review of Techniques, Challenges, and Future Directions. IEEE Access 2025, 13, 118652–118677.
  5. Aycock, J. Anti-Virus Techniques. In Computer Viruses and Malware; Springer: Boston, MA, USA, 2006; pp. 53–95.
  6. Agarap, A.F. Towards building an intelligent anti-malware system: A deep learning approach using support vector machine (SVM) for malware classification. arXiv 2017, arXiv:1801.00318.
  7. Kalash, M.; Rochan, M.; Mohammed, N.; Bruce, N.D.B.; Wang, Y.; Iqbal, F. Malware Classification with Deep Convolutional Neural Networks. In Proceedings of the 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), Paris, France, 26–28 February 2018; pp. 1–5.
  8. Gnaneswar, S.S.; N, A.; M, R.; Bopche, G.S. From Bytes to Pixels: Robust Malware Classification Using Deep Neural Networks. In Proceedings of the 2025 1st International Conference on Data Science and Intelligent Network Computing (ICDSINC), Raipur, India, 9–11 December 2025; pp. 823–829.
  9. Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware Images: Visualization and Automatic Classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec), Pittsburgh, PA, USA, 20 July 2011; pp. 1–7.
  10. Chandran, S.; Syam, S.R.; Sankaran, S.; Pandey, T.; Achuthan, K. From Static to AI-Driven Detection: A Comprehensive Review of Obfuscated Malware Techniques. IEEE Access 2025, 13, 74335–74358.
  11. Van den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel Recurrent Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1747–1756.
  12. Peng, D.; Husain, M.; Siddiqui, A.; Bhattacharya, S. Visual Malware Classification Using a CNN. In Proceedings of the 2024 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 11–13 October 2024; pp. 1–5.
  13. Yajamanam, S.; Selvin, V.R.S.; Di Troia, F.; Stamp, M. Deep Learning versus Gist Descriptors for Image-based Malware Classification. In Proceedings of the 4th International Conference on Information Systems Security and Privacy-Volume 1: ForSE; SciTePress: Setúbal, Portugal, 2018; pp. 553–561.
  14. Vasan, D.; Alazab, M.; Wassan, S.; Naeem, H.; Safaei, B.; Zheng, Q. IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture. Comput. Netw. 2020, 171, 107138.
  15. Nguyen, H.; Di Troia, F.; Ishigaki, G.; Stamp, M. Generative adversarial networks and image-based malware classification. J. Comput. Virol. Hacking Tech. 2023, 19, 579–595.
  16. Hu, W.; Tan, Y. Generating adversarial malware examples for black-box attacks based on GAN. In Proceedings of the International Conference on Data Mining and Big Data; Springer: Berlin/Heidelberg, Germany, 2022; pp. 409–423.
  17. Trehan, H.; Di Troia, F. Fake Malware Generation Using HMM and GAN. In Proceedings of the Silicon Valley Cybersecurity Conference; Chang, S.Y., Bathen, L., Di Troia, F., Austin, T.H., Nelson, A.J., Eds.; Springer: Cham, Switzerland, 2022; pp. 3–21.
  18. Mazaed Alotaibi, F.; Fawad. A Multifaceted Deep Generative Adversarial Networks Model for Mobile Malware Detection. Appl. Sci. 2022, 12, 9403.
  19. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
  20. Bao, T.; Trousil, K.; Duy Tran, Q.; Di Troia, F.; Park, Y. Generating Synthetic Malware Samples Using Generative AI. IEEE Access 2025, 13, 59725–59736.
  21. Sun, Y.; Ji, Y.; Tao, X. Research on Default Classification of Unbalanced Credit Data Based on PixelCNN-WGAN. Electronics 2024, 13, 3419.
  22. Van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  23. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image Transformer. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064.
  24. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883.
  25. Ravuri, S.; Vinyals, O. Classification Accuracy Score for Conditional Generative Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
  26. UCSB Vision and Robotics Group. Signal Processing for Malware Analysis. Available online: https://vision.ece.ucsb.edu/research/signal-processing-malware-analysis (accessed on 26 June 2026).
  27. Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Conditional Image Generation with PixelCNN Decoders. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 4790–4798.
  28. Salimans, T.; Karpathy, A.; Chen, X.; Kingma, D.P. PixelCNN++: Improving PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
Figure 1. Visual representations of three malware samples from the Fakerean family.
Figure 2. Visual representations from various malware families demonstrating distinct structural textures.
Figure 3. Overview of the PixelCNN architecture, illustrating the flow from masked convolutions through residual blocks to the final softmax classifier.
Figure 4. Architecture of the PixelCNN residual block, demonstrating the bottleneck layers and the identity shortcut connection.
Figure 5. Sample generation result at Epoch 366.
Figure 6. Class-specific image generation results across distinct malware families. The top row shows synthetic samples, while the bottom row shows real samples.
Figure 7. Accuracy decay on real malware correlated with reduced PixelCNN training sample size.
Figure 8. Accuracy vs. Percentage of Generated Malware Used in Training across all baselines.
Figure 9. Visualizing prediction confidence evolution. High-intensity areas represent strong class identification.
Table 1. Malimg dataset distribution by malware family and type.

| Malware Family | Malware Type | Sample Count |
|---|---|---|
| Adialer.C | Dialer | 122 |
| Agent.FYI | Trojan | 116 |
| Allaple.A | Worm | 2949 |
| Allaple.L | Worm | 1591 |
| Alueron.gen!J | Trojan | 198 |
| Autorun.K | Worm | 106 |
| C2LOP.P | Trojan | 146 |
| C2LOP.gen!g | Trojan | 200 |
| Dialplatform.B | Dialer | 177 |
| Dontovo.A | Trojan | 162 |
| FakeRean | Rogue | 381 |
| Instantaccess | Dialer | 431 |
| Lolyda.AA1 | Trojan PWS | 213 |
| Lolyda.AA2 | Trojan PWS | 184 |
| Lolyda.AA3 | Trojan PWS | 123 |
| Lolyda.AT | Trojan PWS | 159 |
| Malex.gen!J | Trojan | 136 |
| Obfuscator.AD | Trojan | 142 |
| Rbot!gen | Trojan | 158 |
| Skintrim.N | Trojan | 80 |
| Swizzor.gen!E | Trojan | 128 |
| Swizzor.gen!I | Trojan | 132 |
| VB.AT | Worm | 408 |
| Wintrim.BX | Trojan | 97 |
| Yuner.A | Worm | 800 |
| Total | | 9339 |
Table 2. Hyperparameter Grid Search Space Boundaries and Configuration Parameters.

| Model Pipeline/Parameter | Search Space Range | Explored Baseline Value |
|---|---|---|
| Generative PixelCNN | | |
| Masked Convolutional Layers | [4, 6, 8, 12] | 6 Layers |
| Filter Dimension (Kernel Size) | [3 × 3, 5 × 5, 7 × 7] | 3 × 3 |
| Feature Map Channels | [32, 64, 128, 256] | 64 Channels |
| Learning Rate (Adam) | [$10^{-2}$, $10^{-3}$, $10^{-4}$] | $10^{-3}$ |
| Batch Size | [16, 32, 64] | 32 |
| Downstream CNN Classifier | | |
| Convolutional Blocks | [2, 3, 4, 5] | 3 Blocks |
| Initial Learning Rate | [$10^{-3}$, $5 \times 10^{-4}$, $10^{-4}$] | $10^{-3}$ |
| Dropout Rate | [0.2, 0.3, 0.4, 0.5] | 0.3 |
| Batch Size | [16, 32, 64] | 32 |
| Optimizer | [SGD, Adam, RMSprop] | Adam |
Table 3. Classification Accuracy Across Varying Baselines and Synthetic/Real Ratios.

| Gen % | Real % | 80 Samples/Class | 40 Samples/Class | 20 Samples/Class | 10 Samples/Class |
|---|---|---|---|---|---|
| 100% | 0% | 28% | 4% | 3% | 3% |
| 90% | 10% | 31% | 5% | 3% | 3% |
| 85% | 15% | 61% | 61% | 61% | - |
| 80% | 20% | 72% | 71% | 71% | 69% |
| 70% | 30% | 78% | 78% | 77% | 76% |
| 60% | 40% | 80% | 81% | 77% | 76% |
| 50% | 50% | 82% | 82% | 80% | 81% |
| 40% | 60% | 83% | 82% | 83% | 82% |
| 30% | 70% | 81% | 84% | 83% | 82% |
| 20% | 80% | 84% | 84% | 82% | 83% |
| 10% | 90% | 89% | 88% | 87% | 85% |
| 0% | 100% | 89% | 89% | 88% | 86% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Karumudi, M.K.T.; Di Troia, F. Addressing Data Scarcity in Malware Classification via Pixel-Level Synthetic Image Generation. Electronics 2026, 15, 2848. https://doi.org/10.3390/electronics15132848
