1. Introduction
As cyber threats continue to evolve, malware grows in complexity and sophistication [1]. The increasing profitability of malware has been associated with a rise in data breaches, disruptions to services, and substantial recovery costs, as seen in cases of ransomware attacks [2, 3]. Ransomware, despite its longstanding presence in the cybersecurity landscape, has evolved markedly in both scale and complexity in recent years. Notably, in 2023, the incidence of documented ransomware attacks increased by approximately 68%, with one of the most severe cases involving a ransom demand of USD 80 million directed at Royal Mail [4]. Cybersecurity practitioners must therefore continually refine their detection methodologies to counteract sophisticated adversarial threats.
The increasing sophistication and rapid evolution of malicious software have made its detection a fundamental challenge in contemporary cybersecurity. Advanced malware, such as polymorphic and memory-resident variants, is engineered to circumvent conventional security mechanisms. Polymorphic malware, in particular, is capable of dynamically altering its code structure or file signature during propagation while retaining its underlying malicious functionality. Detecting this category of malware is difficult because it can transform into multiple variants that are structurally distinct yet functionally identical [5]. Cybercriminals also increasingly use techniques such as code obfuscation, intentionally transforming malicious code into a convoluted form to hinder detection and analysis [6].
Considering these challenges and the growing dependence on digital infrastructure for contemporary business operations and service provision, advancing malware detection methods is essential. This can be achieved either by refining existing techniques or by developing novel strategies and solutions. This study addresses this challenge by examining the effectiveness of static and dynamic malware analysis when their outputs are rendered as standardised RGB images and classified with convolutional neural networks (CNNs). The aim of this study was to establish which analysis strategy offers a more robust foundation for machine-learning-based malware detection. The primary objective was to evaluate the effectiveness of CNNs in distinguishing between malicious and benign software, with a particular focus on comparing the influence of static and dynamic representations on classification performance.
This study presents a novel approach to malware detection that integrates static and dynamic malware analyses into a unified RGB image representation, combining multi-modal analysis, novel augmentation techniques, and architecture-specific CNN design to tackle evolving threats.
The primary contributions presented in this study are summarised as follows:
We propose a novel deep convolutional neural network framework that integrates static and dynamic malware analyses into a unified RGB image representation to enable more comprehensive and accurate classification by capturing both structural and behavioural patterns.
We introduce an advanced data augmentation approach leveraging CycleGAN-based cross-domain image synthesis which significantly expands the malware dataset to enhance the model’s ability to generalise across diverse malware variants.
We design and optimise two tailored CNN architectures specifically adapted to the characteristics of static binary images and dynamic memory dump data by incorporating architectural modifications that effectively address noise and complexity.
2. Related Works
Malware detection approaches are generally categorised into two principal methodologies: static analysis and dynamic analysis [7]. Static analysis is a conventional and widely adopted technique that has traditionally been employed to detect less sophisticated malware threats. It examines software artefacts without executing them in order to identify potential vulnerabilities or structural issues, drawing on packet headers, associated metadata, and file signature information. Signature-based detection techniques are supported by comprehensive databases of known malware signatures and offer high reliability in identifying known malicious software with minimal false positives. However, adversaries are increasingly adopting evasion strategies such as code obfuscation and encryption to avoid detection. Although static analysis offers rapid and dependable detection for known threats, it exhibits limitations in characterising program behaviour and is often ineffective against novel or obfuscated malicious code.
Dynamic analysis builds a comprehensive understanding of a file’s behaviour by observing its execution within a controlled environment [8]. It provides detailed insights into program behaviour by enabling real-time observation of malware interactions with the system. By observing activities such as file alterations and network communication, dynamic analysis also identifies file structures [9]. These behavioural characteristics frequently evade static analysis, which makes dynamic analysis particularly valuable for detecting zero-day and advanced malware variants. However, dynamic approaches are time-consuming and often demand substantial computational resources [10], and they carry a higher risk of malware infection [11]. Virtual sandboxes and artificial environments are commonly employed for malware execution and analysis to mitigate these risks. The challenge is that these tools may not fully capture the malware’s behaviour, as discrepancies often arise between its actions in such controlled environments and those observed in real-world scenarios [12].
Static and dynamic analysis methods continue to be fundamental in malware examination, each offering unique advantages tailored to specific use cases. Dynamic analysis enables an in-depth exploration of malware behaviour through the observation of code during execution, whereas static analysis focuses on assessing the software’s structure and functionalities without running the code. The integration of these two approaches has demonstrated considerable improvements in detection accuracy and a greater comprehension of the analysed malware [
13]. While hybrid approaches have the potential to yield more reliable outcomes, they are accompanied by notable drawbacks. First, they tend to incur higher costs and demand greater resources. Second, they are more complex to maintain. These challenges often render them impractical for widespread application, particularly in routine malware detection scenarios. This observation prompts a critical inquiry: Are hybrid strategies indispensable, or can a single well-optimised method augmented by advanced machine learning techniques offer a more efficient and viable alternative? Machine learning (ML) represents an advanced computational approach that has the potential to transform malware detection [
14]. ML leverages artificial neural networks to model and interpret complex data structures. These networks draw inspiration from the architecture of the human brain [
15].
Such networks consist of interconnected layers of artificial neurons. Each layer processes input from the preceding one, progressively extracting intricate features and patterns from the data. By leveraging data-driven learning, ML techniques can recognise complex patterns and make autonomous decisions without dependence on explicitly programmed rules. The introduction of backpropagation algorithms facilitated the efficient training of multi-layer neural networks, thereby enhancing their capability to model complex data relationships. This advancement enables neural networks to effectively identify intricate patterns, making them particularly well suited for the analysis of malicious software across various contexts, including file-based malware, memory dumps, and behavioural traces.
Machine learning offers substantial advancements in the design of contemporary malware detection methods. One significant benefit of machine learning (ML) in malware analysis is its ability to automate the examination of large-scale data with remarkable speed and efficiency [
16]. However, adversaries have begun leveraging ML techniques to create novel and more sophisticated malware variants by training algorithms on previously identified malicious code [
17]. Machine learning models trained on labeled datasets comprising both malicious and benign software samples enable the classification of software into benign or malicious categories [
18]. After training, such models can generalise from previously observed malware characteristics to identify novel threats, thereby offering significant potential in the detection of zero-day attacks.
The performance of machine learning (ML) models in identifying emerging cyber threats is closely linked to the quality, diversity, and representativeness of the training data, as well as the robustness of the model architecture. Insufficient or unbalanced datasets can hinder the learning process, potentially resulting in elevated rates of false positives or false negatives [
19]. To address this challenge, one promising strategy involves the use of RGB image representations [
20]. This technique transforms malware-related data into colour-encoded images. The RGB image representation technique enhances malware detection models by distinguishing between benign and malicious instances from a set of features embedded within each sample. While static and dynamic malware analysis techniques have been extensively investigated in isolation, there remains a significant gap in comparative studies that utilise consistent convolutional neural network (CNN) architectures trained on uniformly pre-processed image-based representations from both analysis modalities. Deep learning is a subfield of machine learning (ML). It facilitates the automatic extraction of intricate features from raw malware input data with minimal human intervention [
21].
Deep learning architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have emerged as prominent models due to their effectiveness in various learning tasks including malware detection [
22]. CNNs are known for efficiently capturing visual patterns associated with malware attributes, and they have demonstrated stronger performance in image-based classification tasks than traditional ML models [23]. A novel approach involves encoding malware binaries into RGB image formats. This enables both static and dynamic analyses to be interpreted within a unified image-centric malware detection framework. CNNs have demonstrated significant efficacy in malware detection using image-based data representations, owing to their capacity to autonomously learn and extract spatially invariant features from malware inputs.
In ref. [24], the authors proposed the transformation of malware binaries into greyscale images to effectively capture and analyse texture-based similarities among malware samples. The transformation allows the CNN model to identify patterns and textures within the image representations which correspond to structural and behavioural similarities in the malware code. The RGB representation demonstrated enhanced classification performance by facilitating the extraction of deeper structural features through CNNs. However, despite their effectiveness in capturing spatial characteristics, CNNs have a major limitation in processing data that carries the sequential dependencies intrinsic to malware. Recent developments in machine learning (ML) for malware detection have prompted the investigation of alternative data representation techniques to enhance model training and pattern recognition capabilities [25]. The authors of [26] designed a malware detection system (MDS) using a CNN by converting malware files into image representations. The MDS uses spatial pyramid pooling layers to accept varying image dimensions. The proposed method can differentiate image colour spaces among malware features to enhance detection and can resist API injections.
In another study, ref. [27] explored visualisation-based malware detection with image augmentation using a CNN technique. In this study, malware samples were initially represented in matrix form and subsequently converted into greyscale and RGB image formats through the application of a pooling technique. The experimental results indicated that, when utilised within the B2IMG framework, the RGB image representation outperformed its greyscale counterpart in terms of malware detection accuracy. Notably, the evaluation was primarily conducted using earlier versions of convolutional neural network (CNN) architectures. However, further analysis employing more advanced CNN models, such as VGG-3 and higher versions, reinforced the superiority of RGB-based representations over greyscale images for the task of malware classification [28]. While both approaches [27, 28] are effective, RGB images offer a more comprehensive representation of data as each pixel comprises three distinct colour channels. Each of these channels is capable of independently encoding information, thereby enriching the overall content.
In addition to colour representation, the selection of the appropriate image file format is crucial when converting binary or memory data into visual representations. The chosen format must maintain the pixel-level fidelity of the data and avoid introducing compression artefacts or distortions. The authors of [
29] proposed a novel MalJPEG framework to detect malicious JPEG images. The framework extracts ten simple yet discriminative features directly from the JPEG file structure. While the results demonstrated that the framework accurately distinguishes between benign and malicious JPEG images even in the presence of a class imbalance, the features used are static. This may not capture the more sophisticated obfuscation or payload concealment techniques employed by advanced attackers. In addition, even though the authors used a large and real-world dataset, this may not cover all variants or evolving threats associated with malicious JPEGs. As shown in
Table 1, despite incremental progress, none of the reviewed works offered a holistic model that simultaneously incorporates static and dynamic analysis, modern machine learning techniques, robust augmentation strategies, and RGB-based image representations. To address this, our proposed model integrates all five dimensions, namely static feature extraction, dynamic feature extraction, advanced machine learning, augmentation of the dataset to improve generalisation, and RGB image representation for an enriched feature space. This comprehensive approach enhances detection accuracy, model robustness, and interpretability. These properties make the proposed work a significant advancement over existing techniques.
4. Model Approach
4.1. CNN Model
The proposed model is designed to process image representations of static and dynamic software behaviours for the purpose of binary classification to identify whether a given software instance is benign or malicious. The model was developed using a convolutional neural network. In the CNN architecture of the model, as illustrated in
Figure 6a, the network begins by accepting a standardised augmented image input of size 300 × 300 × 3, representing RGB image data generated from either binary files or memory dumps. The uniform input size ensures consistency across the dataset, which comprises 74,080 augmented samples derived from 504 original software artefacts through brightness and noise transformations. A structural overview of the proposed CNN model is presented in
Figure 6b.
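The brightness and noise transformations mentioned above can be illustrated with a minimal sketch. The function names, the brightness range (0.8 to 1.2), and the noise level (standard deviation up to 10) are hypothetical choices made for illustration only; the study does not state the parameters it used.

```python
import random

def adjust_brightness(pixels, factor):
    """Scale every channel by `factor` and clamp the result to the 0-255 byte range."""
    return [tuple(min(255, max(0, round(c * factor))) for c in px) for px in pixels]

def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise to every channel, then clamp to 0-255."""
    return [
        tuple(min(255, max(0, round(c + rng.gauss(0, sigma)))) for c in px)
        for px in pixels
    ]

def augment(pixels, n_variants, seed=0):
    """Create `n_variants` perturbed copies of one image (a list of RGB tuples)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        factor = rng.uniform(0.8, 1.2)  # hypothetical brightness range
        sigma = rng.uniform(0.0, 10.0)  # hypothetical noise level
        brightened = adjust_brightness(pixels, factor)
        variants.append(add_gaussian_noise(brightened, sigma, rng))
    return variants

image = [(0, 128, 255), (10, 20, 30)]  # toy two-pixel "image"
variants = augment(image, n_variants=3)
```

Each original artefact yields several perturbed copies that keep the same dimensions, which is how a small set of originals can be expanded into a much larger training set.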
The CNN pipeline developed for the proposed model is structured into four sequential convolutional blocks, each consisting of a convolutional layer followed by a ReLU activation and max pooling operation. Block 1 performs the initial feature extraction using 64 filters of size 3 × 3, capturing low-level patterns such as edges, blobs, and simple textures. The ReLU activation introduces non-linearity, which is essential for enabling the model to approximate complex functions. The max pooling in the model reduces the spatial dimensions from 300 × 300 to 150 × 150 to enhance computational efficiency and promote translational invariance. This early reduction helps manage memory requirements while preserving relevant structural information. Block 2 builds upon the output of Block 1 by increasing the depth to 128 feature maps. Maintaining the 3 × 3 convolutional kernels, this layer detects mid-level abstractions like curves, corners, and simple motifs that are often associated with typical patterns in malware structures or application behaviours. The consistent application of max pooling further downsamples the feature maps to 75 × 75 × 128 to facilitate a gradual abstraction of spatial hierarchies while compressing the image feature space. At this stage, our model begins to capture more distinctive visual signatures associated with specific malware families and benign software clusters.
In Block 3, the number of filters is doubled again to 256, and the spatial resolution is halved to 37 × 37. This increase in filter depth enables the model to learn more complex and task-specific patterns. These include embedded payload indicators, obfuscation patterns, or highly repetitive benign elements, such as UI layouts. Each convolution continues to be followed by a ReLU activation, and pooling is employed to preserve the most salient features while further reducing dimensionality. The growing depth also supports the learning of features that span larger spatial extents, which is particularly beneficial for identifying global patterns in software structure. Block 4 serves as the final stage of convolutional processing, employing 512 filters to extract highly abstracted and semantically rich features from the input. In this block, we reduced the spatial resolution to 18 × 18 to strike a balance between maintaining sufficient spatial context and achieving significant data compression. This deep feature representation is critical for differentiating between subtle yet impactful behaviours in malware versus benign applications.
The use of ReLU and max pooling continues to ensure efficient computation and gradient propagation, which contributes to the stability and robustness of the training process of our model. Following the convolutional backbone, the output tensor is flattened into a one-dimensional vector to serve as input to the fully connected classification head. This section of our proposed architecture is designed to interpret the learned high-dimensional representations for final decision making. The first dense layer consists of 1024 neurons with ReLU activation, which provides sufficient capacity to model the complex distribution of malware and benign classes.
Dropout regularisation is applied at multiple points within the head to mitigate overfitting, which is a critical consideration given the class augmentation and potential for label noise in malware data. The final classification layer reduces the 1024-dimensional vector to an output of two classes corresponding to the binary classification categories malicious and benign. This is followed by a softmax activation to produce normalised probability distributions over the two classes. The network is trained using a categorical cross-entropy loss function with performance optimised via the Adam optimiser. The separation between the convolutional feature extractor and the dense classifier ensures flexibility for future extensions, including transfer learning, multimodal fusion, or attention mechanisms. This makes the network a robust baseline model for intelligent malware classification.
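As a rough illustration of the output stage described above, the following sketch computes a two-class softmax and the categorical cross-entropy loss for a single sample. The logit values and the class ordering (benign, malicious) are hypothetical and are not taken from the study.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, one_hot):
    """Loss for one sample: minus the log-probability assigned to the true class."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))

logits = [2.0, -1.0]  # hypothetical scores for (benign, malicious)
probs = softmax(logits)
loss_if_benign = categorical_cross_entropy(probs, [1, 0])
loss_if_malicious = categorical_cross_entropy(probs, [0, 1])
```

Because the network in this example favours the benign class, the loss is small when the true label is benign and large when it is malicious, which is the signal the optimiser uses to adjust the weights.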
Table 7 presents a summary of the architectural stages along with the corresponding output shapes.
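The output shapes quoted in this section can be reproduced with a short calculation. The sketch assumes that each 3 × 3 convolution preserves the spatial size ("same" padding) and that each max pooling layer halves it, rounding down for odd sizes; neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
def trace_shapes(input_size=300, filters=(64, 128, 256, 512)):
    """Return (height, width, channels) after each convolution + pooling block.

    Assumes size-preserving convolutions and pooling that halves the
    spatial size with floor rounding.
    """
    shapes = []
    size = input_size
    for f in filters:
        size = size // 2  # pooling halves the spatial size
        shapes.append((size, size, f))
    return shapes

shapes = trace_shapes()
# [(150, 150, 64), (75, 75, 128), (37, 37, 256), (18, 18, 512)]

height, width, channels = shapes[-1]
flattened = height * width * channels     # 165888 values after flattening
dense_params = flattened * 1024 + 1024    # weights + biases of the first dense layer
```

Under these assumptions, the flattened vector has 165,888 entries, so the first 1024-neuron dense layer alone holds roughly 170 million parameters. This makes the dropout regularisation in the classification head a sensible precaution.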
4.2. Static Binary Model
Figure 7 illustrates the convolutional blocks of the static model together with the fully connected layers that follow them. The convolutional neural network (CNN) architecture developed for static malware analysis was tailored to classify RGB image representations generated from the binary streams of executable files, effectively distinguishing benign from malicious software instances.
The model comprises four convolutional blocks. The first block employs 64 filters, while the subsequent three blocks utilise 128, 256, and filters. This design approach was to achieve a balance between model expressiveness and computational efficiency. This configuration was determined empirically to optimise performance and mitigate overfitting. Each convolution operation employs a 3 × 3 kernel with ReLU activation, followed by a 3 × 3 max-pooling layer to progressively downsample the spatial dimensions. A stride of 2 was applied during convolution to further accelerate spatial reduction and enhance training efficiency, particularly considering the relatively simpler structure of the binary-derived images compared to memory-based representations. The resulting two-dimensional feature maps were transformed into a one-dimensional vector using a flattening operation to enable compatibility with subsequent dense layers for classification. This vector was then fed into two successive fully connected layers. To mitigate overfitting, dropout regularisation was applied after each dense layer with rates of 0.25 and 0.2. These dropout values were empirically determined through preliminary hyperparameter tuning.
4.3. Dynamic Memory Dump Model
An additional convolutional neural network (CNN) model was designed to perform classification based on dynamic analysis, utilising RGB images derived from memory dumps collected during software execution within a controlled virtualised environment. These visual representations captured the in-memory behaviour and structural characteristics of the running programs. Given the inherent variability between dynamic and static datasets, this model was independently fine-tuned, while still adhering to the core architectural framework detailed in
Section 4.2. Unlike static binary representations, memory dump images exhibited higher structural complexity and noise, necessitating architectural modifications to preserve classification accuracy. As illustrated in
Figure 8, the memory-based architecture, similar in structure to the binary model, was composed of three convolutional blocks.
Each block integrated a 3 × 3 convolutional layer, followed by a 2 × 2 max pooling operation, which is slightly smaller than the 3 × 3 pooling used in the binary model; the smaller window preserves more spatial detail. Dropout layers were incorporated after pooling to mitigate the risk of overfitting. This design supported effective non-linear feature extraction while retaining architectural coherence with the static model. However, unlike the static configuration, which utilised a more conservative filter depth, the memory model implemented a progressively increasing number of filters to accommodate the higher complexity and variability characteristic of memory dump data. Also, the stride parameter was reduced from 2 to 1 to enhance the resolution of spatial features, which was essential for capturing fine-grained behavioural patterns in dynamic contexts. After feature extraction, the resulting output was flattened and subsequently fed into two fully connected layers comprising 64 and 32 units, respectively. Both layers incorporated L2 regularisation (with a coefficient of 0.09) to reduce overfitting by penalising excessively large weights. Dropout layers with rates of 0.4 and 0.3 were inserted between the dense layers. These rates were elevated relative to the static model to address the increased variability and further prevent overfitting. A detailed summary of the architectures for both models, with the final parameter values determined through hyperparameter tuning, is presented in
Table 8.
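The effect of the stride and pooling choices on spatial resolution can be checked for a single block. This is a sketch under stated assumptions: "same"-padded convolutions, non-overlapping pooling without padding, and a 300-pixel input as in Section 4.1. The helper name is hypothetical.

```python
import math

def block_output_size(n, conv_stride, pool_size):
    """Spatial size after one convolution + max pooling block."""
    after_conv = math.ceil(n / conv_stride)            # 'same'-padded convolution (assumed)
    return (after_conv - pool_size) // pool_size + 1   # non-overlapping pooling (assumed)

static_size = block_output_size(300, conv_stride=2, pool_size=3)   # stride 2, 3 x 3 pooling
dynamic_size = block_output_size(300, conv_stride=1, pool_size=2)  # stride 1, 2 x 2 pooling
```

Under these assumptions, the first static block reduces a 300-pixel side to 50 pixels, whereas the first dynamic block retains 150 pixels. This is consistent with the stated goal of preserving more spatial detail for memory dump images.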
5. Results
5.1. Evaluation Metrics
To assess the effectiveness of the convolutional neural network (CNN) models, a comprehensive set of evaluation metrics was employed. These included accuracy, precision, recall, F1-score, and confusion matrices. Among these, accuracy and recall were emphasised due to their critical role in measuring the models’ capability to detect malware accurately while minimising classification errors. In particular, recall was deemed highly significant, as false negatives (instances where malware is misclassified as benign) present a greater security threat than false positives. Collectively, these metrics offer a holistic view of each model’s discriminatory power between benign and malicious software instances. The definitions and corresponding formulas for each metric are provided below, along with their relevance to the study.
TP (True Positives): malicious samples correctly identified as malicious. TN (True Negatives): benign samples correctly identified as benign. FP (False Positives): benign samples incorrectly classified as malicious. FN (False Negatives): malicious samples incorrectly classified as benign. To evaluate the classification performance of the proposed model, several standard metrics were employed. Accuracy offers a general indication of the proportion of correct predictions relative to the total number of predictions made, as expressed in Equation (19). While useful, accuracy can be misleading in cases of class imbalance, which is common in malware datasets. Precision assesses the proportion of instances predicted as malicious that are truly malicious, providing insight into the model’s ability to minimise false alarms, which is an important consideration in malware detection tasks, as defined in Equation (20). Recall, also known as sensitivity, quantifies the proportion of actual malicious samples that the model correctly identifies. This metric highlights the model’s capacity to detect true threats, as expressed in Equation (21). To account for the trade-off between precision and recall, the F1-score is employed. This harmonic mean provides a single measure of the model’s balance between detecting malware and avoiding false positives, particularly relevant in imbalanced classification settings. The F1-score is calculated as expressed in Equation (22).
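The four metrics described above follow directly from the confusion-matrix counts. The sketch below uses their standard definitions; the counts passed in are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
    precision = tp / (tp + fp)                  # share of predicted-malicious that are malicious
    recall = tp / (tp + fn)                     # share of actual malicious that are detected
    specificity = tn / (tn + fp)                # share of actual benign that are passed as benign
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts chosen only to make the arithmetic easy to follow.
m = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

With these example counts, accuracy is 0.925 while recall is 0.90, which shows how a model can look strong overall and still miss one malicious sample in ten.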
5.2. Binary Streams
The static CNN model exhibited robust classification performance on the test dataset, attaining an overall accuracy of 0.9945 (99.45%) and a recall of 1.00, as illustrated in
Figure 9. The confusion matrix, as illustrated in
Figure 10, summarises the performance of the binary classification model in distinguishing between the Benign and Malicious classes. The classifier demonstrates exceptionally high accuracy, with most predictions falling on the diagonal of the confusion matrix. This result, while impressive, was not interpreted in isolation: in a security-sensitive domain such as malware detection, false positives and false negatives carry different consequences, so the sensitivity for the Malicious class and the specificity for the Benign class were calculated as well. The static CNN model achieves 0.9933 (99.33%) sensitivity for the Malicious class. This high recall indicates that the model is highly effective at identifying malicious samples and reducing the likelihood of malware bypassing detection. The static model also achieved a specificity of 0.9957 (99.57%) for benign samples, suggesting that benign samples are rarely misclassified as malicious. This is crucial to minimise disruption to legitimate users or applications in real-time detection. The model achieves a precision (Malicious class) of 0.9957 and an F1-score of 0.9945, reflecting a low false positive rate and few unnecessary alerts. These metric values reflect the model’s consistent ability to accurately differentiate between classes across various threshold values. The learning curves in
Figure 11 reveal that both training and validation accuracies remain consistently high throughout the training process, with no observable signs of overfitting. The corresponding loss curves further support this by showing a smooth and stable convergence over 46 epochs. Collectively, these findings confirm the strong classification performance of the static CNN model in effectively distinguishing between benign and malicious files with minimal misclassification.
While the static model’s performance is exceptionally strong across all major metrics, a deeper inspection is warranted before real-world deployment. Out of the 74,080 samples, only 8 false positives were recorded, which means that benign applications are rarely flagged unnecessarily; this is crucial in enterprise environments. The model also recorded 19 false negatives. Although small in absolute terms, these instances represent malicious samples that evaded detection. In security contexts, even a small number of undetected threats can lead to significant vulnerabilities or breaches. This is especially critical when scaled to millions of samples in a deployment. Several underlying factors may contribute to the 19 false negatives observed. Malicious instances may exhibit feature distributions similar to those of benign samples. The dataset may also contain polymorphic malware that mimics benign behaviour to evade detection, leading the model to misclassify it because of overlapping patterns in the feature space. However, since false negatives account for less than 0.1% of the samples, the model is robust enough to be deployed in compliance-driven industries in line with industry best practice.
5.3. Memory Dump
The convolutional neural network (CNN) trained on the dynamic (memory dump) dataset demonstrated strong classification performance, achieving an overall accuracy of 99.21%, recall of 97.9%, and F1-score of 99.21%, as illustrated in the confusion matrix in
Figure 12. The low false positive rate is particularly critical in real-world deployment scenarios, where erroneously classifying benign software as malicious can result in unnecessary operational disruptions and administrative overhead. The use of augmented attributes potentially derived through feature engineering techniques such as entropy analysis, opcode frequency transformation, and memory pattern abstraction contributed to enhanced separability between the two classes. This augmentation enabled the dynamic model to learn more discriminative features, leading to tighter decision boundaries.
While accuracy remains high, the marginal reduction in recall for the Malicious class relative to the static model suggests a minor decline in the model’s sensitivity to certain malware variants. This degradation, though numerically small, is of particular concern in threat detection contexts, where even a few undetected malicious instances can have significant operational and security implications. The presence of false negatives in the dynamic environment may stem from temporal variability in runtime behaviour or insufficient generalisation over memory state features. The model has a high precision for the Malicious class, indicating a minimal false positive rate and strong reliability in positively identified threats. However, the precision achieved in the dynamic model is lower by 0.0157 compared to the static model. To further evaluate the discriminatory power of the proposed CNN model, a Receiver Operating Characteristic (ROC) analysis was conducted. The resulting Area Under the Curve (AUC) of 0.98 demonstrates near-optimal performance, indicating the model’s exceptional ability to distinguish between benign and malicious samples across varying classification thresholds, as illustrated in
Figure 13.
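The AUC can be read as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition by comparing every malicious/benign pair, with ties counted as half. The scores are hypothetical and do not come from the study.

```python
def auc_by_ranking(malicious_scores, benign_scores):
    """AUC as the fraction of (malicious, benign) pairs that are ranked correctly."""
    wins = 0.0
    for m in malicious_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(malicious_scores) * len(benign_scores))

# Hypothetical model scores (estimated probability of "malicious").
malicious = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2, 0.1]
auc = auc_by_ranking(malicious, benign)
```

In this toy example, 11 of the 12 pairs are ordered correctly, so the AUC is 11/12. The single misordered pair is the malicious sample scored 0.4 against the benign sample scored 0.7.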
The ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different decision thresholds, provides a threshold-independent assessment of classifier performance. The achieved AUC of 0.98 places the model within the excellent performance tier: a randomly chosen malicious sample is scored higher than a randomly chosen benign one with a probability of approximately 0.98. This high AUC value complements the previously reported metrics (recall, precision and F1-score), confirming that the model performs well not only at its fixed classification threshold but also across a range of possible operating points. This is particularly important in security-sensitive applications, where the optimal decision threshold may vary with operational constraints, such as prioritising low false negatives in high-risk deployments. Moreover, the shape of the ROC curve, which bows towards the top-left corner, reflects consistent trade-offs between sensitivity and specificity, reinforcing the model’s robustness. The marginal deviation from a perfect AUC is most likely attributable to edge-case samples that exhibit behaviour or characteristics common to both benign and malicious classes.
The training dynamics of the CNN model, as depicted in Figure 14, indicate consistent convergence across 34 epochs. This is characterised by a monotonic reduction in both training and validation loss, accompanied by a corresponding increase in classification accuracy. The absence of divergence between the training and validation curves indicates no observable overfitting and suggests that the model generalises well to previously unseen dynamic samples.
Despite its overall robustness in classifying memory-based malware, the CNN’s performance on the dynamic dataset exhibited a marginal decline in recall relative to its static counterpart. This slight reduction in sensitivity is likely attributable to the higher temporal and behavioural variability inherent in memory dump data, which may introduce subtle feature ambiguities during runtime. When comparing the CNN architectures trained on the static (binary stream) and dynamic (memory dump) datasets, both models demonstrated strong generalisation capabilities and high overall accuracy. Nevertheless, the static model exhibited consistently superior performance across all key evaluation metrics, suggesting enhanced discriminative power in capturing the invariant patterns present in static representations.
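As a reading aid for the ROC analysis in this subsection, the following minimal sketch illustrates the ranking interpretation of the AUC and the way TPR and FPR change with the decision threshold. The labels and scores are hypothetical toy values, not outputs of the models evaluated here, and the function names are illustrative only.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (malicious, benign) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # malicious sample scored above benign sample
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def rates_at_threshold(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are flagged as malicious."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


# Hypothetical toy data: 1 = malicious, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10]

auc = auc_by_ranking(labels, scores)                   # 15 of 16 pairs -> 0.9375
strict = rates_at_threshold(labels, scores, 0.50)      # (0.75, 0.25)
permissive = rates_at_threshold(labels, scores, 0.35)  # (1.0, 0.25)
```

In this toy case, lowering the threshold from 0.50 to 0.35 recovers the one missed malicious sample without adding false positives, which is the kind of operating-point choice the ROC curve makes visible.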
5.4. RGB Visualisation
The visual representations illustrated in Figure 15 are categorised according to the hierarchical structure that outlines the dataset organisation and feature mapping taxonomy. The RGB images were derived from binaries and memory dumps through a byte-to-pixel encoding. Each byte is linearly mapped across the RGB channels to produce a two-dimensional spatial representation of the executable content. This transformation preserves the local and global structures of the binaries and memory dumps within the visual domain.
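A byte-to-pixel encoding of this kind can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure used in this study: it assumes that three consecutive bytes form one (R, G, B) pixel, that the image width is fixed in advance, and that the tail is zero-padded to complete the last row. The function name and the `width` parameter are hypothetical.

```python
import math


def bytes_to_rgb_rows(data: bytes, width: int = 4):
    """Map a byte sequence to rows of (R, G, B) pixels.

    Assumptions (not taken from the paper): three consecutive bytes form
    one pixel, and the tail is zero-padded so the last row is complete.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    n_pixels = math.ceil(len(data) / 3)
    height = max(1, math.ceil(n_pixels / width))
    # Pad with zero bytes up to height * width pixels of 3 bytes each.
    padded = data.ljust(height * width * 3, b"\x00")
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


# Ten bytes at width 2 give a 2 x 2 image; the last pixel is zero-padded.
rows = bytes_to_rgb_rows(bytes(range(10)), width=2)
# rows == [[(0, 1, 2), (3, 4, 5)], [(6, 7, 8), (9, 0, 0)]]
```

Because byte order is preserved row by row, contiguous regions of the file or memory dump remain contiguous in the image, which is what allows section-level structure to appear as bands and blocks.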
Figure 15a illustrates a diverse set of benign software binaries, including freecad, freeorion, geogebra, googledrive, gyazo, kitty, malwarebytes, and nextcloud. These images exhibit relatively high entropy and heterogeneous textures, reflecting structured compilation and modular code practices typical of legitimate software development. The pixel patterns display smoother gradients and multi-tonal distributions, particularly in gyazo and nextcloud, suggesting regularity in the code layout and data sections. The prevalence of colour noise in these images may be attributed to embedded media, GUI libraries, or multilingual resource packs.
Figure 15b reveals the RGB-transformed visual characteristics of malicious binaries. In contrast to the benign samples, these exhibit highly distinctive visual traits, such as sharp contrast blocks, abrupt transitions, and uniform horizontal/vertical striping patterns. For example, the first image in Figure 15b contains repetitive rectangular artefacts that suggest packed or obfuscated code, which is commonly used in malware to evade static analysis. The second sample shows a dominant white block, likely indicating padded sections or alignment artefacts introduced by packers. In addition, it shows reduced visual entropy and heightened structural regularity, pointing to the presence of synthetic code injection and encrypted payloads.
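The entropy observations above are qualitative. One way such observations could be quantified, offered here only as an assumption and not as a step performed in this study, is to compute the Shannon entropy of the underlying bytes block by block. The function names and the block size below are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte sequence in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def block_entropies(data: bytes, block_size: int = 256):
    """Entropy of each consecutive block; block_size is an illustrative choice."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


# A constant-valued (padding-like) block followed by a block containing
# every byte value exactly once.
sample = b"\xff" * 256 + bytes(range(256))
profile = block_entropies(sample)
# profile[0] is 0 bits per byte, profile[1] is 8 bits per byte
```

A constant-valued region, such as padding, scores 0 bits per byte, whereas a region in which all byte values are equally frequent scores the maximum of 8, so a per-block profile separates uniform blocks from noise-like regions.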
Figure 15c,d illustrate memory dump visualisations of benign and malicious software, respectively. The visual patterns provide a structural and behavioural contrast between legitimate applications and malware executing in memory. In Figure 15c, the memory visualisations of benign software reveal high intra-sample variability, yet share a common theme of structured horizontal bands, uniform textures, and clear segment separations. These characteristics suggest systematic memory allocation and modular execution flows, consistent with compiler optimisations and standard operating system routines for legitimate applications. For instance, gimp exhibits a large upper block of uniform black, possibly representing unused memory or zeroed regions, which is common in idle or GUI-intensive applications. These patterns reflect predictable and regulated runtime behaviour, yielding stable execution fingerprints for benign software. On the other hand, Figure 15d presents more uniform, noisy, and dense textures across the image plane. This indicates highly packed or obfuscated memory regions, often resulting from runtime unpacking, self-modifying code, or the use of encryption. In addition, the lack of modularity or segmentation compared to the benign samples suggests non-standard memory access patterns and volatile behaviour common in advanced malware variants. The dense pixel streaks observed in some samples point to code injection, API hooking, shellcode execution, or memory scraping, all of which overwrite or manipulate memory for exploitation.
7. Conclusions
This study proposes a unified detection framework that leverages both static and dynamic analysis features for malware classification based on RGB image representations. By transforming both binary and memory dump data into RGB images, our approach bridges the structural and behavioural perspectives of malicious software. The architectural specialisation of the static and dynamic models allowed for domain-specific feature extraction, optimising performance in each analysis context. The extensive use of data augmentation, including CycleGAN-based cross-domain synthesis, significantly expanded the training dataset and contributed to model generalisability. The static model demonstrated exceptional performance, achieving an accuracy of 99.45% and perfect recall, underscoring its effectiveness in detecting structural malware patterns. Meanwhile, the dynamic model achieved a commendable accuracy of 99.21%, despite the inherent noise and variability of memory-based data. These results show the efficacy of multimodal visual learning for detecting both known and evolving malware variants.
The heatmap-based dataset analysis confirms the balanced and diverse composition of both benign and malicious samples, which is an essential factor in ensuring fair and unbiased model training. The complementary nature of the two CNN architectures highlights the practical potential for deployment in layered malware detection systems, where static pre-screening is followed by dynamic profiling for ambiguous cases. These findings point to a broader trade-off between representation complexity and dataset completeness, especially in dynamic analysis scenarios where subprocess interactions or incomplete memory acquisition can degrade performance. Unlike previous works that rely heavily on handcrafted descriptors or static grayscale transformations, this study demonstrates the viability of direct RGB image conversion from both binaries and memory snapshots, streamlining the pipeline while maintaining high classification accuracy. A distinguishing feature of the proposed model is its architectural simplicity, which, when paired with data augmentation and controlled pre-processing, is sufficient for achieving robust malware detection.
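The layered deployment mentioned above can be sketched as a simple routing rule. This is an illustrative outline only: the thresholds `low` and `high`, the 0.5 cut-off for the dynamic stage, and the function and argument names are hypothetical and were not evaluated in this study.

```python
def layered_verdict(static_score, run_dynamic_model, low=0.10, high=0.90):
    """Route a sample through static pre-screening, then dynamic profiling.

    static_score      -- malicious probability from the static model
    run_dynamic_model -- zero-argument callable returning the dynamic
                         model's malicious probability; it is only called
                         for ambiguous cases, since acquiring a memory
                         dump is the expensive step
    low, high         -- hypothetical confidence thresholds
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    if static_score >= high:
        return "malicious", "static"
    if static_score <= low:
        return "benign", "static"
    # Ambiguous static result: fall back to dynamic profiling.
    dynamic_score = run_dynamic_model()
    label = "malicious" if dynamic_score >= 0.5 else "benign"
    return label, "dynamic"


# Hypothetical usage with stand-in scores.
confident = layered_verdict(0.97, lambda: 0.0)   # ("malicious", "static")
ambiguous = layered_verdict(0.55, lambda: 0.20)  # ("benign", "dynamic")
```

Passing the dynamic stage as a callable keeps it lazy, so confidently classified samples never incur the cost of execution and memory acquisition.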
While the experimental framework was designed to isolate and compare static and dynamic features under uniform conditions, certain limitations of the proposed model should be noted. The dynamic analysis was constrained to the memory region of the primary process, omitting potential behavioural artefacts arising from spawned child processes or inter-process code injection. This limitation may have led to incomplete behavioural representations. To address it, future work should consider full-system memory dumps or process tree tracing techniques, which would make the dynamic analysis pipeline more complete and provide a more holistic characterisation of runtime behaviour. In addition, real-time performance optimisation and the incorporation of memory-efficient lightweight CNNs will be explored to enable on-device inference in endpoint protection systems.