MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics

Czaplicki, Jakub; Rahouti, Mohamed; Chehri, Abdellah; Hayajneh, Thaier

doi:10.3390/fi17120554

Open AccessArticle

MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics

¹

Department of Computer and Information Science, Fordham University, 113 W 60th Street, New York, NY 10023, USA

²

Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON K7K 7B4, Canada

^*

Author to whom correspondence should be addressed.

Future Internet 2025, 17(12), 554; https://doi.org/10.3390/fi17120554

Submission received: 10 November 2025 / Revised: 24 November 2025 / Accepted: 28 November 2025 / Published: 1 December 2025

(This article belongs to the Special Issue Intrusion Detection and Resiliency in Cyber-Physical Systems and Networks—2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

The Internet has been progressively more integrated into daily life through its evolutionary stages, ranging from web1.0 to the current development of web3.0. These continued integrations broaden the attack surface that cybercriminals aim to exploit. The prevalence of cybercrimes, particularly malware attacks, has become increasingly sophisticated and made more accessible through dark web marketplaces. Including artificial intelligence (AI) within anti-virus solutions has challenged the traditional dichotomy of malware detection schemes, offering more accurate and holistic detection capabilities. Research has shown that transforming malware files into textured images offers resistance to obfuscation and the potential to detect zero days. This paper explores the application of image quality assessment (IQA) techniques in enhancing visual malware dataset curation. We propose a novel framework that applies a no-reference IQA algorithm to evaluate current datasets and offer guidance in future dataset curation. Using multiple popular datasets, our evaluation demonstrates that the proposed MalScore framework effectively differentiates dataset quality—for example, MalNet Tiny achieves the highest score of 95%, while the NARAD malicious-image subset scores 50%. Additionally, BRISQUE was the only IQA algorithm to exhibit a strong linear sensitivity to blur levels across datasets. These results highlight the practical utility of MalScore in assessing and ranking visual malware datasets and lay the groundwork for uniting IQA and visual malware detection in future research.

Keywords:

visual malware detection; image quality assessment; synthetic dataset generation

1. Introduction

The Internet has steadily grown in popularity since its inception, evolving through various stages until reaching the widely adopted Web2.0 era, while Web3.0 technologies continue to emerge and evolve [1]. Each evolutionary step has expanded the digital ecosystem and, with it, the attack surface that cybercriminals seek to exploit. As online services have become more integral to daily life, the prevalence and sophistication of cyberattacks, particularly malware, have continued to rise [2]. Malware operations today benefit from low entry barriers, widespread distribution channels, and increasingly advanced evasion techniques [3], underscoring the need for robust, scalable detection mechanisms [4].

To address these evolving threats, modern cybersecurity research increasingly relies on machine learning (ML) and deep learning (DL) to complement traditional static and dynamic analysis techniques [5], while static analysis is efficient for known malware, it struggles against obfuscation; dynamic analysis can detect new variants but is resource-intensive and difficult to scale [6]. Visual malware detection offers an alternative by converting malware binaries into grayscale or RGB images (Figure 1) that can be processed by DL architectures, such as convolutional neural networks (CNNs), enabling them to learn texture and structural patterns resilient to certain obfuscation techniques. Nataraj et al. [6] first established this transformation process, which has since become foundational to DL-based visual malware analysis.

Building on the techniques of representing malware as images, DL has become a powerful approach for learning discriminative patterns from high-dimensional data. Numerous architectures, including deep neural networks (DNNs), CNNs, and recurrent neural networks (RNNs), have been successfully applied to malware detection tasks [7]. Although these models differ in structure and capabilities, they share a critical dependency: their performance is directly tied to the quality of the dataset used for training. Poor-quality images, inconsistent preprocessing, noise, distortions, or insufficient diversity can degrade learning, reduce generalizability, and undermine evaluation results. Consequently, curating high-quality, consistent, and up-to-date visual malware datasets is essential for advancing detection research.

This challenge brings us to the field of image quality assessment (IQA). IQA provides objective mechanisms to quantify image clarity, distortion, noise, and naturalness—factors that directly influence the effectiveness of DL models trained on visual representations. IQA has been widely applied in multimedia systems and content distribution pipelines, where image quality strongly affects performance and user experience [8]. In this domain, three types of IQA algorithms are commonly used:

Full-reference algorithms: Compare a distorted image with an original pristine reference.
Reduced-reference algorithms: Compare extracted features from pristine and distorted images to estimate quality.
No-reference algorithms: Operate without any reference image, evaluating quality solely from the provided pixels or bitstream. These methods are among the most challenging due to the absence of ground truth [9].

Despite the increasing reliance on visual malware datasets, no standardized methodology currently exists to assess their image quality or guide consistent dataset creation. High-entropy malware images differ significantly from natural images, making it unclear which IQA algorithms are reliable for this domain. This gap motivates the need to investigate IQA, particularly no-reference approaches, as a means to evaluate, compare, and ultimately improve visual malware datasets used in DL-based detection systems.

Building upon these research directions, this work proposes a framework to standardize the creation of new visual malware datasets and evaluate current ones. The main contributions of this paper are summarized as follows:

Bridging the gap between IQA and visual malware detection: We integrate IQA techniques with visual malware detection, enhancing the quality and robustness of malware image datasets against obfuscation and zero-day threats.
Investigation of no-reference IQA (NR-IQA) algorithms: We apply NR-IQA algorithms to evaluate the quality of visual malware datasets without needing a pristine reference image. This helps measure image quality and detect issues like blur and noise, crucial for maintaining dataset integrity.
MalScore formula and framework: We propose the MalScore framework, which provides a standardized method to evaluate and compare visual malware datasets using metrics such as BRISQUE score, labeling clarity, dataset size, content diversity, data recency, and preprocessing needs. This formula helps identify areas for improvement and guides the creation of high-quality, diverse, and up-to-date datasets, improving DL model performance in malware detection.

The remaining sections of this paper are outlined as follows: Section 2 discusses the current state of research regarding visual malware detection and IQA. Section 3 summarizes the experiment methodology and provides a discussion. Section 4 describes the proposed framework to evaluate existing datasets and guide the creation of new ones. Section 5 describes the limitations faced and offers direction for future work. Section 6 presents concluding remarks.

2. Related Work

There has been considerable attention in the literature on visual malware detection and IQA. However, further research is needed to explore the application of IQA in enhancing the development of visual malware detection.

Nataraj et al. [6] streamlined the process of converting a malicious software’s bit stream into a visual representation, demonstrating that basing detection on grayscale images and their textures was resilient to obfuscation techniques. They applied an ML algorithm, GIST, achieving a 98% classification accuracy. Building on this, the next logical step was to investigate using RGB images instead of grayscale. He and Kim [10] explored this by correlating every three bytes to red, green, and blue, respectively. Researchers soon adopted a DL approach to the visual malware classification problem, paving the way for neural network methodologies. Aslan and Yilma [7] discussed various DL architectures such as CNNs and hybrid deep neural networks. Additionally, other practical DL approaches, like using autoencoders, have been showcased in the work by Xing et al. [11].

The issue of limited visual malware datasets is frequently highlighted in many studies. Datasets often lack the desired variety or suffer from class imbalance. Additionally, some organizations are hesitant to distribute malware samples due to their sensitive nature, further complicating dataset curation. The use of conditional generative adversarial networks (cGANs) to generate synthetic malware samples has been examined in several studies [12,13]. These studies demonstrated that cGANs are effective in mitigating class imbalance issues and can produce a diverse range of malware samples, including synthetically obfuscated malware images.

Malware obfuscation techniques continue to challenge researchers, regardless of the detection method used. Although Nataraj et al. [6] showed resistance to obfuscation techniques using an ML method, further research is needed. Notably, Phan et al. [14] demonstrated that reinforcement learning (RL) paired with GANs can generate samples that effectively bypass malware classifiers.

The cited literature demonstrates that dataset curation is a complex task requiring a wide variety of malware samples. A diverse range of malware image samples can make a model more resistant to adversarial attacks and better at detecting unseen samples and zero-day threats. Currently, there is no standard model for dataset creation. Studies such as [6,10,15] provide some guidance on the methodology of converting malicious code bit streams into grayscale and RGB image datasets. However, while these processes are explained, more detail is needed regarding image resolution and quality standards.

To address the dataset quality issue, image-specific concepts must be considered. Ying et al. [9] differentiated between esthetic images and perceptual quality, noting that an image may be visually pleasing but not necessarily of excellent picture quality. Additionally, images may have the same level of degradation but differ perceptually in perceived quality. These challenges fall within the scope of the NR-IQA research problem, leading to the development of specific algorithms optimized for various tasks. Zhai and Min [16] introduced several NR-IQA algorithms and discussed human opinion scoring and the purpose behind these algorithms. For instance, they introduced BRISQUE as a spatial quality evaluator, which is considered a general-purpose IQA measure. Shahid et al. [8] generalized common issues affecting photos, such as blurring, blockiness, ringing, and noise, which are metrics used by pixel-based NR-IQA algorithms to provide a score. Additionally, NR-IQA algorithms can be tuned to detect the “naturalness” of an image.

Unlike existing studies that primarily focus on either the technical aspects of malware detection or the quality assessment of general images, our paper bridges these two fields by applying IQA techniques specifically to the domain of visual malware detection. We propose the MalScore framework, which uses no-reference IQA algorithms to evaluate and enhance the quality of malware image datasets, thereby improving the robustness and effectiveness of malware detection models.

3. Experimental Methodology and Discussion

This section outlines the experimental methodology and discusses the implementation and evaluation of our proposed techniques for enhancing visual malware detection.

3.1. Benchmark Datasets

Experiments were performed on five datasets: Malimg, Malevis, MalNet, NARAD, and Malware as Images.

3.1.1. Malimg Dataset [17]

It contains 9339 visual malware samples spanning across 25 families. For CNN training and blur assessment, the dataset used was enriched with an additional 981 benign grayscale samples obtained from [18]. During framework calculations, the original Malimg dataset was used.

3.1.2. Malevis Dataset [19]

It contains 9100 training samples with an additional 5126 validation RGB samples, all corresponding to 25 malware families.

3.1.3. MalNet Dataset [20]

It is a massive, 80GB database of Android malware image representations and graph data. It contains 1,262,024 images spanning over 696 families and 47 different types of malware. To improve logistics, the MalNet-Tiny dataset was used containing 61,201 training images, 8742 validation images, and 17,486 test images spanning 43 types of malware.

3.1.4. The NARAD Dataset [21]

It contains 48,240 binary visualization images generated using fuzzy logic. The “Malicious images” subfolder was used containing 11,919 files.

3.1.5. Malware as Images [22]

It converted portable executables (PE) of both malicious and benign software. The malicious files were taken from the Zoo [23] malware repository and benign files were chosen from PC Magazine’s The Best Free Software of 2020. Each malicious folder contains 75 files with 8 various processing methods. Each benign folder contains 61 files also processed with 8 various processing methods. All original dataset images included an unacceptable X and Y axis border for CNN use. Photos were carefully cropped to remove the border. This manual cropping method could lead to minuscule amounts of data loss due to methodology constraints. To make the overall testing process faster, the Lanczos_1200 processed images were taken and down sampled by 25% using the Lanczos interpolation algorithm. Table 1 summarizes these known and miscellaneous datasets.

3.2. Dataset Degradation

Various levels of Gaussian blur were applied to the original datasets. This blur was intentionally introduced to degrade neural network classification and IQA performance. As the blur was applied with predefined percentages, statistical insight into the IQA algorithm sensitivity could be gained. Additionally, a correlation between IQA scores and neural network performance can be investigated. After the blur application, three datasets were derived: 3% blur, 10% blur, and 25% blur. In summary, four datasets were investigated with the unmodified original dataset containing 0% blur. Figure 2 illustrates the varying levels of Gaussian blur applied to a sample from the Malevis set.

To provide full reproducibility and clarify the underlying implementation, the Gaussian blur used in this study was applied using a parameterization that scales with image resolution. Specifically, for each image of width W and height H, the standard deviation

σ

of the Gaussian kernel was computed as

σ = α \cdot \sqrt{W^{2} + H^{2}},

where

α \in {0.03, 0.10, 0.25}

corresponds to the 3%, 10%, and 25% blur levels, respectively. This formulation ensures that blur severity is proportional to overall image size and therefore consistent across datasets with different dimensions. The Gaussian kernel applied was

G (x, y) = \frac{1}{2 π σ^{2}} exp (- \frac{x^{2} + y^{2}}{2 σ^{2}}),

which is a widely used low-pass filter in image processing and IQA research.

The following pseudocode summarizes the blur-generation procedure used throughout all experiments:

procedure APPLY_GAUSSIAN_BLUR(image, alpha):
W, H ← image dimensions
sigma ← alpha ∗ sqrt(W^2 + H^2)
blurred ← GaussianFilter(image, sigma)
return blurred

It is important to note that Gaussian blur is not intended to mimic any specific real-world degradation mechanism in malware image pipelines. Instead, it serves as a controlled, monotonic, and mathematically well-defined distortion that allows for systematic evaluation of IQA sensitivity and CNN robustness. Although real degradation in visual malware datasets typically arises from processes such as downsampling, interpolation, compression, or preprocessing inconsistencies rather than optical blur, these artifacts share low-pass characteristics that influence texture and entropy in comparable ways.

3.3. IQA Procedure

The IQA model experiments were facilitated by utilizing calibrated and configured algorithms. As shown in Algorithm 1, six NR-IQA algorithms were run on each original and degraded dataset:

BRISQUE [24]: BRISQUE evaluates the spatial quality and does not take into account ringing, blur, or blocking. Rather, it analyzes scene statistics to quantify a loss of “naturalness”.
CLIPIQA, CLIPIQA+, CLIPIQA+_rn50_512 [25]: CLIP operates under the notion that quantification of issues such as blur and noise do not directly correlate with human preferences. CLIP aims to assess the quality perception and abstract perception of an image. The differing versions of the implementation operate on different underlying infrastructures.
CNNIQA [26]: CNNIQA takes image patches as input and works in the spatial domain without traditional feature extraction methods. It utilizes one convolutional layer with max and min pooling preceding two fully connected layers with an output node. It is additionally capable of demonstrating local quality estimations.
DBCNN [27]: The DBCNN implementation consists of two specifically trained CNNs trained for different distortion scenarios. The model has been fine tuned on human-evaluated image databases.

Within BRISQUE, lower scores indicate a superior result. Conversely, for all other IQA algorithms, a higher score indicates a superior result. Each one of the six algorithms evaluates image quality differently, with lower scores indicating better quality for BRISQUE, and higher scores indicating better quality for the other algorithms.

Algorithm 1 IQA model experiments.

1:: procedure RunIQAAlgorithms(datasets)
2:: for dataset in datasets do
3:: originalDataset ← dataset.original
4:: degradedDataset ← dataset.degraded
5:: scoreBRISQUEOriginal ← RunBRISQUE(originalDataset)
6:: scoreBRISQUEDegraded ← RunBRISQUE(degradedDataset)
7:: scoreCLIPIQAOriginal ← RunCLIPIQA(originalDataset, ‘CLIPIQA’)
8:: scoreCLIPIQADegraded ← RunCLIPIQA(degradedDataset, ‘CLIPIQA’)
9:: scoreCLIPIQAPlusOriginal ← RunCLIPIQA(originalDataset, ‘CLIPIQA+’)
10:: scoreCLIPIQAPlusDegraded ← RunCLIPIQA(degradedDataset, ‘CLIPIQA+’)
11:: scoreCLIPIQAPlusRN50Original ← RunCLIPIQA(originalDataset, ‘CLIPIQA+_rn50_512’)
12:: scoreCLIPIQAPlusRN50Degraded ← RunCLIPIQA(degradedDataset, ‘CLIPIQA+_rn50_512’)
13:: scoreCNNIQAOriginal ← RunCNNIQA(originalDataset)
14:: scoreCNNIQADegraded ← RunCNNIQA(degradedDataset)
15:: scoreDBCNNOriginal ← RunDBCNN(originalDataset)
16:: scoreDBCNNDegraded ← RunDBCNN(degradedDataset)
17:: end for
18:: end procedure
19:: function RunBRISQUE(dataset)
20:: /* BRISQUE evaluates the spatial quality */
21:: score ← ExtractSceneStatistics(dataset)
22:: naturalnessScore ← QuantifyLossOfNaturalness(score)
23:: return naturalnessScore
24:: end function
25:: function RunCLIPIQA(dataset, version)
26:: /* CLIPIQA evaluates quality perception */
27:: if version == ‘CLIPIQA’ then
28:: score ← AssessQualityPerception(dataset)
29:: else if version == ‘CLIPIQA+’ then
30:: score ← AssessQualityPerceptionPlus(dataset)
31:: else if version == ‘CLIPIQA+_rn50_512’ then
32:: score ← AssessQualityPerceptionRN50(dataset)
33:: end if
34:: return score
35:: end function
36:: function RunCNNIQA(dataset)
37:: /* CNNIQA works in the spatial domain */
38:: patches ← ExtractImagePatches(dataset)
39:: convolutionLayer ← ApplyConvolutionLayer(patches)
40:: pooledFeatures ← ApplyMaxMinPooling(convolutionLayer)
41:: flattenedFeatures ← Flatten(pooledFeatures)
42:: outputScore ← FullyConnectedLayers(flattenedFeatures)
43:: return outputScore
44:: end function
45:: function RunDBCNN(dataset)
46:: /* DBCNN uses two CNNs trained for different distortion scenarios */
47:: cnn1 ← LoadTrainedCNN(‘distortion1’)
48:: cnn2 ← LoadTrainedCNN(‘distortion2’)
49:: score1 ← cnn1.evaluate(dataset)
50:: score2 ← cnn2.evaluate(dataset)
51:: combinedScore ← CombineScores(score1, score2)
52:: return combinedScore
53:: end function

3.4. CNN Training and Experimental Configuration

To correlate IQA algorithm scores with neural network performance, a simple ResNet-50 [28] CNN was implemented using TensorFlow. The objective was not to achieve high accuracy but rather to obtain a metric that could be correlated with blur levels and IQA scores. Each dataset was trained for ten epochs as a proof-of-concept, using only the Malevare set for training. The unmodified original dataset’s folders were combined and then processed using scikit-learn’s train_test_split. Folders within the datasets with applied blur were combined to form the training set, while the val folder within the original Malevare set was used as the test set for blurred datasets. Model validation accuracy was recorded and charted.

To ensure full reproducibility and satisfy best-practice standards in DL experimentation, we report the complete training configuration used in all CNN experiments. All models were trained on a workstation equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), an Intel Core i9–12900K CPU, and 64 GB RAM, running Ubuntu 22.04 LTS, CUDA 12.1, and cuDNN 8.9. The implementation was conducted in TensorFlow 2.14 with Keras high-level APIs and supporting libraries including NumPy 1.26, SciPy 1.11, scikit-learn 1.3 for dataset partitioning, and OpenCV 4.8 for blur operations. All images were resized to a unified resolution of 224 × 224 and normalized to the

[0, 1]

floating-point range. A batch size of 32 was used for all runs. Optimization was performed using the Adam optimizer with a learning rate of

1 \times 10^{- 4}

,

β_{1} = 0.9

,

β_{2} = 0.999

, and no weight decay; all other hyperparameters remained at TensorFlow defaults. Training proceeded for 10 epochs, consistent with the study’s proof-of-concept objective. No data augmentation was applied beyond the Gaussian blur described in Section 3.2.

The ResNet-50 backbone was initialized with ImageNet-pretrained weights, and only the final classification layers were fine-tuned; the remaining layers were left trainable to preserve representational capacity. The train/validation split used for the Malevare set followed scikit-learn’s train_test_split with an 80/20 ratio and a fixed random_state=42 to ensure deterministic reproducibility. During evaluation, accuracy was computed on the unaltered Malevis validation partition. All code used in these experiments is publicly available in the referenced GitHub repository [29], ensuring that all architectural, preprocessing, and parameter choices can be fully reproduced.

3.5. Results and Discussion

Each IQA algorithm generated a numerical score to evaluate the image quality. Surprisingly, most algorithms did not exhibit a linear correlation between the IQA scores and increasing blur levels. In some instances, such as with CLIPIQA+rn50_512, images with 25% blur consistently received higher average scores than the unblurred images. This outcome suggests that the algorithm is insensitive to blur and does not perform well with “unnatural” images. Similarly, other algorithms also proved unsuitable for dataset curation for analogous reasons. The field of IQA primarily focuses on images depicting “real” objects with artifacts, which may explain these results.

One hypothesis for the higher IQA scores for blurred images is that the blur reduced pixel-to-pixel variations, leading to a smoother appearance. Malicious software images typically contain high entropy, as each pixel corresponds directly to the bit stream. In contrast, “natural” images usually exhibit lower entropy, with fewer drastic pixel changes in the absence of artifacts. The applied blur likely smoothed the inherent noise and entropy of the dataset, resulting in higher “naturalness” scores from the IQA algorithms. Another hypothesis is that the blur simplified the dataset, making it easier for the algorithms to process. This simplification might have made the images appear more aesthetically pleasing to a human viewer.

Of all the IQA algorithms tested, only BRISQUE demonstrated sensitivity to increasing blur levels, exhibiting a strong linear correlation. As anticipated, greater blur resulted in higher BRISQUE scores, indicating progressively poorer image quality. Unfortunately, BRISQUE could not be run on the blurred Malimg dataset due to an issue with the PyTorch implementation, which led to a nan error and thus, N/A in the output for those subsets.

When selecting datasets for CNN testing, Malevis emerged as the most suitable candidate. The NARAD dataset lacked clear labeling of malicious and benign classes, making supervised learning infeasible. The Malware as Images dataset contained too few images (only 136 files) for effective CNN training. Additionally, BRISQUE was unable to compute scores on the blurred Malimg datasets, leaving no data points for comparison. Importing Malnet images into a neural network would have required significant logistical effort. Therefore, Malevis was chosen as a preliminary proof-of-concept, paving the way for future research directions.

To assess the relationship between image quality and CNN performance, the resulting validation accuracies were compared with the corresponding BRISQUE scores using Pearson and Spearman correlation coefficients. The Pearson correlation was 0.287 with a high p-value of 0.714, and the Spearman correlation was 0.6 with a p-value of 0.4, while the numerical coefficients indicate a directional trend, the large p-values clearly show that no statistically significant correlation can be claimed under conventional thresholds (e.g., 0.05 or 0.1). These results therefore reflect only a qualitative indication of a possible association rather than statistical evidence. The limited number of available data points restricts the statistical power of the analysis, reinforcing the need for larger-scale future experiments to more rigorously evaluate the relationship between BRISQUE scores and CNN performance.

4. Proposed Dataset Framework

This framework provides a series of metrics and weights considering IQA, essential dataset organization, and the needs of visual malware detection. These metrics and weights are summarized in a formula that can be used to compare and contrast datasets. Additionally, it can be used to identify areas that are lacking and need improvement. Normalizing relevant data and outputting a score can bring more awareness to the quality of malware image representations.

4.1. Framework Metrics

A series of six metrics are chosen to best represent the goals of MalScore. Each metric has an associated weight to introduce fair dataset classification. The following metrics constitute the MalScore formula.

4.1.1. Inverted and Normalized BRISQUE Score, 35%

The BRISQUE module outputs a numerical score relating to the quality of a provided image. The lower the score, the higher quality BRISQUE deemed the image. Theoretically, the lower bound of this output is 0, meaning that the image is perfect; however, there is no upper bound of the scoring system. This makes score normalization across datasets challenging, so it was opted to use relative normalization. The normalization formula, Q, is expressed as:

Q = 1 - \frac{B_{img}}{B_{\max}}

(1)

where

Q is the normalized quality score.
$B_{img}$ is the BRISQUE score of the current image.
$B_{\max}$ is the highest BRISQUE score observed in the dataset.

By inverting the score, Equation (1) transforms lower BRISQUE scores (which indicate higher quality) into higher numerical values for further use within the framework. Normalization aims to standardize scores across a common scale to allow for fair comparisons across various ranges of dataset performance. The weight of this metric was determined to be 35% as to place heavy emphasis on quality image generation.

4.1.2. Clear Labeling Scheme, 30%

The objective of incorporating datasets into supervised learning methods is to effectively distinguish samples according to a defined labeling scheme. This paper does not address the application of unsupervised techniques. It is imperative to utilize clear and precise labeling schemes to ensure the efficient and accurate classification of malicious samples. Consequently, this metric is assigned a weight of 30% to significantly penalize datasets that lack clear labeling. The input for this metric is a binary score: datasets with clear and complete labeling receive a score of 1, while those without receive a score of 0.

The use of a binary indicator for labeling clarity reflects the categorical nature of this property in visual malware datasets. A dataset either provides a complete, usable labeling scheme or it does not. As highlighted in Section 3.5, datasets such as NARAD lack sufficiently clear labels, making supervised learning infeasible rather than merely degraded. Because partial or ambiguous label structures cannot support CNN-based classification, a continuous scoring scale would not yield a meaningful distinction. The binary formulation therefore aligns with the practical usability requirements of supervised DL workflows.

4.1.3. Normalized Dataset Size, 25%

Accurate classification of samples using AI necessitates a sufficient quantity of data for effective learning. Inadequate data results in diminished AI performance. Therefore, a substantial weight of 25% is applied to emphasize the importance of curating large datasets. The scoring system relies on the normalization of datasets across compared samples. Given the absence of an upper bound, relative normalization is utilized. A normalized dataset score is derived as:

S_{dataset} = \frac{log (D)}{log (D_{\max})}

(2)

where

$S_{dataset}$ is the normalized dataset score.
D is the size of the current dataset.
$D_{\max}$ is the size of the largest observed dataset.

To facilitate the comparison of datasets of varying sizes, a logarithmic transformation was applied. This transformation helps mitigate the disproportionate influence of larger datasets, which can skew values due to their magnitude. The denominator, representing the size of the largest dataset, serves to incentivize the creation of substantial datasets, while the impact of larger datasets is effectively moderated, it remains a factor in the denominator to encourage the inclusion of high-quality samples. This approach ensures that large, quality datasets are promoted while being subjected to the same rigorous quality assessment as smaller datasets. A weight of 25% is assigned to enforce this criterion and ensure a fair evaluation.

4.1.4. Content Diversity, 4%

In visual malware detection, the primary concerns are detecting variants across different malware families and identifying obfuscated samples. These aspects are crucial for ensuring that a model is robust enough to detect previously unknown samples and potentially zero-day threats. As malware detection capabilities advance, malware evasion techniques evolve in parallel. Therefore, it is essential to include a diverse range of malware samples from multiple malware families and obfuscated samples in the dataset. To incentivize this diversity, a weight of 4% is applied. The scoring for this metric is binary: a score of 1 is assigned if more than 20 malware families are represented, while a score of 0 is given if fewer than 20 malware families are present.

The threshold-based binary scoring for content diversity is designed to distinguish datasets that support generalizable malware classification from those that do not. Widely used benchmarks such as Malimg, Malevis, and MalNet-Tiny all contain more than 20 families, whereas smaller or imbalanced datasets cannot reliably support robustness or zero-day evaluation. Because family diversity in this context determines whether a dataset meets the minimum conditions for meaningful DL experimentation, the metric is treated as a categorical condition rather than a continuous variable. This prevents misleading partial-credit effects and ensures fair comparison across heterogeneous datasets.

4.1.5. Recency of Data, 3%

In addition to including a diverse range of malware families, it is equally important to incorporate recent developments in malware. New malware often exploits newly discovered vulnerabilities, strives to bypass detection mechanisms, and evades analysis in sandbox environments. Including recent samples is essential to counteracting contemporary anti-virus evasion techniques and addressing new malware trends. A weight of 3% is assigned to emphasize the importance of incorporating recent developments while maintaining a balanced consideration of other impactful metrics. The dates considered for this score are based on the year of the published dataset paper. The scoring system is designed to favor more recent datasets. Datasets published within the past 0–5 years score a 1 in this category. Those published between 6 and 10 years ago score 0.8, datasets from 11 to 15 years ago score 0.6, and datasets from 16 to 20 years ago score 0.4.

4.1.6. Need for Processing, 3%

Preprocessing datasets in data science can often be more labor-intensive than running experiments and developing models. Consequently, visual malware datasets should be produced as concisely as possible to enhance research efficiency. The scoring system for this criterion is binary: a score of 1 indicates that little or no preprocessing is required, whereas a score of 0 signifies significant preprocessing requirements to implement the dataset in AI architectures. For example, a dataset that is not inherently visual and requires conversion to a binary image representation should score a 0 due to the extensive preprocessing needed. A weight of 3% is assigned to underscore the importance of concise dataset creation while acknowledging the impact of other, more significant metrics.

Table 2 summarizes the metrics used alongside their assigned weight. The algorithm that informs the framework is now formally defined. Given a set of datasets D, for each dataset

d \in D

, we define the following:

$B (d)$ : Average BRISQUE score of the dataset d, where a lower score indicates higher image quality.
$I (d)$ : Inverted and normalized BRISQUE score, defined as $Q = 1 - \frac{B_{img}}{B_{\max}}$ , where higher scores reflect better quality.
$L (d)$ : Binary indicator for clear labels, where $L (d) = 1$ indicates clear labels, and $L (d) = 0$ otherwise.
$S (d)$ : Size of the dataset d, measured as the number of samples in the dataset.
$N (d)$ : Normalized dataset size, calculated as $S_{dataset} = \frac{log (D)}{log (D_{\max})}$ , applying a logarithmic scaling to size.
$C (d)$ : Content diversity score, with $C (d) = 1$ indicating sufficient diversity, and $C (d) = 0$ otherwise.
$R (d)$ : Recency of data score, reflecting the timeliness of the dataset.
$P (d)$ : Preprocessing indicator, where $P (d) = 1$ indicates no preprocessing is needed, and $P (d) = 0$ otherwise.

The overall MalScore for dataset d is then computed as a weighted sum of these metrics as follows:

\begin{matrix} MalScore (d) = w_{I} \cdot I (d) + w_{L} \cdot L (d) + w_{N} \cdot N (d) \\ + w_{C} \cdot C (d) + w_{R} \cdot R (d) + w_{P} \cdot P (d), \end{matrix}

(3)

where

w_{I}, w_{L}, w_{N}, w_{C}, w_{R}

, and

w_{P}

are the weights attributed to the inverted BRISQUE score, label clarity, normalized dataset size, content diversity, recency of data, and preprocessing need, respectively. These weights are defined as:

w_{I}

= 0.35,

w_{L}

= 0.30,

w_{N}

= 0.25,

w_{C}

= 0.04,

w_{R}

= 0.03, and

w_{P}

= 0.03.

The final MalScore is expressed as a percentage, offering a comparative measure of dataset quality and providing a high-level overview of key features. Improving MalScore performance can contribute to the creation of more meaningful datasets, thereby advancing research in the field of malware detection. Figure 3 underscores the necessity of score normalization within the MalScore algorithm. As illustrated in Figure 3, certain datasets would have gained an undue advantage if raw BRISQUE and dataset size scores were utilized.

4.2. Framework Dataset Application

Following the description of the dataset framework, a practical example is carried out on large, known datasets. Overall, the highest observed BRISQUE score was 203.8536377. Furthermore, the largest dataset analyzed was MalNet Tiny with 87,430 samples.

4.2.1. Malimg

The Malimg dataset achieved an average BRISQUE score of 106.54. After inversion and normalization, this score translates to 0.48. The dataset comprises 9339 samples, yielding a normalized dataset size score of 0.80. It received full scores in the categories of clear labels, content diversity, and minimal preprocessing requirements. However, Malimg scored 0.6 in the recency of data category, as the associated dataset paper was published in 2011, making the dataset 13 years old at the time of this writing. The computed MalScore for the dataset is 76%.

4.2.2. Enriched Malimg (With Benign Class)

The enriched Malimg dataset, which includes an additional benign class, achieved an average BRISQUE score of 105.91. This indicates higher quality benign image samples, leading to an improved IQA average. After inversion and normalization, the IQA score is 0.48. The dataset comprises 10,320 samples, resulting in a normalized dataset size score of 0.81. As with the base Malimg dataset, it received full scores in the categories of clear labels, content diversity, and minimal preprocessing requirements. Similarly, the associated paper is the same as that of the Malimg dataset, resulting in a score of 0.6 in the recency of data metric. The calculated MalScore for these dataset is also 76%.

4.2.3. Malevis

The Malevare set features an improved average BRISQUE score of 84.85, which, after inversion and normalization, results in a score of 0.58. The dataset comprises 14,226 samples, yielding a normalized dataset size score of 0.84. It receives full scores in the categories of clear labeling, content diversity, recency of data, and minimal preprocessing requirements. The overall MalScore for the dataset is 81%. This score represents an improvement over the Malimg dataset, attributable to the inclusion of a greater number of higher quality samples.

4.2.4. MalNet Tiny

The Malnet Tiny subset achieved the best average BRISQUE score of 27.94, resulting in an inverted and normalized BRISQUE score of 0.86. It is the largest and best-performing dataset used, with 87,429 samples, translating to a fully normalized dataset size score of 1.0. These dataset excels in all defined metrics, achieving the highest MalScore of 95%.

4.2.5. NARAD “Malicious Images” Subfolder

The NARAD dataset has an average BRISQUE score of 65.63, resulting in an inverted and normalized score of 0.68. The dataset includes 11,919 samples, normalized to a dataset size score of 0.83. Unlike the other datasets, NARAD lacks clear labeling schemes, making classification challenging. Consequently, it scores 0 in the content diversity metric, as the diversity of samples cannot be determined. The dataset receives full scores in the data recency and minimal preprocessing requirements metrics. It achieves the lowest MalScore of 50%, indicating significant room for improvement.

4.2.6. Malware as Images Dataset Evaluation

This smaller dataset has an average BRISQUE score of 61.82, resulting in an inverted and normalized score of 0.70. Due to the presence of duplicate samples processed at different resolutions, only one processing method was considered. The subset includes 136 samples, yielding a significantly smaller normalized dataset size score of 0.43. Despite its small size, it maintains clear labeling schemes, features content diversity, and was developed within the past five years. The dataset incurs a penalty in the need for processing metric, as significant cropping is required to prepare the dataset for input into DL architectures. Its final MalScore is 72%.

It is important to note that due to the nature of BRISQUE results having no upper limit, relative normalization must be used. Consequently, the MalScore formula cannot be applied to an individual dataset in isolation. The provided framework inherently compares datasets and requires at least two inputs for calculation. Figure 4 presents the final results of MalScore applied across the five described datasets.

5. Discussion

Table 3 provides a detailed overview of the key areas of future work in visual malware detection research. The figure highlights the main focus areas, including Dataset Curation, IQA, and DL models, with specific tasks outlined for each area to guide future research and development.

5.1. Advancing IQA and Model Development for Future MalScore Applications

This study serves as a preliminary bridge between the fields of IQA and visual malware detection. By incorporating IQA algorithm scores during dataset curation, more accurate models can be trained. Although a baseline association has been established, further research is necessary to enhance the statistical significance of the findings. Firstly, additional blur levels should be applied to existing datasets to introduce more data points within the calculations. This increased data would facilitate a more detailed statistical assessment of the relationship between IQA scores and blur levels, thereby determining the blur sensitivity of the IQA algorithm at a finer granularity.

Moreover, a broader range of datasets should be introduced. For instance, Microsoft BIG [30] is a well-known dataset in the visual malware detection literature, containing nearly half a terabyte of disassembly and bytecode for over 20,000 malware samples, while these dataset is an excellent resource, it lacks image representations of malware, necessitating manual conversion. Future research could involve representing these dataset as images with various DPI settings to investigate the relationship between DPI and IQA algorithm scores.

Further work with CNNs can also be conducted. As more suitable datasets for AI applications become available, the application of CNNs and other DL architectures can generate additional statistical data. These data points would help to further confirm the linear relationship between BRISQUE scores and CNN performance. This study focused on validation accuracy at the end of each epoch. Future studies should include additional metrics such as precision, recall, and F1 score. The inclusion of these metrics would provide more data points, which could influence the p-values of the Pearson and Spearman correlation metrics. As more data are collected with increasingly significant p-values, the findings will become more statistically robust.

Additionally, additional IQA algorithms should be investigated or specifically developed for dataset curation purposes. Current algorithms focus on quantifying the “naturalness” of an image, a feature not typically present in the visual representation of malware, while BRISQUE performed well during experiments as a general-purpose algorithm, other general-purpose algorithms should be evaluated for their applicability. Further research in IQA can be directed towards assessing the image quality of non-traditional images, such as those found in cartoon scenes, and understanding their implications in the context of malware images. This research could potentially lead to the development of specialized IQA algorithms tailored for specific image or video types.

Lastly, a specific algorithm designed to identify the issues present in an image could be developed. Identifying issues such as blur, ringing, or blocking within a dataset could guide the creation of a visual malware IQA algorithm. This issue detector would need to be calibrated in conjunction with other IQA algorithms to accurately interpret the computer vision’s approach to these datasets.

5.2. Positioning MalScore for Evolving Visual Malware Benchmarks

In addition to the widely used visual malware datasets examined in this study, including Malimg, Malevis, MalNet-Tiny, NARAD, and the 2023 Malware as Images corpus, several newer image-based malware datasets have recently emerged. For example, Bruzzese [31] introduced a 2024 dataset constructed from VirusShare binaries and converted into visual representations for modern ML baselines such as CoAtNet. More recently, Makkawy et al. proposed MalVis, a large-scale Android visualization dataset released in 2025 that transforms APK and bytecode artifacts into consistent RGB malware images [32]. These datasets represent promising additions to the visual malware literature; however, they appeared after the completion of our experimental phase and are not yet broadly adopted or standardized. Because many newly released collections still require substantial preprocessing to generate consistent image representations, we identify them as natural candidates for future applications of the MalScore framework. Our methodology is intentionally designed to generalize to such emerging datasets as they mature and become more widely integrated into DL-based malware detection research.

5.3. Limitations of Gaussian Blur and Future Real-World Distortion Modeling

Although Gaussian blur provides a clean and mathematically controlled way to introduce monotonic degradation for analyzing IQA sensitivity, it does not fully capture the types of distortions naturally found in visual malware datasets. Real-world degradation more commonly arises from preprocessing artifacts such as interpolation effects, aggressive downsampling, compression noise, and inconsistencies in image conversion pipelines. Therefore, the use of Gaussian blur should be viewed as a methodological probe rather than a literal simulation of operational conditions. Future work will incorporate distortions directly observed in practical malware image corpora to better approximate real-world quality issues and to evaluate how IQA algorithms respond to these domain-specific artifacts.

5.4. MalScore Weighting Strategy and Path Toward Optimization

It is important to note that these weights were not empirically optimized. Rather, they reflect deliberate, design-driven choices appropriate for introducing MalScore. The primary objective of this study is to formalize the structure of the metric and demonstrate its feasibility across representative datasets, not to provide a fully calibrated scoring model. The larger weights assigned to image quality, label clarity, and dataset size follow established practices in visual malware detection, where these factors have a dominant impact on model effectiveness. Conversely, diversity, recency, and preprocessing requirements are incorporated as secondary yet meaningful considerations. A rigorous, data-driven calibration of the weights would require a substantially larger and more standardized collection of visual malware datasets than currently exists; thus, weight optimization is identified as future work. The framework is intentionally designed to be extensible so that the research community may refine or empirically tune these weights as additional datasets become available.

6. Conclusions

The scarcity of high-quality datasets remains a significant challenge in the field of visual malware detection, impacting the performance and reliability of detection models. To address this issue, this paper bridges the gap between Image Quality Assessment (IQA) and visual malware detection by proposing the MalScore framework. This framework provides a standardized method to evaluate and develop high-quality visual malware datasets. By considering factors such as dataset size, recency, and BRISQUE scores, the MalScore framework can enhance the efficacy of malware detection models. Future work should focus on creating updated, diverse datasets that include obfuscated and adversarial samples, and developing specialized IQA algorithms tailored for high-entropy malware images. The scripts used in the experiments are available in [29].

Author Contributions

Conceptualization, J.C.; methodology, J.C., M.R. and T.H.; validation, J.C., T.H. and A.C.; formal analysis, J.C. and T.H.; resources, T.H.; data curation, J.C.; writing—original draft preparation, J.C. and T.H.; writing—review and editing, M.R., T.H. and A.C.; supervision, T.H.; project administration, T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Yu, G.; Wang, X.; Wang, Q.; Bi, T.; Dong, Y.; Liu, R.P.; Georgalas, N.; Reeves, A. Toward Web3 applications: Easing the access and transition. IEEE Trans. Comput. Soc. Syst. 2024, 11, 6098–6111. [Google Scholar] [CrossRef]
Park, J.; Cho, D.; Lee, J.K.; Lee, B. The economics of cybercrime: The role of broadband and socioeconomic status. ACM Trans. Manag. Inf. Syst. (TMIS) 2019, 10, 1–23. [Google Scholar] [CrossRef]
Georgoulias, D.; Yaben, R.; Vasilomanolakis, E. Cheaper than you thought? a dive into the darkweb market of cyber-crime products. In Proceedings of the 18th International Conference on Availability, Reliability and Security, Benevento, Italy, 29 August–1 September 2023; pp. 1–10. [Google Scholar]
Chimezie Obidiagha, C.; Rahouti, M.; Hayajneh, T. DeepImageDroid: A Hybrid Framework Leveraging Visual Transformers and Convolutional Neural Networks for Robust Android Malware Detection. IEEE Access 2024, 12, 156285–156306. [Google Scholar] [CrossRef]
Tayyab, U.e.H.; Khan, F.B.; Durad, M.H.; Khan, A.; Lee, Y.S. A survey of the recent trends in deep learning based malware detection. J. Cybersecur. Priv. 2022, 2, 800–829. [Google Scholar] [CrossRef]
Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware images: Visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, Pittsburgh, PA, USA, 20 July 2011; pp. 1–7. [Google Scholar]
Aslan, Ö.; Yilmaz, A.A. A new malware classification framework based on deep learning algorithms. IEEE Access 2021, 9, 87936–87951. [Google Scholar] [CrossRef]
Shahid, M.; Rossholm, A.; Lövström, B.; Zepernick, H.J. No-reference image and video quality assessment: A classification and review of recent approaches. EURASIP J. Image Video Process. 2014, 2014, 40. [Google Scholar] [CrossRef]
Ying, Z.; Niu, H.; Gupta, P.; Mahajan, D.; Ghadiyaram, D.; Bovik, A. From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3575–3585. [Google Scholar]
He, K.; Kim, D.S. Malware detection with malware images using deep learning techniques. In Proceedings of the 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, 5–8 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 95–102. [Google Scholar]
Xing, X.; Jin, X.; Elahi, H.; Jiang, H.; Wang, G. A malware detection approach using autoencoder in deep learning. IEEE Access 2022, 10, 25696–25706. [Google Scholar] [CrossRef]
Wang, F.; Al Hamadi, H.; Damiani, E. A visualized malware detection framework with CNN and conditional GAN. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6540–6546. [Google Scholar]
Kasarapu, S.; Shukla, S.; Hassan, R.; Sasan, A.; Homayoun, H.; PD, S.M. Cad-fsl: Code-aware data generation based few-shot learning for efficient malware detection. In Proceedings of the Great Lakes Symposium on VLSI 2022, Irvine, CA, USA, 6–8 June 2022; pp. 507–512. [Google Scholar]
Phan, T.D.; Duc Luong, T.; Hoang Quoc An, N.; Nguyen Huu, Q.; Nghi, H.K.; Pham, V.H. Leveraging reinforcement learning and generative adversarial networks to craft mutants of windows malware against black-box malware detectors. In Proceedings of the 11th International Symposium on Information and Communication Technology, Hanoi, Vietnam, 1–3 December 2022; pp. 31–38. [Google Scholar]
Nguyen, H.; Di Troia, F.; Ishigaki, G.; Stamp, M. Generative adversarial networks and image-based malware classification. J. Comput. Virol. Hacking Tech. 2023, 19, 579–595. [Google Scholar] [CrossRef]
Zhai, G.; Min, X. Perceptual image quality assessment: A survey. Sci. China Inf. Sci. 2020, 63, 211301. [Google Scholar] [CrossRef]
Nataraj, L.; Yegneswaran, V.; Porras, P.; Zhang, J. A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, Chicago, IL, USA, 21 October 2011; pp. 21–30. [Google Scholar]
Dinu, C. GitHub Repository for Malware Classification Using Convolutional Neural Networks; GitHub: San Francisco, CA, USA, 2023. [Google Scholar]
Bozkir, A.S.; Cankaya, A.O.; Aydos, M. Utilization and comparision of convolutional neural networks in malware recognition. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. [Google Scholar]
Freitas, S.; Duggal, R.; Chau, D.H. MalNet: A large-scale image database of malicious software. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 3948–3952. [Google Scholar]
Saridou, B.; Rose, J.; Shiaeles, S.; Papadopoulos, B. 48,240 Malware Samples and Binary Visualisation Images for Machine Learning Anomaly Detection; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
Fields, M. Malware as Images. 2023. Available online: https://www.kaggle.com/datasets/matthewfields/malware-as-images (accessed on 10 November 2025).
Nativ, Y.; Lahad Ludar, S.S. theZoo—A Live Malware Repository. [Online]. 2024. Available online: https://github.com/ytisf/theZoo (accessed on 10 November 2025).
Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef] [PubMed]
Wang, J.; Chan, K.C.; Loy, C.C. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2555–2563. [Google Scholar]
Kang, L.; Ye, P.; Li, Y.; Doermann, D. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1733–1740. [Google Scholar]
Zhang, W.; Ma, K.; Yan, J.; Deng, D.; Wang, Z. Blind Image Quality Assessment Using A Deep Bilinear Convolutional Neural Network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 36–47. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Czaplicki, J. Capstone-Project. [Online]. 2024. Available online: https://github.com/czaplickijakub/Capstone-Project (accessed on 10 November 2025).
Ronen, R.; Radu, M.; Feuerstein, C.; Yom-Tov, E.; Ahmadi, M. Microsoft malware classification challenge. arXiv 2018, arXiv:1802.10135. [Google Scholar] [CrossRef]
Bruzzese, R. Building visual malware dataset using virusshare data and comparing machine learning baseline model to CoAtNet for malware classification. In Proceedings of the 2024 16th International Conference on Machine Learning and Computing, Shenzhen, China, 2–5 February 2024; pp. 185–193. [Google Scholar]
Makkawy, S.J.; De Lucia, M.J.; Barner, K.E. MalVis: A Large-Scale Image-Based Framework and Dataset for Advancing Android Malware Classification. arXiv 2025, arXiv:2505.12106. [Google Scholar]

Figure 1. Illustration of malware visualization technique. The process of converting a malware’s binary stream into an image, allowing DL models like CNNs to analyze and detect malware based on visual patterns and textures.

Figure 2. Gaussian blur levels on Malevare set sample.

Figure 3. Comparison of raw and normalized BRISQUE and dataset size scores.

Figure 4. Comparison of MalScores across different datasets.

Table 1. Comparison of dataset sizes and malware variety.

Other Datasets
NARAD Subset	11,919 samples, unknown family count
Malware as Images	136 samples, 66 families
Large Scale Datasets
Malimg	9339 samples, 25 families
Malevis	14,226 samples, 25 families
MalNet Tiny	87,429 samples, 43 families

Table 2. Summary of metrics and weights used in the MalScore framework.

Metric	Weight
Inverted and normalized BRISQUE score	35%
Clear labeling scheme	30%
Normalized dataset size	25%
Content diversity	4%
Recency of data	3%
Need for processing	3%

Table 3. Overview of future work in visual malware detection research.

Main Category	Sub-Category	Specific Tasks
Dataset Curation	Synthetic Dataset Generation	Explore generative adversarial networks (GANs) to create high-fidelity malware examples, enhancing model robustness against adversarial attacks.
	Standardization of Image Representation	Develop uniform guidelines for the visual transformation of malware binaries to ensure consistency in model training across different research groups.
IQA	Development of New IQA Algorithms	Design no-reference IQA algorithms specifically optimized for detecting obfuscation techniques in malware visuals.
	Tuning Existing Algorithms for Malware Images	Modify and fine-tune current IQA algorithms to better suit the specific textures and patterns seen in malware images.
DL Models	Enhancing Model Interpretability	Develop techniques to interpret the decision-making process of CNNs in malware detection to increase trust and understandability in AI-driven security systems.
	Exploring Novel Architectures	Investigate the integration of capsule networks and self-attention mechanisms to improve the spatial hierarchy and contextual understanding in malware classification.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Czaplicki, J.; Rahouti, M.; Chehri, A.; Hayajneh, T. MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics. Future Internet 2025, 17, 554. https://doi.org/10.3390/fi17120554

AMA Style

Czaplicki J, Rahouti M, Chehri A, Hayajneh T. MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics. Future Internet. 2025; 17(12):554. https://doi.org/10.3390/fi17120554

Chicago/Turabian Style

Czaplicki, Jakub, Mohamed Rahouti, Abdellah Chehri, and Thaier Hayajneh. 2025. "MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics" Future Internet 17, no. 12: 554. https://doi.org/10.3390/fi17120554

APA Style

Czaplicki, J., Rahouti, M., Chehri, A., & Hayajneh, T. (2025). MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics. Future Internet, 17(12), 554. https://doi.org/10.3390/fi17120554

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MalScore: A Quality Assessment Framework for Visual Malware Datasets Using No-Reference Image Quality Metrics

Abstract

1. Introduction

2. Related Work

3. Experimental Methodology and Discussion

3.1. Benchmark Datasets

3.1.1. Malimg Dataset [17]

3.1.2. Malevis Dataset [19]

3.1.3. MalNet Dataset [20]

3.1.4. The NARAD Dataset [21]

3.1.5. Malware as Images [22]

3.2. Dataset Degradation

3.3. IQA Procedure

3.4. CNN Training and Experimental Configuration

3.5. Results and Discussion

4. Proposed Dataset Framework

4.1. Framework Metrics

4.1.1. Inverted and Normalized BRISQUE Score, 35%

4.1.2. Clear Labeling Scheme, 30%

4.1.3. Normalized Dataset Size, 25%

4.1.4. Content Diversity, 4%

4.1.5. Recency of Data, 3%

4.1.6. Need for Processing, 3%

4.2. Framework Dataset Application

4.2.1. Malimg

4.2.2. Enriched Malimg (With Benign Class)

4.2.3. Malevis

4.2.4. MalNet Tiny

4.2.5. NARAD “Malicious Images” Subfolder

4.2.6. Malware as Images Dataset Evaluation

5. Discussion

5.1. Advancing IQA and Model Development for Future MalScore Applications

5.2. Positioning MalScore for Evolving Visual Malware Benchmarks

5.3. Limitations of Gaussian Blur and Future Real-World Distortion Modeling

5.4. MalScore Weighting Strategy and Path Toward Optimization

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI