Advancing Image Spam Detection: Evaluating Machine Learning Models Through Comparative Analysis

Jamil, Mahnoor; Mihajloska Trpcheska, Hristina; Popovska-Mitrovikj, Aleksandra; Dimitrova, Vesna; Creutzburg, Reiner

doi:10.3390/app15116158

Open Access Article

by Mahnoor Jamil ¹,²,*, Hristina Mihajloska Trpcheska ¹, Aleksandra Popovska-Mitrovikj ¹, Vesna Dimitrova ¹ and Reiner Creutzburg ³,⁴,*

¹ Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, 1000 Skopje, North Macedonia

² School of Graduate Studies, Kadir Has University, Kadir Has Cd, 34083 Istanbul, Turkey

³ School of Technology and Architecture, SRH University of Applied Sciences Heidelberg, D-12059 Berlin, Germany

⁴ Fachbereich Informatik und Medien, Technische Hochschule Brandenburg, D-14770 Brandenburg, Germany

* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(11), 6158; https://doi.org/10.3390/app15116158

Submission received: 11 April 2025 / Revised: 18 May 2025 / Accepted: 26 May 2025 / Published: 30 May 2025

(This article belongs to the Special Issue New Advances in Computer Security and Cybersecurity)

Abstract

Image-based spam poses a significant challenge for traditional text-based filters, as malicious content is often embedded within images to bypass keyword detection techniques. This study investigates and compares the performance of six machine learning models—ResNet50, XGBoost, Logistic Regression, LightGBM, Support Vector Machine (SVM), and VGG16—using a curated dataset containing 678 legitimate (ham) and 520 spam images. The novelty of this research lies in its comprehensive side-by-side evaluation of diverse models on the same dataset, using standardized dataset preprocessing, balanced data splits, and validation techniques. Model performance was assessed using evaluation metrics such as accuracy, receiver operating characteristic (ROC) curve, precision, recall, and area under the curve (AUC). The results indicate that ResNet50 achieved the highest classification performance, followed closely by XGBoost and Logistic Regression. This work provides practical insights into the strengths and limitations of traditional, ensemble-based, and deep learning models for image-based spam detection. The findings can support the development of more effective and generalizable spam filtering solutions in multimedia-rich communication platforms.

Keywords:

spam detection; image spam; machine learning; Support Vector Machine; XGBoost; Logistic Regression; ResNet50; LightGBM; VGG16

1. Introduction

Cybersecurity plays a crucial role in users’ daily lives. Without it, individuals are vulnerable to malicious threats that can compromise their online security, reduce productivity, and harm their privacy. Reducing spam messages can significantly lower the risk of cybercrime and help users feel more confident and secure in performing their daily online activities. Spam can take various forms: text-based, image-based, or hybrid. A newer and more evasive type, known as image-based spam, involves embedding malicious text within images to bypass traditional text-based spam filters. This type of spam is characterized by a more complex structure compared to its text-based counterpart, making detection increasingly difficult. Traditional filtering methods typically rely on text analysis techniques such as keyword matching, pattern recognition, and heuristic rules, which are ineffective against visually embedded content. The widespread use of multimedia on platforms like email and social media has introduced new challenges for spam detection. For instance, attackers frequently use deceptive visuals, such as fake banking login screens or fraudulent promotional banners, to trick users into clicking harmful links. This tactic is commonly employed in financial scams, fake advertisements, and malware distribution, making detection more challenging. Therefore, there is a growing need for a reliable and accurate image-based spam detection system. Although existing studies have explored both traditional and deep learning models, very few have provided comparative performance analysis across a unified image dataset. This research addresses that gap by evaluating various machine learning models for image-based spam detection, focusing on each model’s ability to effectively classify spam and legitimate (“ham”) images. The results aim to lay a strong foundation for developing robust and accurate spam detection systems tailored to visual content. 
Our contributions can be summarized as follows:

We present a comprehensive comparative analysis of six machine learning models—ResNet50, XGBoost, Logistic Regression, LightGBM, VGG16, and SVM—for image spam detection, using a standardized dataset and unified preprocessing pipeline.
We propose a consistent experimental setup that ensures a fair evaluation of all models by applying identical training conditions and evaluation metrics such as AUC, ROC, accuracy, precision, recall, and F1-score.
We tested the models on a new dataset to see how well they work on unseen data, without retraining.

This paper is structured as follows: Section 2 reviews related work, Section 3 describes the methodology, and Sections 4 and 5 present the results and discussion. Challenges are presented in Section 6, and Section 7 concludes the paper.

2. Related Work

Ref. [1] conducted an extensive survey on machine learning techniques for cybersecurity, covering a decade of research. The authors systematically reviewed various ML applications in intrusion detection, spam filtering, and malware detection across computer and mobile networks. Their comprehensive study outlined the effectiveness of different ML methods, including deep learning and traditional classifiers, and highlighted critical evaluation metrics and common cybersecurity datasets. However, the study focused mainly on text-based threats and did not explore the complexities of image-based spam detection, which remains a growing challenge in modern digital communications. As machine learning (ML) models can learn from complex datasets, they have been used more frequently to detect picture spam to address this issue. To detect patterns and identify spam text in images, they can be trained on large datasets. In the research on image spam identification, multiple machine learning models, including Support Vector Machines (SVMs), Naive Bayes, Logistic Regression, and XGBoost, have shown great potential in spam detection [2]. However, most existing studies lack a unified preprocessing pipeline or comparative analysis, making it difficult to assess the relative performance of these models under consistent conditions.

Ref. [3] has shown that machine learning techniques can improve image spam detection by analyzing low-level image texture features. This research evaluated various ML classifiers, including decision trees, Support Vector Machines, Bayesian Networks, and random forests, on publicly available datasets. The authors observed that the random forest model outperformed the other models and achieved 98.6% precision. Despite the strong results, the study’s reliance on basic texture features could limit its robustness. The research carried out in [4] proposed a deep learning approach to image-based spam detection to increase performance. The authors added four sub-models to an existing spam detection model. The study reported that the added sub-models improved the ability to detect spam images, but the resulting system remained computationally heavy and hard to deploy in real time.

The authors in [5] evaluate the effectiveness of ML methods for email spam detection. They discuss how models such as Logistic Regression and Naive Bayes can achieve up to 99% accuracy. The findings highlight the importance of ML in classifying spam versus non-spam messages. However, email-based spam detection differs significantly from image spam detection in how features are handled. In [6], the authors focused on SMS spam detection, proposing a hybrid deep learning model combining convolutional neural networks (CNNs) and gated recurrent units (GRUs). They achieved an accuracy of 99.07% and highlighted the effectiveness of hybrid architectures in handling textual data, emphasizing the importance of optimal hyperparameter tuning for improved spam detection performance. In a detailed study on the weaknesses and capabilities of several models, the authors in [7] applied numerous approaches to a spam email dataset and identified challenges and limitations of ensemble frameworks, such as the need for hyperparameter tuning to achieve good accuracy and performance. Similarly, the research conducted in [8] suggested combining random forest and decision tree models to obtain better accuracy for spam classification. These studies mostly address spam in structured text forms and do not explore spam that blends images and text.

In [9], the authors proposed an approach that combines optical character recognition, natural language processing, and a machine learning algorithm to detect image spam more effectively. This combination improves the models’ performance and also provides a more robust method for analyzing image-based spam content. The research does not address the performance implications of utilizing images that contain styled or formatted text. Another significant contribution to image spam detection comes from [10], in which a new dataset of spam images was introduced, which were challenging to detect with existing methods. The authors found that both PCA and SVM models achieved high accuracy with low computational complexity. More research in spam detection uses deep learning model techniques. In the research conducted in [11], the authors showed that deep learning models such as Long Short-Term Memory (LSTM) and convolutional neural networks (CNNs) can surpass traditional ML methods by automatically extracting features. Additionally, ref. [12] offers an even more accurate alternative compared to methods like Optical Character Recognition and SVM, emphasizing the evolving role of advanced ML algorithms in countering image and email spam. Regarding the image spam filtering methods, in [13], the authors introduced a deep learning-based approach by utilizing pre-trained convolutional neural networks (CNNs), including InceptionV3, DenseNet121, ResNet50, VGG16, and MobileNetV2. Their framework leveraged transfer learning, data augmentation, and replaced the fully connected layers with a Support Vector Machine (SVM) classifier, significantly improving accuracy and computational efficiency. A study found that the SVM performed exceptionally well in detecting image spam, overcoming traditional methods due to its ability to learn from visual data and adapt to various image variations [14]. 
Experiments conducted on standard datasets demonstrated that ResNet50 was the best-performing model, achieving an accuracy of 99.87%. Despite impressive accuracy, the high resource demand of these deep learning models can be a barrier for real-time applications.

Among candidate ML techniques for image spam detection, XGBoost has gained attention for its ensemble learning approach, which merges the outputs of multiple decision trees to improve overall prediction accuracy. However, its hyperparameters are not easy to tune: obtaining high accuracy from XGBoost requires prior knowledge and tuning experience on the part of the researcher, as well as a great deal of time [15]. Lastly, SVM has also been used in spam classification tasks due to its strong handling of high-dimensional data [16]. Several studies have shown that SVM can classify images when combined with feature extraction techniques. However, SVM models require extensive parameter tuning on complex image datasets. These two models allow for adaptive learning, so they can continuously improve accuracy by adapting to new types of spam images. This is crucial given the constantly evolving tactics of spammers [17]. However, studies considering XGBoost and SVM rarely offer comparative benchmarks against deep learning models on the same dataset. To address the gaps identified in prior research, including inconsistent benchmarking, limited comparative evaluation, and limited consideration of processing power, our study evaluated multiple machine learning models under identical conditions to determine which model achieves higher accuracy and fewer false positives.

Comparative Analysis of Machine Learning Models

Supervised learning models typically go through numerous training cycles before they produce the correct output. As the name suggests, a supervisor tells the model what the input is and what the corresponding output should be; in practice, the supervisor trains the model with labels [18]. The models then use these labels for classification. Supervised learning models are trained on a dataset that is divided into a training set and a test set. The training set is the data passed to the model for training, and the test set is used to measure how accurately the model classifies data it has not seen. Some supervised learning models predict continuous target variables, while others predict classes. Logistic Regression is used for binary classification tasks, decision trees are used for both classification and regression, and Support Vector Machines focus on classification by finding the optimal hyperplane that separates the classes.

Furthermore, ensemble learning merges multiple models to create a stronger model with better accuracy. Popular ensemble methods include random forests, which are collections of decision trees used for classification and regression tasks; gradient boosting, which builds strong models by sequentially combining weaker ones for regression problems; and AdaBoost, an ensemble technique that improves weak classifiers for binary classification tasks. These ensemble methods combine the strengths of multiple models to improve performance, making them powerful tools for complex tasks (Figure 1).

For the comparative analysis given in this paper, we selected XGBoost, Support Vector Machine, ResNet50, Logistic Regression, LightGBM, and VGG16 to ensure a comprehensive evaluation of diverse learning paradigms, including classical machine learning, ensemble learning, and deep learning. Each model represents a unique approach to the classification of spam images. ResNet50 was selected for the comparative analysis due to its proven architectural strengths and robust performance in complex image classification tasks. Its ability to deliver high accuracy with minimal false positives has been demonstrated in prior research [13]. Additionally, ResNet50 is a deep residual network: its skip connections and batch normalization make very deep architectures trainable, which enhances its predictive power. Support Vector Machines (SVMs) have also gained prominence as a reliable and widely adopted method for classification, particularly in image-based spam detection, due to their effectiveness in high-dimensional spaces and strong generalization capabilities. In [20], the authors explored the effectiveness of SVMs trained on a diverse set of image features to classify spam content. The research highlights the use of a linear SVM to analyze and quantify the relative importance of various visual features, demonstrating the model’s ability to distinguish between spam and legitimate images with high precision. Drawing from these insights, we selected SVM for our comparative analysis due to its proven performance in image spam detection tasks.

LightGBM is an ensemble model developed by Microsoft and is known for its speed and efficiency. It uses histogram-based learning and a leaf-wise tree growth strategy to enable faster training and lower memory usage than traditional boosting methods. However, as highlighted in [21], LightGBM is sensitive to noisy data. LightGBM is also designed to support distributed and parallel computing and to handle large datasets [22]. These characteristics make LightGBM well suited to image spam detection, where rapid processing of images is required.

Furthermore, XGBoost was created by Chen and Guestrin [23], who developed a robust methodology for regression and classification. XGBoost has featured prominently in numerous Kaggle machine learning competitions for classification. As XGBoost is based on the gradient boosting framework, it iteratively adds new decision trees, each fitted to the errors of the current ensemble, to improve accuracy and performance. It has gained popularity due to its scalability and accuracy, and it incorporates regularization techniques to prevent overfitting. The research in [24] integrated CNNs with gradient-boosting techniques to enhance detection accuracy and achieved an F1-score of 88%. Furthermore, XGBoost and LightGBM have consistently achieved top results in many structured data competitions.

VGG16 is a deep learning model and was selected for this research due to its strong and consistent performance in image classification tasks. Its straightforward layer structure makes it effective at learning patterns from image data, which is important for detecting spam images. In the study by [13], VGG16 achieved the second-highest accuracy and AUC, showing that it performs well compared to other models. This makes VGG16 a reliable choice for comparing it with both traditional and ensemble-based models in our analysis.

Logistic Regression is a highly successful machine learning algorithm that calculates probabilities from discrete and continuous data and classifies newly entered data [25]. Based on the estimated probability, it decides whether a feature vector belongs to a specific class or not. Logistic Regression has demonstrated outstanding performance in spam detection. In the comparison given in [26], it achieved an accuracy of 0.981 and a precision score of 0.972, indicating a strong ability to correctly identify spam images while minimizing false positives. Logistic Regression has also been applied effectively in spam detection frameworks, particularly in multi-modal architectures where it serves as a probabilistic fusion layer. In [27], the authors proposed a CNN-LSTM-based spam filter that integrates outputs via logistic regression, achieving over 98% accuracy on hybrid image/text datasets. These results highlight Logistic Regression as a competitive traditional model, offering both simplicity and reliability. Therefore, this model is included in our comparative analysis.

These models were selected to represent a range of classification approaches: traditional (SVM, Logistic Regression), deep residual (ResNet50), ensemble-based (XGBoost, LightGBM), and deep learning (VGG16). This allows for a balanced evaluation across algorithmic families. By incorporating these models, the research aims to compare their accuracy, false positive rates, and effectiveness in identifying true positives, which are key metrics for evaluating the reliability and robustness of image spam detection.

3. Methodology

Machine learning is essential for image spam detection because traditional rule-based methods struggle to identify spam embedded in images. Unlike text-based spam, where keywords and patterns are easy to analyze, image spam disguises malicious content within visuals, making it harder for traditional filters to detect. In addition, ML models continuously adapt to evolving spam tactics, improving detection accuracy and reducing false positives. In this paper, the performance of six machine learning models is evaluated.

3.1. Dataset and Preprocessing

3.1.1. Dataset Description

For our experiments, a dataset was used to train and test an image-based spam model. The original dataset was obtained from [28] and consists of 928 spam images collected from real spam emails and over 800 ham (non-spam) images. The original record has 4 columns: Serial Number (the row number), File Name, Image Size, and Label (binary; 0 and 1 denoting ham and spam, respectively). The dataset was later carefully scrutinized and finalized with 678 ham images and 520 spam images as uploaded at [29]. As displayed in Figure 2, data augmentation was also conducted to artificially increase the diversity of a training dataset by applying realistic transformations to existing images [30]. A total of 301 additional samples were generated by data augmentation, which consisted of 150 ham and 151 spam images.

To proceed with the training and testing process, we used a new dataset, ensuring a balanced distribution for effective model evaluation. The dataset was split into 70% training and 30% testing data. The augmented images were subsequently added only to the training dataset to improve generalization and prevent overfitting. Ham images represent benign content, while spam images include advertisements, scams, or irrelevant promotional content. Figure 3 shows some ham images in their original RGB format; to enable more efficient processing in the classification pipeline, they are subsequently converted to grayscale, as shown in Figure 4.
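The split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper `stratified_split`, the seed, and the use of bare labels in place of image files are assumptions, while the class counts (678 ham, 520 spam, 150 + 151 augmented samples) come from the text.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_idx, test_idx) so that both parts keep the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts reported in the text: 678 ham (label 0) and 520 spam (label 1).
labels = [0] * 678 + [1] * 520
train_idx, test_idx = stratified_split(labels)

# Augmented samples (150 ham, 151 spam) are appended to the training part only,
# so the test set contains original images exclusively.
train_labels = [labels[i] for i in train_idx] + [0] * 150 + [1] * 151
test_labels = [labels[i] for i in test_idx]

print("train:", Counter(train_labels))
print("test: ", Counter(test_labels))
```

Keeping augmented images out of the test set matters because an augmented copy of a training image would otherwise leak near-duplicates into evaluation.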

Figure 5 illustrates the original RGB format of some spam images, which are also subsequently converted to grayscale, as shown in Figure 6.

3.1.2. Preprocessing

To ensure a fair and consistent comparison, all models were trained using a standardized preprocessing pipeline and feature space. Images were resized to 64 × 64 pixels, converted to grayscale, and then flattened into 1D vectors to serve as input features. The StandardScaler was applied to normalize the feature set, ensuring zero mean and unit variance across pixel values. To address high dimensionality and improve computational efficiency, Principal Component Analysis (PCA) was employed for dimensionality reduction. While flattening simplifies the input for traditional machine learning models, it eliminates the spatial structure of images, which may degrade performance. Nevertheless, this approach was adopted to maintain uniform conditions across all models. An exception was made for ResNet50 and VGG16, which were evaluated using RGB images resized to 224 × 224 pixels, in accordance with their pre-trained configurations optimized for color image inputs.
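A rough sketch of this pipeline for the traditional models is given below. It uses NumPy only, so the standardization and PCA steps are hand-written stand-ins for scikit-learn's StandardScaler and PCA. The random input images (assumed to be already resized to 64 × 64), the grayscale weights, and the choice of 10 components are assumptions, since the text does not state them.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert (..., H, W, 3) RGB arrays to grayscale with BT.601 luma weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def standardize(X, eps=1e-12):
    """Scale every feature (pixel) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std < eps, 1.0, std)  # leave constant pixels unscaled
    return (X - mean) / std

def pca_reduce(X, n_components):
    """Project X onto its leading principal components using SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

rng = np.random.default_rng(0)
# Stand-in data: 20 random RGB images, assumed to be already resized to 64 x 64.
images = rng.integers(0, 256, size=(20, 64, 64, 3)).astype(np.float64)

gray = to_grayscale(images)           # shape (20, 64, 64)
flat = gray.reshape(len(gray), -1)    # shape (20, 4096): one 1D vector per image
scaled = standardize(flat)
features = pca_reduce(scaled, n_components=10)  # component count is an assumption
print(features.shape)
```

In a real experiment the scaler and PCA would be fitted on the training split only and then applied to the test split, so that no test statistics leak into training.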

3.2. Models

Support Vector Machine (SVM)

Parameters: The model was trained with class_weight = ‘balanced’ to address class imbalance and probability = True to enable probability-based predictions. A linear kernel was used.

Model Training: An SVM classifier was trained using the linear kernel to optimize the decision boundary for spam vs. ham classification.

Logistic Regression

Parameters: Logistic Regression was trained with max_iter = 1000 to ensure convergence, particularly due to the complexity of the dataset.

Model Training: The model was trained to classify spam and ham images, applying a linear decision boundary.
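The two linear classifiers described above could be configured in scikit-learn roughly as follows. Only the parameters named in the text (linear kernel, class_weight, probability, max_iter) are taken from the paper. The synthetic feature matrix, the random seed, and the variable names are assumptions made so the sketch runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the PCA features: two shifted Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, (60, 10)), rng.normal(2.0, 1.0, (40, 10))])
y = np.array([0] * 60 + [1] * 40)  # 0 = ham, 1 = spam

models = {
    "SVM": SVC(kernel="linear", class_weight="balanced",
               probability=True, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

spam_probabilities = {}
for name, model in models.items():
    model.fit(X, y)
    # Column 1 of predict_proba holds the estimated probability of class 1 (spam).
    spam_probabilities[name] = model.predict_proba(X)[:, 1]
    print(name, "training accuracy:", model.score(X, y))
```

The spam probabilities are what ROC and AUC computations consume, which is why probability = True is needed for the SVM.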

XGBoost

Parameters: XGBoost was trained with max_depth = 6 to balance training time and model complexity.

Model Training: XGBoost, a gradient-boosting model, was trained to classify spam and ham images using ensemble learning to improve accuracy.

LightGBM

Parameters: LightGBM was trained with boosting_type = ‘gbdt’, learning_rate = 0.05, and max_depth = 7. The model utilized 100 estimators and was trained over 20 epochs with a batch size of 32 to balance training efficiency and model accuracy.

Model Training: LightGBM, a gradient boosting model, was trained to classify spam and ham images. It used an ensemble of decision trees to improve classification accuracy through boosting.

VGG16

Parameters: VGG16 was fine-tuned with a pre-trained ImageNet base. The convolutional layers were frozen, and more fully connected layers were added to classify spam and ham images. Training was carried out with a batch size of 32 and 10 epochs.

Model Training: VGG16, a CNN, was used for classifying spam and ham images. The pre-trained weights from ImageNet allowed for effective extraction of features, and the model was fine-tuned with additional dense layers for classification. As VGG16 is optimized for RGB images of size 224 × 224, this was utilized for training the model.

ResNet50

Parameters: ResNet50 was initialized with pre-trained ImageNet weights, the base model layers were frozen to retain learned features, and a custom head was added, which consisted of a GlobalAveragePooling2D layer, followed by a dense layer with 128 ReLU units.

Model Training: ResNet50, a deep convolutional neural network, is highly utilized for transfer learning and classification tasks. It can extract high-level image features and is trained for the binary classification task of distinguishing between spam and ham images. ResNet50 is also optimized for 224 × 224 RGB images and these images were used for training the model.

3.3. Evaluation and Performance Metrics

Regarding model evaluation, a 5-fold cross-validation approach was used to assess performance. This method splits the data into five folds while maintaining the class distribution of spam and ham images. Each model was trained and tested across five iterations, with one fold held out for testing and the others used for training in each round. The final performance metrics are the average scores across all folds. This approach reduces the dependence of the results on any single partition and provides a more robust estimate of the model’s performance than a single train/test split.
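The fold construction can be illustrated with a small pure-Python sketch. The helper `stratified_k_fold` and the seed are hypothetical. A library routine such as scikit-learn's StratifiedKFold would normally be used instead.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which every fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal each class round-robin
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx

# Class counts from the dataset description: 678 ham (0) and 520 spam (1).
labels = [0] * 678 + [1] * 520
for fold_number, (train_idx, test_idx) in enumerate(stratified_k_fold(labels), start=1):
    spam_in_fold = sum(labels[i] for i in test_idx)
    print(f"fold {fold_number}: {len(test_idx)} test images, {spam_in_fold} spam")
```

A model would be fitted on `train_idx` and scored on `test_idx` in each round, and the five scores averaged.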

Furthermore, the following metrics were used for the evaluation of the models:

Accuracy: Identifies the percentage of correctly classified instances.
AUC (Area Under the Curve): Measures the model’s ability to distinguish between spam and ham images. A larger AUC value indicates better performance.
Precision, Recall, and F1-Score: These metrics are particularly useful for imbalanced datasets, as they show how many of the images flagged as spam are actually spam (precision), how many of the spam images are caught, i.e., how well false negatives are avoided (recall), and the harmonic mean of precision and recall (F1-score).
ROC Curve: Plots true positive rate against false positive rate to evaluate classifier performance. A curve closer to the top left indicates better classification.
AUC Variance: Estimated through bootstrapping, it indicates the consistency of the model’s performance, with lower variance suggesting greater reliability.
Confusion Matrix: The confusion matrix highlights the performance of the model. It displays the number of true positives, true negatives, false positives, and false negatives.
Classification Report: Provides a comprehensive view of the model’s performance across various metrics like precision, recall, F1-score, and support.
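The metrics listed above can be written out in a few lines of plain Python. This is an illustrative sketch, not the evaluation code used in the study: the function names, the toy labels and scores, and the 200 bootstrap rounds are assumptions. The AUC is computed through its ranking interpretation, and its variance is estimated by resampling with replacement.

```python
import random

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auc_score(y_true, scores):
    """AUC as the probability that a random spam image outscores a random ham image."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_variance(y_true, scores, n_rounds=200, seed=0):
    """Sample variance of the AUC over bootstrap resamples of the test set."""
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_rounds:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample contains only one class
        aucs.append(auc_score(ys, [scores[i] for i in idx]))
    mean = sum(aucs) / len(aucs)
    return sum((a - mean) ** 2 for a in aucs) / (len(aucs) - 1)

# Toy example: 1 = spam, 0 = ham, scores are predicted spam probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]
print(confusion_metrics(tp=3, tn=3, fp=1, fn=1))
print("AUC:", auc_score(y_true, scores))
print("AUC variance:", bootstrap_auc_variance(y_true, scores))
```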

3.4. Experimentation Setup

Our experiments were carried out using the Google Colab environment (https://colab.research.google.com/, accessed on 4 March 2025). Running the command !cat /proc/cpuinfo revealed that the system is equipped with an Intel Xeon CPU @ 2.20 GHz, consisting of two processing units (siblings) with one core per processor. The processor operates at 2200.162 MHz and includes hyper-threading capabilities. Although security vulnerabilities like Spectre, Meltdown, and L1TF are present, mitigations may be in place. The system architecture supports 46-bit physical and 48-bit virtual addressing, facilitating efficient memory management.

4. Results

The results given in Figure 7 demonstrate the effectiveness of each model in detecting spam images. In addition, Figure 8 shows the detection performance for non-spam (ham) images. The corresponding misclassification results are shown in Figure 9 for spam images and Figure 10 for non-spam images. The classification outcomes on the 30% test set indicate that ResNet50 achieved the highest overall accuracy, correctly identifying 152 spam and 191 ham images, with only 5 spam and 12 ham samples misclassified. XGBoost also showed strong performance, correctly detecting 145 spam and 189 ham images and misclassifying 11 spam and 15 ham samples. VGG16 performed well in spam detection, with 151 spam and 157 ham images correctly classified, but it misclassified considerably more ham images than the other models (47 ham, 5 spam). Logistic Regression demonstrated a balanced performance, correctly classifying 136 spam and 176 ham images, with moderate misclassification of 20 spam and 28 ham. LightGBM and SVM produced comparable outcomes, correctly detecting 128 and 133 spam images and 171 and 170 ham images, respectively. However, both models had higher misclassification counts: LightGBM misclassified 28 spam and 33 ham images, whereas SVM misclassified 24 spam and 33 ham images.
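As a worked example, the ResNet50 counts quoted above can be turned into the standard metrics. The assignment is an assumption made for illustration: spam is treated as the positive class, so TP = 152, TN = 191, FN = 5 (misclassified spam) and FP = 12 (misclassified ham), giving 360 test images. The values are derived from these counts only and may differ from cross-validated averages reported elsewhere in the paper.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{152 + 191}{360} \approx 0.953 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{152}{152 + 12} \approx 0.927 \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{152}{152 + 5} \approx 0.968 \\
F_1 &= \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2 \cdot 152}{2 \cdot 152 + 12 + 5} \approx 0.947
\end{align*}
```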

4.1. ROC Curve

The ROC curve is a graphical representation used to evaluate the performance of a binary classifier. It plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. A curve that bows towards the upper-left corner indicates stronger model performance, as it reflects a higher rate of correctly identified positives and a lower rate of false alarms.
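To make the construction concrete, the sketch below sweeps the decision threshold over a handful of toy scores and collects the resulting (FPR, TPR) points, then integrates them with the trapezoidal rule. The function names and the toy data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 1 = spam, 0 = ham, scores are predicted spam probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print("AUC:", trapezoid_auc(points))
```

The diagonal from (0, 0) to (1, 1) corresponds to random guessing and has an area of 0.5, which is the baseline against which the curves are judged.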

As illustrated in Figure 11, models such as ResNet50 and XGBoost produce ROC curves that closely approach the top-left corner, indicating high sensitivity and specificity and thus excellent classification performance. These curves exhibit a steep rise, reflecting a strong ability to identify spam images while minimizing false alarms. Intermediate models such as VGG16, Logistic Regression, LightGBM, and SVM produce more moderate ROC curves, performing better than random guessing but with varying degrees of trade-off between true positive and false positive rates. The visual distinction between these curves highlights how effectively each model processes visual features from the image dataset and underscores the impact of model architecture on classification capability.

4.2. Training and Validation Loss Curves

The training and validation loss curves illustrated in Figure 12 show how each model learns and generalizes over time. Deep learning models like ResNet50 and VGG16 show a steep decline in both training and validation loss, with curves that closely align, indicating efficient learning and minimal overfitting. ResNet50, in particular, exhibits the lowest loss values across all epochs, converging quickly and maintaining stability, which reflects strong feature learning and generalization. In contrast, traditional models such as Logistic Regression and SVM demonstrate slower loss reduction and higher validation losses, suggesting limited ability to capture the complex patterns in image-based spam detection. The ensemble models, XGBoost and LightGBM, show consistent performance with gradual loss reduction, though LightGBM displays a slightly wider gap between training and validation loss, hinting at some overfitting. In terms of overall performance based on the loss curves, ResNet50 ranks highest, demonstrating superior convergence and minimal generalization error. XGBoost follows closely, with low and stable losses and strong consistency between training and validation. VGG16 ranks third, performing well but showing slightly more sensitivity to validation fluctuations. LightGBM comes next, showing good learning but with signs of mild overfitting. Traditional models trail behind, with Logistic Regression outperforming SVM, although both show limited adaptability to the image classification task. This ranking aligns well with the quantitative results reported in this paper, including AUC scores and confidence intervals, further validating ResNet50 and XGBoost as the most robust models for image spam detection.

5. Discussion

As highlighted in Table 1, the models were evaluated using confusion matrices, AUC scores, and other relevant metrics. The confusion matrix entries are expressed as percentages of the total dataset. The detailed results for each model are as follows:

Support Vector Machine (SVM)

The SVM model demonstrated a robust ability to distinguish between spam and non-spam instances, achieving an AUC of 0.91. This indicates a high probability that the model will rank a randomly chosen positive instance (spam) higher than a negative one (non-spam). The confusion matrix reveals a true positive (TP) rate of 37% and a true negative (TN) rate of 47%, suggesting balanced performance in identifying both classes. The SVM model performed better with a high AUC when a linear kernel was utilized, indicating good discrimination between spam and ham images.

However, the false positive (FP) rate of 9% and false negative (FN) rate of 7% indicate that the model occasionally misclassifies non-spam as spam and misses some spam instances. The number of false positives and false negatives suggests that the model can be further optimized, though it demonstrated robust performance overall. The AUC variance of 0.000277 indicates that its performance is consistent.
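The ranking interpretation of AUC given above can be checked directly: the AUC equals the fraction of (spam, ham) pairs in which the spam image receives the higher score, with ties counted as one half. The following sketch uses made-up scores and a hypothetical helper name; it is not the evaluation code of this study.

```python
from typing import List


def pairwise_auc(spam_scores: List[float], ham_scores: List[float]) -> float:
    """Probability that a random spam image outscores a random ham image."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5  # ties count as half a win
    return wins / (len(spam_scores) * len(ham_scores))


# Toy scores (illustrative only): one spam image is ranked below one ham image.
spam_scores = [0.95, 0.80, 0.60]
ham_scores = [0.70, 0.30, 0.10]
ranking_auc = pairwise_auc(spam_scores, ham_scores)  # 8 of 9 pairs are ordered correctly
print(round(ranking_auc, 3))
```

Under this reading, an AUC of 0.91 means that roughly 91 out of 100 randomly drawn spam and ham pairs are ordered correctly by the classifier's score.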

Logistic Regression

Logistic Regression achieved better performance than SVM, with an AUC score of 0.94. The model showed a TP rate of 38% and a TN rate of 49%, with FP and FN rates of 8% and 6%, respectively. These metrics suggest that Logistic Regression is effective in reducing the number of false negatives. The low AUC variance (0.000171) also highlights the model’s reliability across different data subsets. Logistic Regression’s consistent performance makes it an excellent baseline model.

XGBoost

XGBoost demonstrated robust performance in image-based spam detection, achieving a TP rate of 40% and a TN rate of 52%, with only 3% FN and 4% FP. The model attained a high AUC score of 0.97 with a low AUC variance of 0.000049, indicating consistent performance across different data splits. These results highlight XGBoost’s strong ability to correctly distinguish between spam and non-spam images with minimal misclassification. Although slightly outperformed by ResNet50, XGBoost remains a competitive and interpretable ensemble-based model, particularly well suited for structured visual data tasks.

LightGBM

LightGBM delivered moderate performance with a TP rate of 36% and a TN rate of 47%. The model exhibited 8% FN and 9% FP, reflecting a balanced approach to spam detection. It achieved an AUC score of 0.91 and an AUC variance of 0.000215, suggesting stable classification capability with room for improvement in reducing misclassification. Although it did not outperform other ensemble-based models, like XGBoost, LightGBM demonstrated good discrimination between spam and non-spam images.

ResNet50

ResNet50 outperformed all other models by achieving a TP rate of 42% and a TN rate of 53%, with only 1% FN and 3% FP. The model recorded the highest AUC score of 0.99 and the lowest AUC variance of 0.000009, indicating exceptional classification ability.

VGG16

VGG16 also demonstrated strong performance, matching ResNet50 with a TP rate of 42% and a minimal FN rate of 1%. However, its TN rate was comparatively lower at 44%, and it produced a higher false positive rate of 13%, suggesting reduced specificity in classifying non-spam (ham) content. The model attained an AUC score of 0.93 with a variance of 0.000167, reflecting reliable but less stable performance relative to ResNet50.
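Because Table 1 reports the confusion matrix entries as shares of the whole dataset, per-class rates such as recall, specificity, and precision have to be derived from them. The sketch below does this for the rounded values in Table 1; the dictionary and function names are introduced here for illustration, and the derived figures are approximate because the inputs are rounded percentages.

```python
from typing import Dict, Tuple

# (TP, FN, TN, FP) as percentages of the total dataset, copied from Table 1.
TABLE_1: Dict[str, Tuple[int, int, int, int]] = {
    "SVM": (37, 7, 47, 9),
    "LR": (38, 6, 49, 8),
    "XGBoost": (40, 3, 52, 4),
    "LightGBM": (36, 8, 47, 9),
    "ResNet50": (42, 1, 53, 3),
    "VGG16": (42, 1, 44, 13),
}


def derived_rates(tp: float, fn: float, tn: float, fp: float) -> Dict[str, float]:
    """Convert confusion-matrix shares into per-class rates."""
    return {
        "recall": tp / (tp + fn),        # share of spam that is caught
        "specificity": tn / (tn + fp),   # share of ham that is left alone
        "precision": tp / (tp + fp),     # share of spam alarms that are correct
    }


for model, shares in TABLE_1.items():
    rates = derived_rates(*shares)
    print(
        f"{model:9s} recall={rates['recall']:.2f} "
        f"specificity={rates['specificity']:.2f} precision={rates['precision']:.2f}"
    )
```

Read this way, VGG16 and ResNet50 catch a similar share of spam, but VGG16's 13% false positives translate into a clearly lower specificity and precision.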

This study evaluated six machine learning and deep learning models for image-based spam detection, comparing them by AUC score, ROC curve, and AUC variance. Among all models, ResNet50 achieved the highest AUC score of 0.99 with the lowest variance of 0.000009, demonstrating exceptional and consistent classification performance. XGBoost followed with an AUC of 0.97 and variance of 0.000049, benefiting from its ensemble-based learning capability to effectively separate spam from ham images. Logistic Regression also performed reliably, achieving an AUC of 0.94 with a variance of 0.000171, making it a lightweight yet effective model for this task.

While VGG16 achieved a slightly lower AUC of 0.93 and variance of 0.000167, it showed better performance when RGB images were used instead of grayscale and with appropriate hyperparameter tuning. LightGBM and SVM models both achieved an AUC of 0.91, with LightGBM showing a slightly lower variance of 0.000215 compared to SVM’s variance of 0.000277, indicating similar performance but with SVM being slightly less stable. The AUC confidence intervals indicate the range within which the true AUC is expected to lie with 95% confidence, based on bootstrapped resampling. Models such as ResNet50 and XGBoost exhibit the narrowest confidence intervals ([0.985, 0.997] and [0.960, 0.987], respectively), reflecting high stability and consistent performance. In contrast, models like SVM and LightGBM have wider intervals, suggesting greater variability in their classification ability. Overall, the AUC confidence intervals add statistical rigor to the evaluation, enabling a more reliable comparison of model robustness.
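A bootstrapped 95% confidence interval for the AUC can be obtained along the following lines. This is a minimal sketch of a percentile bootstrap under stated assumptions: the exact resampling scheme, the number of resamples, and the seed used in this study are not specified here, so those values, the function names, and the synthetic scores are all illustrative.

```python
import random
from typing import List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """Rank-based AUC: fraction of (spam, ham) pairs ordered correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    labels: List[int],
    scores: List[float],
    n_resamples: int = 500,   # assumed value, chosen to keep the example fast
    alpha: float = 0.05,      # 95% interval
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes in the resample
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower_index = int(round((alpha / 2) * (n_resamples - 1)))
    upper_index = int(round((1 - alpha / 2) * (n_resamples - 1)))
    return stats[lower_index], stats[upper_index]


# Synthetic scores (not study data): spam tends to score higher than ham.
data_rng = random.Random(42)
labels = [1] * 30 + [0] * 30
scores = [data_rng.gauss(0.7, 0.15) for _ in range(30)] + [
    data_rng.gauss(0.3, 0.15) for _ in range(30)
]
low, high = bootstrap_auc_ci(labels, scores)
print(f"AUC={auc(labels, scores):.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

A narrow interval, as reported for ResNet50 and XGBoost, means that the AUC changes little from one resample to the next.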

The classification reports for each model, comprising recall, precision, and F1-score, are presented in Table 2. These values are the means across the five-fold cross-validation, which supports the consistency and robustness of each model’s performance across different data splits.

Among the models tested for image-based spam detection, VGG16 performed well in identifying spam, with a recall of 97%, meaning it detected most spam images. However, its precision for spam was 77%, indicating a considerable number of false positives. For non-spam, it achieved high precision (97%) but lower recall (76%). Despite this trade-off, its F1-score was 0.85 for spam and 0.86 for non-spam. Logistic Regression demonstrated well-balanced performance across all metrics. For spam, it achieved a precision of 84% and a recall of 88%; for non-spam, the precision and recall were 89% and 86%, respectively. The resulting F1-scores were 0.86 for spam and 0.88 for non-spam, indicating consistent and reliable classification performance across both classes. LightGBM maintained consistent performance, achieving a precision of 80% and a recall of 83% for spam, and a precision of 85% and a recall of 83% for non-spam. The corresponding F1-scores were 0.82 for spam and 0.84 for non-spam, indicating balanced effectiveness across both classes. SVM performed similarly and slightly better on some metrics, with an F1-score of 0.83 for spam and 0.85 for non-spam, showing a decent balance between both classes.

XGBoost and ResNet50 were the top performers. XGBoost achieved precision and recall scores of at least 91% for both classes, with an F1-score of 0.92 for spam and 0.93 for non-spam. ResNet50 outperformed all other models, achieving a precision of 93% and a recall of 97% for spam, and a precision of 97% and a recall of 94% for non-spam. The F1-score for both classes was 0.95, highlighting the model’s exceptional and consistent performance across the dataset.
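The F1-score is the harmonic mean of precision and recall, so the spam-class values in Table 2 can be approximately reproduced from the reported precision and recall. The sketch below does so; the names are introduced here for illustration. Because Table 2 reports means over five folds, the F1 computed from the averaged precision and recall can differ from the reported fold-averaged F1 by about 0.01, as it does for VGG16 and LightGBM.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the spam class, copied from Table 2.
TABLE_2_SPAM: Dict[str, Tuple[float, float, float]] = {
    "VGG16": (0.77, 0.97, 0.85),
    "Logistic Regression": (0.84, 0.88, 0.86),
    "LightGBM": (0.80, 0.83, 0.82),
    "SVM": (0.81, 0.85, 0.83),
    "XGBoost": (0.91, 0.93, 0.92),
    "ResNet50": (0.93, 0.97, 0.95),
}

for model, (precision, recall, reported) in TABLE_2_SPAM.items():
    recomputed = f1_score(precision, recall)
    print(f"{model:20s} recomputed F1={recomputed:.3f} reported F1={reported:.2f}")
```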

Top-performing Models

ResNet50, XGBoost, and Logistic Regression emerged as the most effective models, with ResNet50 leading both in performance and consistency. The high AUC and low variance values indicate that these models not only classified spam images correctly but also did so consistently across folds, making them suitable candidates for deployment in spam detection systems.

VGG16 and SVM Observations

While VGG16 is known for its success in image classification, its performance in grayscale image spam detection was significantly lower. However, the results improved considerably when RGB images and tuning techniques were applied. SVM achieved reasonable results but still produced notable false positives and false negatives (9% and 7% of the dataset, respectively; Table 1), indicating a need for further optimization.

In summary, ResNet50 outperformed all other models, showing strong generalization and minimal variance. XGBoost and Logistic Regression also demonstrated robust and stable results. LightGBM and SVM performed moderately well. Future research could explore the real-time deployment of deep learning models and assess their scalability and robustness in dynamic spam detection environments.

Validation with Different Dataset

To examine the models’ accuracy on unseen image-based spam, this study also performed cross-dataset validation. A different dataset from Kaggle, which included 811 ham images and 930 spam images, was used to validate the model’s performance [31]. The external dataset was preprocessed identically: its images were resized and passed through the same pipeline. Importantly, the model was neither re-trained nor subjected to hyperparameter tuning on this external dataset. Instead, it was used strictly for inference to assess the model’s robustness. Testing on a different dataset reduces the risk that the reported performance reflects overfitting to a single data source. On this dataset, ResNet50 achieved 98% accuracy, reinforcing its effectiveness in spam image classification. Due to its superior accuracy, low variance, and consistent performance across datasets, ResNet50 stands out as the most reliable model for this task.
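The inference-only protocol described above can be sketched as follows. The helper takes an already trained prediction function and never updates it; the function names, the toy predictor, and the toy samples are assumptions made for illustration and do not reproduce the study's pipeline or results.

```python
from typing import Any, Callable, Iterable, Tuple


def evaluate_frozen(
    predict: Callable[[Any], int],
    samples: Iterable[Tuple[Any, int]],
) -> float:
    """Accuracy of a fixed model on external data (no training, no tuning)."""
    correct = 0
    total = 0
    for image, label in samples:
        correct += int(predict(image) == label)
        total += 1
    if total == 0:
        raise ValueError("external dataset is empty")
    return correct / total


def toy_predict(image: float) -> int:
    """Hypothetical stand-in for a trained model: thresholds a single number."""
    return 1 if image > 0.5 else 0


# Toy external set: (preprocessed input, label) with 1 = spam, 0 = ham.
external_samples = [(0.9, 1), (0.8, 1), (0.2, 0), (0.6, 0)]
external_accuracy = evaluate_frozen(toy_predict, external_samples)
print(external_accuracy)  # 0.75: three of the four toy samples are classified correctly
```

In a real run, `predict` would wrap the trained network together with the same resizing and preprocessing steps that were applied to the primary dataset.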

The execution times recorded during inference varied across the models, reflecting differences in computational complexity. ResNet50 was relatively efficient for a deep learning model, completing inference in 38 s. XGBoost followed with a time of 48 s, benefiting from its optimized tree-based structure. LightGBM and SVM demonstrated similar performance, both completing in 50 s. VGG16 and Logistic Regression each required 54 s, with the latter being unexpectedly slower due to preprocessing overhead. These results demonstrate that while multiple models achieved efficient performance, ResNet50 combines high accuracy, minimal execution time, and strong generalization, making it the most suitable candidate for image-based spam detection systems.
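Inference time of this kind is typically measured by wrapping the prediction loop in a wall-clock timer. The sketch below shows one way to do so with Python's standard library; the helper name and the stand-in predictor are hypothetical, and the measurement setup of this study (hardware, batch size, inclusion of preprocessing) is not reproduced here.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple


def time_inference(
    predict: Callable[[Any], Any],
    inputs: Sequence[Any],
) -> Tuple[List[Any], float]:
    """Run predict on every input and return (outputs, elapsed seconds)."""
    start = time.perf_counter()
    outputs = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return outputs, elapsed


# Stand-in predictor; a real measurement would call the trained model instead.
outputs, elapsed_seconds = time_inference(lambda x: x * 2, [1, 2, 3])
print(outputs, f"{elapsed_seconds:.6f} s")
```

Whether preprocessing is inside or outside the timed region changes the result considerably, which is consistent with the overhead noted for Logistic Regression.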

6. Challenges

Complex Image Data: Images contain massive amounts of data and have many unique features that need to be analyzed. Finding useful features from images is a complex task.
Variation in Spam Content: Spam images can be different in terms of font, color, and content. This makes it difficult for a spam filter to distinguish between legitimate and spam images effectively.
Invisible Text in Images: Often, spam images contain embedded text, which is visually similar to legitimate content. Identifying this hidden text requires sophisticated techniques such as Optical Character Recognition (OCR), but even OCR can struggle with stylized fonts and distorted images.
Adaptive Nature of Spammers: Spammers continually adapt their strategies to evade detection. For example, they might use distortion, blending, or steganography to hide the actual content of the image.

Due to these challenges, machine learning models, especially those that can handle high-dimensional data, have become increasingly important in addressing the limitations of traditional methods.

7. Conclusions

Based on a rigorous evaluation of six machine learning and deep learning models, this study concludes that ResNet50 is the most effective model for image-based spam detection, achieving the highest AUC score of 0.99 with the lowest variance of 0.000009, indicating superior and consistent classification performance. XGBoost followed with a strong AUC of 0.97 and variance of 0.000049, while Logistic Regression also delivered high accuracy with an AUC of 0.94. VGG16 achieved an AUC of 0.93, slightly outperforming LightGBM, which recorded an AUC of 0.91, though both models demonstrated relatively stable performance. SVM also achieved an AUC of 0.91, albeit with a higher variance, suggesting slightly lower consistency. The best outcomes in terms of AUC and stability were achieved by ResNet50, XGBoost, and Logistic Regression. ResNet50 clearly outperformed all other models, while XGBoost and Logistic Regression provided strong and reliable results.

Further research could investigate ensemble methods that combine several machine learning models to enhance accuracy in image spam identification. Combining models in an ensemble framework may improve classification accuracy because it can offset the weaknesses of any single model. This line of research may lead to more reliable and robust spam detection systems.
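One simple form such an ensemble could take is soft voting, in which the spam probabilities of several models are averaged before thresholding. The sketch below illustrates the idea only; it was not evaluated in this study, and the function name, weights, threshold, and probability values are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def soft_vote(
    probabilities_per_model: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
    threshold: float = 0.5,
) -> Tuple[List[float], List[int]]:
    """Weighted average of per-model spam probabilities, then thresholding.

    probabilities_per_model[m][i] is model m's spam probability for image i.
    """
    n_models = len(probabilities_per_model)
    if weights is None:
        weights = [1.0] * n_models  # equal weighting by default
    total_weight = sum(weights)
    n_samples = len(probabilities_per_model[0])
    averaged = [
        sum(w * probs[i] for w, probs in zip(weights, probabilities_per_model)) / total_weight
        for i in range(n_samples)
    ]
    decisions = [1 if p >= threshold else 0 for p in averaged]
    return averaged, decisions


# Made-up probabilities from three hypothetical models for three images.
model_a = [0.9, 0.4, 0.2]
model_b = [0.8, 0.7, 0.1]
model_c = [0.6, 0.6, 0.3]
averaged, decisions = soft_vote([model_a, model_b, model_c])
print([round(p, 3) for p in averaged], decisions)
```

In the second toy image, one model votes ham on its own (0.4), but the averaged probability still crosses the threshold, which is the kind of single-model error an ensemble is meant to absorb.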

Author Contributions

Conceptualization, M.J., H.M.T., A.P.-M. and V.D.; methodology, M.J., H.M.T., A.P.-M. and V.D.; software, M.J.; validation, M.J.; formal analysis, M.J.; investigation, M.J.; resources, M.J., H.M.T. and A.P.-M.; data curation, M.J.; Funding acquisition, R.C.; writing—original draft preparation, M.J.; writing—review and editing, M.J., H.M.T., A.P.-M. and V.D.; visualization, M.J.; supervision, H.M.T., A.P.-M., V.D. and R.C.; project administration, R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in two publicly accessible repositories. The primary dataset is available in the GitHub repository ImageSpamDetection at https://github.com/mahnoorjjamil/ImageSpamDetection/, accessed on 12 March 2025. An additional dataset used for external validation is available in the Kaggle repository Spam Image Dataset at https://www.kaggle.com/datasets/asifjamal123/spam-image-dataset?resource=download, accessed on 12 March 2025.

Acknowledgments

This work was supported partially by the European Union in the framework of ERASMUS MUNDUS, Project CyberMACS #101082683 and Faculty of Computer Science and Engineering at Ss. Cyril and Methodius University in Skopje.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Shaukat, K.; Luo, S.; Varadharajan, V.; Hameed, I.; Xu, M. A Survey on Machine Learning Techniques for Cyber Security in the Last Decade. IEEE Access 2020, 8, 222310–222354.
2. Dada, G.; Bassi, J.; Chiroma, H.; Abdulhamid, S.; Adetunmbi, A.; Ajibuwa, O. Machine learning for email spam filtering: Review, approaches and open research problems. Heliyon 2019, 5, e01802.
3. Al-Duwairi, B.; Khater, I.; Al-Jarrah, O. Detecting Image Spam Using Image Texture Features. Int. J. Inf. Secur. Res. IJISR 2013, 3, 344–353.
4. Nam, S.G.; Lee, G.D.; Seo, Y.S. Spam Image Detection Model based on Deep Learning for Improving Spam Filter. J. Inf. Process. Syst. 2023, 19, 289–301.
5. Kontsewaya, Y.; Antonov, E.; Artamonov, A. Evaluating the Effectiveness of Machine Learning Methods for Spam Detection. Procedia Comput. Sci. 2021, 190, 479–486.
6. Altunay, H.C.; Albayrak, Z. SMS Spam Detection System Based on Deep Learning Architectures for Turkish and English Messages. Appl. Sci. 2024, 14, 11804.
7. Mangena, M.V.; Pande, S.D.; Umekar, P.; Mahore, T.; Kalyankar, D. Comparative analysis of detection of email spam with the aid of machine learning approaches. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1022, 012113.
8. Rayan, A. Analysis of e-mail spam detection using a novel machine learning-based hybrid bagging technique. Comput. Intell. Neurosci. 2022, 2022, 2500772.
9. Yaseen, Y.; Abbas, A.; Sana, A. Image Spam Detection Using Machine Learning and Natural Language Processing. J. Southwest Jiaotong Univ. 2020, 55, 41.
10. Annadatha, A.; Stamp, M. Image Spam Analysis and Detection. J. Comput. Virol. Hacking Tech. 2018, 14, 39–52.
11. Sheneamer, A. Comparison of Deep and Traditional Learning Methods for Email Spam Filtering. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2021, 12, 0120164.
12. Rungta, A.; Arya, B.; Usha, G. Image Spam Filtering using Machine Learning Techniques. Int. J. Recent Technol. Eng. IJRTE 2019, 8, 186–190.
13. Salama, W.M.; Aly, M.H.; Abouelseoud, Y. Deep learning-based spam image filtering. Alex. Eng. J. 2023, 68, 461–468.
14. Sharmin, T.; Troia, F.; Potika, K.; Stamp, M. Convolutional Neural Networks for Image Spam Detection. Inf. Secur. J. Glob. Perspect. 2020, 29, 103–117.
15. Jiao, W.; Hao, X.; Qin, C. The Image Classification Method with CNN-XGBoost Model Based on Adaptive Particle Swarm Optimization. Data Model. Predict. Anal. Inf. 2021, 12, 156.
16. Siddique, Z.; Khan, M.; Din, I.; Almogren, A.; Mohiuddin, I.; Nazir, S. Machine Learning-Based Detection of Spam Emails. Sci. Program. 2021, 2021, 6508784.
17. Fan, A.; Yang, Z. Image spam filtering using convolutional neural networks. Pers. Ubiquit. Comput. 2018, 22, 1029–1037.
18. Cunningham, P.; Cord, M.; Delany, S.J. Supervised Learning. In Machine Learning Techniques for Multimedia. Cognitive Technologies; Springer: Berlin/Heidelberg, Germany, 2008.
19. Wize Up AI. 99+ Machine Learning Algorithms. Medium 2023. Available online: https://medium.com/@twizeupai/99-machine-learning-algorithms-e300b1acfe7b (accessed on 2 February 2025).
20. Chavda, A.; Potika, K.; Troia, D.F.; Stamp, M. Support Vector Machines for Image Spam Analysis. Int. Workshop Behav. Anal. Syst. Secur. 2018, 1, 431–441.
21. Yang, H.; Qin, G.; Liu, Z.; Hu, Y.; Dai, Q. LightGBM robust optimization algorithm based on topological data analysis. In Proceedings of the International Conference on Computing and Multimedia Technology, Sanming, China, 24–26 May 2024.
22. Microsoft Corporation. LightGBM’s documentation. In Microsoft Docs; Microsoft Corporation: Washington, DC, USA, 2025; Available online: https://lightgbm.readthedocs.io/en/stable/ (accessed on 8 January 2025).
23. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), San Francisco, CA, USA, 13–17 August 2016.
24. Kim, B.; Abuadbba, S.; Kim, H. DeepCapture: Image Spam Detection Using Deep Learning and Data Augmentation. In Proceedings of the Australasian Conference on Information Security and Privacy, Perth, WA, Australia, 30 November–2 December 2020.
25. Keskin, S.; Sevli, O. Machine Learning Based Classification for Spam Detection. Sak. Univ. J. Sci. 2024, 28, 270–282.
26. Adnan, M.; Imam, M.O.; Javed, M.F.; Murtza, I. Improving spam email classification accuracy using ensemble techniques: A stacking approach. Int. J. Inf. Secur. 2024, 23, 505–517.
27. Yang, H.; Liu, Q.; Zhou, S.; Luo, Y. A Spam Filtering Method Based on Multi-Modal Fusion. Appl. Sci. 2019, 9, 1152.
28. Gao, Y.; Yang, M.; Zhao, X. Image Spam Hunter Dataset. EECS Department, Northwestern University, 2008. Available online: http://www.cs.northwestern.edu/~yga751/ML/ISH.htm (accessed on 12 March 2025).
29. Jamil, M. Github Image Spam Detection Repository. In GitHub; 2025 Mahnoor Jamil, Pakistan. Available online: https://github.com/mahnoorjjamil/ImageSpamDetection/ (accessed on 10 May 2025).
30. Mumuni, A.; Mumuni, F. Data augmentation: A comprehensive survey of modern approaches. Array 2022, 16, 100258.
31. Jamal, A.; Parida, N. Spam Image Dataset. In Kaggle; 2023 Asif Jamal. Available online: https://www.kaggle.com/datasets/asifjamal123/spam-image-dataset (accessed on 23 March 2025).

Figure 1. Machine learning models [19].

Figure 2. Original image (left); augmented samples generated through data augmentation (right).

Figure 3. Original format of ham images.

Figure 4. Grayscale conversion of ham images.

Figure 5. Original format of spam images.

Figure 6. Grayscale conversion of spam images.

Figure 7. Spam detection results.

Figure 8. No spam/ham detected.

Figure 9. Misclassification of spam.

Figure 10. No misclassification of spam.

Figure 11. ROC curve comparison.

Figure 12. Training and validation loss curves.

Table 1. Confusion matrix components and AUC scores (as % of total dataset).

Model	TP (%)	FN (%)	TN (%)	FP (%)	AUC	Variance	95% Conf. Int.
SVM	37	7	47	9	0.91	0.000277	[0.872, 0.939]
LR	38	6	49	8	0.94	0.000171	[0.909, 0.961]
XGBoost	40	3	52	4	0.97	0.000049	[0.960, 0.987]
LightGBM	36	8	47	9	0.91	0.000215	[0.880, 0.937]
ResNet50	42	1	53	3	0.99	0.000009	[0.985, 0.997]
VGG16	42	1	44	13	0.93	0.000167	[0.907, 0.956]

Table 2. Comparison of F1-score, precision and recall.

Model	Spam Precision	Spam Recall	Spam F1-Score	Ham Precision	Ham Recall	Ham F1-Score
VGG16	0.77	0.97	0.85	0.97	0.76	0.86
Logistic Regression	0.84	0.88	0.86	0.89	0.86	0.88
LightGBM	0.80	0.83	0.82	0.85	0.83	0.84
SVM	0.81	0.85	0.83	0.87	0.83	0.85
XGBoost	0.91	0.93	0.92	0.94	0.92	0.93
ResNet50	0.93	0.97	0.95	0.97	0.94	0.95

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
