1. Introduction
Today, email is at the center of digital interactions as an indispensable part of individual and corporate communication. According to Statista, the number of emails sent worldwide each day reached approximately 361 billion in 2024 and is expected to exceed 408 billion by 2027 [1]. However, this massive volume of communication also brings serious security risks. One of the most common and persistent of these risks is unsolicited commercial messaging, commonly known as spam email. These messages, sent without the user’s consent, often carry cyber threats such as deceptive advertisements, phishing attacks, and malware distribution. Spam accounted for 45.6% of total email traffic at the end of 2023 and rose to 46.8% by the end of 2024 [2], and this share is expected to continue rising through 2027. These figures indicate that nearly one in every two emails is spam [3].
As shown in Figure 1, email traffic has steadily increased from approximately 281 billion messages in 2018 to 361 billion in 2024 and is expected to reach 408 billion by 2027. This upward trend highlights the growing scale of the spam problem and underscores the need for more advanced and scalable detection systems. The rise in volume also parallels the increasing risk and financial impact of spam-related cyberattacks, reinforcing the urgency of robust filtering and adaptive security models.
The harm caused by spam emails extends beyond cluttering inboxes. For businesses, it leads to lost productivity, wasted bandwidth, storage overload, and costly investments in IT security. More critically, spam can result in data breaches, financial losses, and reputational damage. Studies show that email is the most common medium for phishing and that per-incident losses in SMEs can reach hundreds of thousands of dollars. For example, Carroll et al. (2022) report that one in every 4200 emails sent in 2020 was a phishing message [4], and Le, Le-Dinh, and Uwizeyemungu (2024) report that cyber incident costs in SMEs ranged from $826 to $653,587 (average ≈ $25,000) [5].
As shown in Figure 2, the damage caused by spam attacks to individual users and corporations increases every year. With the rise of these threats, legal regulations aimed at protecting data privacy and user rights have also been tightened worldwide. Regulations such as the General Data Protection Regulation (GDPR) in the European Union and the Personal Data Protection Law (KVKK) in Turkey impose strict rules on the processing of individuals’ personal data and on unsolicited communications. These laws significantly restrict the sending of spam and encourage organizations to obtain user consent while developing effective spam filtering mechanisms. Therefore, the automatic and highly accurate detection and blocking of spam emails has evolved from a mere convenience into a legal requirement and a critical component of cybersecurity strategy. Given this legal and technological landscape, researchers and developers have increasingly turned to methods rooted in artificial intelligence, especially natural language processing and machine learning, as viable tools to combat spam more effectively.
Machine Learning and Natural Language Processing (NLP)-based methods are widely used to detect spam, thanks to their ability to capture complex patterns in text. However, many research papers rely on limited datasets, which can lead to overfitting and poor generalization to new types of emails. This limits the ability of the developed models to generalize across different types and structures of spam or ham emails, potentially giving an incomplete picture of their performance in real-world scenarios. Additionally, many studies focus on a single algorithm or model architecture, neither comparing different approaches comprehensively nor proposing more powerful hybrid systems that combine them. For example, while some studies only test traditional machine learning algorithms, others focus on variations of a specific deep learning model, but do not systematically compare these two main paradigms or different deep learning architectures within the same experimental framework. These shortcomings can negatively affect the robustness of the developed spam detection systems and their adaptation to different email distributions.
Although many previous studies have reported high accuracy scores, often exceeding 99% on specific datasets, these results have often been obtained on single-source, homogeneous, or relatively small datasets. This reliance on narrow data distributions poses a significant challenge in terms of generalizability, as models often fail to perform robustly when exposed to the diverse and ever-evolving nature of real-world spam. A fundamental limitation in the literature is the lack of studies that validate high-performance models on large-scale, heterogeneous datasets compiled from multiple and diverse sources.
To address these fundamental gaps, this study makes several methodological contributions. The novelty of this work lies not in the invention of a new classification algorithm, but in its rigorous and comprehensive approach to model evaluation, which directly confronts the issues of data homogeneity and limited comparative analysis prevalent in the literature. Specifically, the contributions are threefold:
First, a large-scale, heterogeneous benchmark dataset is constructed by merging seven distinct public corpora. By training and testing on this diverse collection of over 81,000 emails, this study provides a more realistic and challenging environment for assessing the true robustness and generalizability of spam detection models—a critical step toward creating systems that are effective in real-world scenarios.
Second, a comprehensive comparative analysis is conducted across different modeling paradigms, including classical machine learning, deep learning, and state-of-the-art Transformer-based architectures. By evaluating all models under identical data conditions and evaluation protocols, this work serves as a much-needed, comprehensive benchmark for the field.
Third, the study addresses a significant limitation of the existing literature: the predominant focus on static performance, which overlooks the non-stationary nature of spam. Spam tactics are not static; they evolve over time in response to new technologies, social trends, and anti-spam countermeasures. This phenomenon, known as ‘concept drift’, poses a critical threat to the long-term viability of spam detection models. A model achieving high accuracy on a dataset from 2005 may fail catastrophically against the sophisticated phishing and social engineering attacks of 2022. Yet empirical studies quantifying this performance degradation on large-scale, public email corpora remain scarce.
Therefore, in addition to establishing a robust performance benchmark, this study conducts a novel temporal analysis to empirically measure the impact of concept drift on both traditional and Transformer-based models. By segmenting the heterogeneous dataset into ‘classic’ and ‘modern’ eras, models were trained on historical data and their resilience against contemporary spam was evaluated. This analysis not only quantifies the performance decay on a large-scale, heterogeneous corpus but also provides empirical evidence for the mechanisms behind this decay (keyword shift vs. contextual pattern drift), offering a crucial justification for the necessity of adaptive security models.
The datasets were transformed into a common structure based solely on the label and text columns, thus creating a combined dataset containing a total of 81,586 email examples that are rich in content, source, and style. The class imbalance observed in the dataset (31,670 spam, 49,916 ham) was addressed using random oversampling of the minority class, with the aim of enabling classification algorithms to learn both classes in a balanced manner.
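To make the merging and balancing steps concrete, the following minimal Python sketch illustrates the general procedure: corpora are mapped to the common label/text schema, exact duplicates are removed, the data are split before any oversampling, and the minority class is randomly oversampled on the training split only. File names, column names, and parameters are hypothetical placeholders rather than the exact settings used in this study.

# Minimal sketch of corpus merging, deduplication, and class balancing.
# File names and original column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

sources = {
    "enron.csv":        {"label": "Category", "text": "Message"},
    "spamassassin.csv": {"label": "target",   "text": "body"},
    # ... remaining corpora mapped the same way; label values assumed normalized to 'spam'/'ham'
}

frames = []
for path, cols in sources.items():
    df = pd.read_csv(path)
    df = df.rename(columns={cols["label"]: "label", cols["text"]: "text"})
    frames.append(df[["label", "text"]])

combined = pd.concat(frames, ignore_index=True)
combined = combined.drop_duplicates(subset="text")          # remove exact duplicates across corpora

# Split BEFORE oversampling so duplicated copies never leak into validation/test.
train, test = train_test_split(combined, test_size=0.2,
                               stratify=combined["label"], random_state=42)

# Random oversampling of the minority (spam) class on the training split only.
spam = train[train["label"] == "spam"]
ham = train[train["label"] == "ham"]
spam_up = spam.sample(n=len(ham), replace=True, random_state=42)
train_balanced = pd.concat([ham, spam_up]).sample(frac=1.0, random_state=42)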
During the modeling process, a variety of classical machine learning algorithms, including Support Vector Machines (SVM), Random Forest (RF), Naive Bayes (NB), and Logistic Regression (LR), were first applied and their performances compared. Following these traditional approaches, the focus of the study shifted to more advanced deep learning architectures. In particular, Distilled Bidirectional Encoder Representations from Transformers (DistilBERT) and Robustly Optimized BERT Pretraining Approach (RoBERTa), Transformer-based models that have achieved great success in text classification in recent years, were trained and evaluated to extract deep semantic representations from email texts. To achieve even higher performance than these models provide individually, a multimodal deep learning architecture was developed. This architecture combines text embeddings from the DistilBERT model with structural and statistical numerical features of emails, including text length, word count, and capitalization rate. The goal of this hybrid approach is to achieve stronger classification by considering both the semantic content and the surface characteristics of the text. All models were compared using the same combined dataset and evaluation metrics, with the multimodal (DistilBERT + Numerical Features) model achieving the highest test accuracy of 99.62% across the entire dataset.
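The fusion idea behind the multimodal model can be sketched as follows: a DistilBERT encoder supplies a contextual sentence embedding, which is concatenated with a handful of structural features (text length, word count, capitalization rate) and passed to a small classification head. Layer sizes, the feature set, and the use of the [CLS]-position vector are illustrative assumptions, not the exact architecture described later in the paper.

# Sketch of the multimodal idea: DistilBERT embeddings fused with simple structural features.
import torch
import torch.nn as nn
from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

def structural_features(text: str) -> torch.Tensor:
    # Illustrative structural cues: character length, word count, capitalization rate.
    words = text.split()
    caps = sum(c.isupper() for c in text) / max(len(text), 1)
    return torch.tensor([len(text), len(words), caps], dtype=torch.float32)

class MultimodalSpamClassifier(nn.Module):
    def __init__(self, n_numeric: int = 3, hidden: int = 128):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(encoder.config.dim + n_numeric, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),              # spam vs. ham
        )

    def forward(self, input_ids, attention_mask, numeric):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS]-position embedding
        return self.head(torch.cat([cls, numeric], dim=1))

enc = tokenizer(["Win a FREE voucher now!!!"], truncation=True, padding=True, return_tensors="pt")
numeric = structural_features("Win a FREE voucher now!!!").unsqueeze(0)
logits = MultimodalSpamClassifier()(enc["input_ids"], enc["attention_mask"], numeric)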
The study did not focus solely on classification performance. It also examined spam and ham email contents at the word level, analyzed frequently used words, and identified class-specific linguistic patterns. In this regard, the study aims to contribute to the field of spam email detection both theoretically and practically by offering data diversity, comprehensive model comparison (including traditional machine learning, different Transformers, and multimodal approaches), and in-depth content analysis. This combination distinguishes the work from studies limited to a single dataset or a narrow set of models.
The structure of the remainder of this paper is outlined as follows:
Section 2 presents an extensive review of the literature, encompassing datasets, rule-based techniques, classical machine learning, deep learning, and Transformer-based methods, while also identifying existing gaps and emerging trends.
Section 3 details the methodology adopted in this study, including the dataset merging process, preprocessing steps, class balancing techniques, feature extraction methods, and the modeling approaches applied.
Section 4 presents the experimental results, offering a comparative analysis of the model performances, an evaluation of practical applicability, and a discussion of observed limitations. Finally, Section 5 summarizes the main findings and proposes recommendations for future research directions.
2. Literature Review
This section presents a concise review of the literature on spam email detection, focusing on both the datasets used and the methods applied. The quality and structure of datasets significantly influence model performance, while detection techniques have evolved from rule-based systems to machine learning, deep learning, and transformer-based language models. By examining recent studies in these areas, this review aims to highlight the progress made and the current research gaps.
2.1. Datasets Used and Access Sources
The datasets used in spam email detection are among the critical components that directly affect model performance. One of the most commonly used datasets, the SpamAssassin dataset, is provided by the Apache Software Foundation and contains real-world spam and ham email samples. This text-based dataset has served for many years as a fundamental resource for evaluating classical methods such as NB [6]. Another important resource is the Enron Spam dataset. Created by processing real corporate email archives belonging to the Enron company, it was prepared by Carnegie Mellon University. Containing approximately 43,000 emails, it stands out for its balanced presentation of both spam and ham content, and it is of great value for adapting spam filtering systems to real-world conditions because it reflects corporate communication [7]. The Ling-Spam dataset is a small but balanced dataset that pairs academic and technical content with spam emails. It provides effective results in distinguishing academic email content from spam text and serves as a reference, especially for early spam filtering studies [8]. The SMS Spam dataset is the most frequently used source for testing systems on short text formats. Created by Almeida and Hidalgo, it contains approximately 5500 short messages and has a compact structure for spam classification; it is widely used in spam detection studies in mobile communication [9]. In addition, various email datasets accessible through the Kaggle platform are used in both academic and applied studies. These datasets are generally created, labeled, and shared by users in Comma-Separated Values (CSV) format. In this study, in addition to the SpamAssassin, Enron, Ling-Spam, and SMS Spam datasets mentioned above, two further datasets shared on Kaggle were also brought together. These datasets were converted into a common structure by retaining only the label and text columns, then cleaned and balanced to prepare them for model training. The use of multiple datasets allows the model to learn different types of spam content without being tied to a specific source, which increases the generalizability and real-world applicability of the developed classification system. In conclusion, the datasets used in spam email detection research are a decisive factor not only for model performance but also for the validity of the study. Combining datasets from different sources is still not widely practiced in the literature, and each new study in this field contributes unique insights by providing data diversity.
2.2. Rule-Based Systems
The detection of spam emails has become one of the most important subtopics of digital security since the late 1990s, when internet use became widespread. Early work in this area was generally based on rule-based systems. Methods such as blacklisting certain words or sender addresses, recognizing special character patterns in message headers, or filtering based on content length were frequently used during this period. However, these fixed rules became insufficient over time due to the constantly changing content of spam messages and proved unsustainable because they required continuous manual intervention. To address these shortcomings, machine learning-based automatic classifiers were developed. Studies conducted in the early 2000s demonstrated that statistical learning methods could process email text in a more flexible and adaptable manner.
2.3. Studies Conducted Using Classical Machine Learning Methods
Machine learning algorithms have been quite successful in the field of spam email detection since they were first applied. In these early studies, supervised learning algorithms such as Decision Trees (DT), SVM, NB and LR stood out. These methods primarily work by classifying features extracted from text data, typically making decisions based on term frequency (TF), inverse document frequency (IDF), and specific content structure. Due to their simplicity, short training times, and interpretability, these methods have formed the foundation of many systems for years.
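A minimal sketch of this classical pipeline, assuming scikit-learn and illustrative toy data, pairs a TF-IDF vectorizer with an NB or SVM classifier:

# Sketch of the classical baseline: TF-IDF features fed to standard classifiers.
# Parameter choices are illustrative defaults, not the exact settings of any cited study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["Claim your free prize now", "Meeting moved to 3 pm, see the agenda attached"]
labels = [1, 0]                                   # 1 = spam, 0 = ham (toy examples)

nb = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

nb.fit(texts, labels)
svm.fit(texts, labels)
print(nb.predict(["Free vouchers, click now"]))   # expected: [1] for this toy example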
One of the most well-known studies in this field was conducted by Metsis and colleagues in 2006. In this study, various NB variants were tested comparatively, and it was reported that the classic NB model offered a fast, lightweight, and highly effective solution for spam detection. In experiments using the SpamAssassin dataset, the model was reported to achieve accuracy rates of up to 96% [10]. Additionally, in another study conducted by Almeida and Hidalgo in 2011, methods such as SVM and NB were evaluated on SMS spam datasets, and these classical models achieved 98% accuracy [11]. These studies are important in demonstrating how effective classical methods can be, especially with small and balanced datasets. However, with the increasing diversity of linguistic structures in spam content and the evolution of attack techniques, the flexibility and adaptability of these traditional methods have begun to fall short over time.
Table 1 presents recent studies that employed classical machine learning methods for spam email detection.
As presented in Table 1, classical machine learning methods continue to demonstrate strong performance, particularly on well-structured and balanced datasets, yielding high accuracy and F1 scores. These approaches commonly leverage algorithms such as SVM, NB, LR, and RF, often in conjunction with feature extraction techniques such as TF-IDF. Salihi [12], for example, employed Word2Vec embeddings combined with a Multi-Layer Perceptron (MLP) classifier to detect spam in social media content; however, the reliance on Twitter data, rather than email-based datasets, limits the generalizability of the findings. Dedetürk and Akay [14] applied an LR model enhanced by the Artificial Bee Colony algorithm to spam emails collected from multiple sources, achieving an accuracy exceeding 98% and thereby highlighting the effectiveness of meta-heuristic optimization techniques. Junnarkar et al. [16] conducted a comparative evaluation of various classical models and identified RF as the most effective. Similarly, Rayan [17] proposed a hybrid bagging approach that further improved predictive accuracy. Alsuwit et al. [18] reported comparable performance by benchmarking classical methods against a basic artificial neural network (ANN). Nevertheless, a common limitation among these studies is the absence of direct comparisons with deep learning models and the use of domain-specific or relatively small datasets, which may hinder the broader applicability of the results in more complex or linguistically diverse environments.
2.4. Studies Conducted Using Deep Learning Methods
Deep learning methods, developed to overcome the limited pattern recognition capabilities of machine learning, have rapidly become widespread in spam email detection, especially since 2015. Unlike classical models, deep learning architectures can extract features directly from data and learn complex structures in language, enabling them to achieve higher accuracy. Among the most commonly used architectures are Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models. The CNN architecture successfully learns local patterns within text, especially in messages containing short phrases. LSTM, on the other hand, takes into account time-dependent data structures and can produce effective results in long texts with contextual meaning, such as emails. In a comprehensive review published by Jáñez-Martino and colleagues in 2022, it was noted that LSTM-based models achieved Receiver Operating Characteristic—Area Under Curve (ROC-AUC) scores above 0.98 in spam email detection, while CNN models attracted attention with their effective performance on short messages. This study emphasized that deep learning architectures provide a clear advantage over classical methods, especially on data with high linguistic complexity [20]. In addition, in a study conducted by Zavrak and Yılmaz in 2023, features extracted using a CNN were transferred to an LSTM layer, and the resulting features were evaluated with an RF classifier to design a hybrid model. This three-stage approach achieved an impressive score of 99.2 [21]. Such hybrid systems have made it possible to develop solutions that are both flexible and highly accurate by combining the representational power of deep learning with the decision-making mechanisms of classical methods. In conclusion, deep learning approaches have paved the way for systems that can better understand and learn context in spam detection tasks, enabling the development of more robust filters, especially when dealing with large datasets and a wide variety of spam types.
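The hybrid idea reported by Zavrak and Yılmaz can be sketched roughly as follows: a convolutional layer extracts local n-gram patterns, an LSTM summarizes the sequence, and the resulting vector is handed to an RF classifier for the final decision. The dimensions, tokenization, and toy data below are illustrative assumptions, not the original authors' configuration.

# Sketch of a CNN -> LSTM -> Random Forest hybrid with illustrative dimensions.
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class CnnLstmExtractor(nn.Module):
    def __init__(self, vocab_size=20000, emb=128, channels=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)           # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)      # (batch, seq_len, channels)
        _, (h, _) = self.lstm(x)
        return h[-1]                                      # (batch, hidden) sequence summary

extractor = CnnLstmExtractor()
token_ids = torch.randint(0, 20000, (32, 200))            # toy batch of tokenized emails
with torch.no_grad():
    features = extractor(token_ids).numpy()

labels = [0, 1] * 16                                       # toy labels; real labels in practice
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(features, labels)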
Table 2 highlights recent studies utilizing deep learning methods such as CNN and LSTM in spam detection tasks. These models offer improved performance by learning contextual patterns and language structures directly from the data.
As highlighted in Table 2, recent studies using deep learning techniques such as CNN, LSTM, and Bidirectional Encoder Representations from Transformers (BERT) have made significant progress in spam detection. These models eliminate the dependence on manual feature engineering by directly learning layered linguistic structures from text. For example, Basyar et al. [22] demonstrated the effectiveness of Gated Recurrent Unit (GRU) and LSTM architectures by achieving an accuracy rate of over 99% on the Enron dataset. Others have extended spam detection to multilingual contexts; for example, Siddique et al. [23] contributed to linguistic diversity in this field by working on Urdu datasets. Additionally, researchers such as Nasreen et al. [26] have enhanced foundational models such as BERT by incorporating new feature selection methods, demonstrating ongoing innovation in this field.
While these models offer exceptional performance and robustness, it is crucial to evaluate their success within context. The reported high accuracy rates are typically demonstrated on specific and sometimes homogeneous datasets (e.g., Enron, Ling-Spam, Spambase). This does not guarantee that the models will effectively generalize to the diverse range of spam tactics and linguistic styles found in more varied, multi-source email corpora. Additionally, as noted, their practical application may be constrained by high computational requirements and the need for large labeled datasets for fine-tuning. This underscores the critical need for studies that not only push the boundaries of accuracy but also validate model robustness and generalizability on larger, more challenging, and heterogeneous benchmarks.
2.5. Transformer and Language Model-Based Studies
In recent years, a significant advancement in spam email detection has been the use of pre-trained language models and Transformer architectures, which have greatly transformed NLP. Transformer architectures can learn sentence context bidirectionally and model the complex structure of language powerfully thanks to their multi-layered structure. The most well-known model in this context is Bidirectional Encoder Representations from Transformers (BERT).
Meléndez et al. presented a comparative performance analysis of traditional machine learning models and Transformer-based deep learning models for the detection of phishing emails [28]. They compared Transformer models such as DistilBERT, BERT, RoBERTa, XLNet, and A Lite BERT (ALBERT) with traditional methods such as LR, SVM, and NB on 119,148 English email samples obtained from various publicly available sources. The results showed that the RoBERTa model achieved the highest performance with an F1-score of 99.51% and an accuracy of 99.43%. They also demonstrated that traditional models struggled to detect phishing content, particularly content with complex language structures.
Similarly, Chandan et al. compared BERT with traditional machine learning algorithms in spam detection [29]. Their results showed that BERT achieved the highest test accuracy of 98% and outperformed the LR, Multivariate NB, SVM, and RF algorithms.
In a 2021 study by AbdulNabi and Yaseen, the BERT model was used for spam email detection and compared with classical methods as well as deep learning architectures such as Bidirectional LSTM (BiLSTM). In this study, the BERT model achieved 98.67% accuracy and a 98.66% F1 score, outperforming all alternative methods [30]. The results showed that BERT not only understands the text but also successfully detects deceptive language patterns, semantic deviations, and expressions that deviate from the context. BERT’s success highlights the importance of focusing not only on the word level but also on the meaning relationships within sentences in spam detection. In addition, the advantage of these models in detecting phishing and fraudulent messages plays a major role in preventing threats that cannot be detected by traditional methods. However, the high computational costs of Transformer architectures make it difficult to use these models directly in every application. As a result, recent research has focused on lighter and more efficient Transformer variants (such as DistilBERT and ALBERT) [31,32]. These models aim to achieve similar success with lower resource requirements and without compromising performance.
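As a rough illustration of how such lighter Transformer variants are typically applied to spam classification, the sketch below fine-tunes DistilBERT with the Hugging Face Trainer on a toy two-example dataset; the hyperparameters and checkpoint name are standard defaults rather than values taken from any of the cited studies.

# Sketch of fine-tuning DistilBERT for binary spam classification; settings are illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

data = Dataset.from_dict({"text": ["Free prize, click here", "See you at the meeting"],
                          "label": [1, 0]})
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data = data.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                              max_length=256), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="spam-distilbert", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=data).train()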
2.6. Trends and Gaps in the Literature
A review of the literature clearly shows that spam email detection methods have evolved over time. Starting with rule-based systems, this evolution has moved on to classical machine learning algorithms, then deep learning architectures, and finally Transformer-based pre-trained language models [3]. This transition has brought about significant advances not only in the methods used but also in data representation, contextual understanding, and modeling capabilities. Considering the success rates achieved in recent studies, Transformer-based models achieve the highest accuracy and F1 scores in spam detection [28,31,33]. However, this success also brings some challenges.
The training and operation of these models require considerable computational resources. Furthermore, the fact that most studies use English-focused datasets makes it difficult to replicate similar successes in multilingual settings or low-resource languages, as publicly available and balanced spam datasets for languages such as Turkish and Arabic are quite limited. Another significant challenge is the constantly changing structure of spam messages, especially with the proliferation of deceptive content generated by artificial intelligence. This problem is further compounded by the emergence of adversarial attacks, in which spam messages are intentionally manipulated with subtle, human-imperceptible perturbations to bypass detection. Recent studies, such as the work by Ali Owfi et al. [27], have demonstrated that even state-of-the-art models are susceptible to these evasion techniques, highlighting the importance of robust defense mechanisms such as adversarial training [28].
In recent years, there has been growing interest in integrating Explainable Artificial Intelligence (XAI) techniques into spam and phishing detection systems. While Transformer-based models such as BERT and DistilBERT achieve state-of-the-art performance, their decision-making processes often remain opaque, raising concerns about user trust and system transparency. To address this, researchers have explored methods such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) to provide both global and local interpretability. For example, Abdellaoui Alaoui et al. [34] demonstrated how SHAP explanations improved transparency in a BERT-based spam classifier, while Shafin et al. [35] proposed an explainable feature selection framework for phishing detection by combining SHAP and LIME. Similarly, SHAP-supported ensemble models have reached high accuracy (≈97–98%) and offered interpretable feature contributions that assist analysts in understanding model predictions [36]. For the hybrid approach, XAI could clarify the relative contributions of semantic Transformer embeddings and structural features, and help explain false positives, thereby improving model reliability. However, due to computational and resource constraints, XAI integration was not included in this study and is explicitly acknowledged as a limitation. Its systematic incorporation remains a priority for future work to further enhance interpretability and user trust.
Consequently, the primary goal of the research community has shifted towards developing systems that not only offer high accuracy but are also robust, updatable, efficient, and capable of real-time operation. In this context, the literature is showing significant progress at both the academic and industrial levels; next-generation AI-based spam filtering systems continue to play a critical role in secure email communication.
This work directly addresses some of the identified research gaps. First, seven heterogeneous, publicly available spam datasets were integrated into a unified corpus. This provides a broader and more representative base for model training, which is essential for improving generalizability beyond single-domain datasets. While the current work focuses on English emails, the diversity and size of the corpus lay the groundwork for future multilingual experiments, especially in low-resource languages such as Turkish and Arabic. Second, the proposed multimodal approach (combining Transformer-based contextual embeddings from DistilBERT with lightweight structural features) provides a balanced solution that maintains high semantic understanding while improving computational efficiency during both training and inference. Third, the flexibility of the feature set and model architecture enables periodic updates, adapting to evolving spam patterns, including AI-generated deceptive content. Finally, by emphasizing scalability, reproducibility, and real-time applicability, this work bridges the gap between theoretical research and the practical requirements of operational spam detection systems.
4. Findings and Evaluations
In this study, a comprehensive modeling process for spam email detection was carried out using a large and balanced dataset compiled from various sources. In this process, classical machine learning algorithms as well as deep learning and Transformer-based approaches were tested. Based on the results obtained, the models that achieved the highest success were identified, and their advantages and limitations were evaluated.
4.1. Performance Benchmark on a Combined Static Dataset
Among classical machine learning algorithms, the RF algorithm stands out with 98.85% accuracy and a high F1 score. Upon examining the confusion matrix in Figure 6, it is observed that while the model classifies spam data with high accuracy, it produces a higher number of false positive predictions on ham emails compared to the other models. This indicates that the model has high sensitivity on spam data but slightly lower specificity on the ham class.
The combination of CNN + LSTM + RF, which is based on deep learning models, attracted attention with an accuracy rate of 99.17%. Looking at the confusion matrix in Figure 7, it can be seen that the model classifies spam and ham samples more evenly than the RF model. The CNN layer learns local text patterns, the LSTM captures temporal structures, and the RF then performs the final classification effectively. This approach combines the best features of both classical and deep learning methods.
Among Transformer-based models, the highest success, 99.62% accuracy, was achieved by the multimodal architecture, which adds content-based statistical features to the language representations obtained with DistilBERT. The success of this model stems not only from the power of DistilBERT, which deeply understands the semantic content of the text, but also from the additional information provided by structural text features. As seen in the confusion matrix in Figure 8, the number of misclassified examples is significantly lower than for the other models. This result demonstrates the model’s capability to make decisions based not only on meaning but also on formal features. The combination of contextual meaning and surface-level features makes this model the most effective solution.
4.2. The Challenge of Concept Drift: A Temporal Performance Analysis
While the static benchmark provides a comprehensive overview of model capabilities on a diverse dataset, it does not capture the critical dimension of time. To address this, a temporal analysis was conducted as described in Section 3.5. The results reveal a significant degradation in performance, providing stark evidence of concept drift.
4.2.1. Quantifying Performance Degradation over Time
Figure 9 illustrates the performance of the RF and DistilBERT + Features models when trained exclusively on the Classic Era dataset and subsequently tested on both the Classic and Modern Era test sets. A sharp decline in performance is evident for both models when confronted with modern spam.
The RF model, which achieved a high F1-score of 98% on its native Classic Era data, experienced a catastrophic drop to 66% on the Modern Era data—a relative performance loss of over 32%. More critically, its precision for the spam class plummeted from 98% to just 50.3%, indicating an unacceptably high rate of false positives where legitimate emails are incorrectly flagged as spam. The DistilBERT + Features model, while more robust, was not immune. Its F1-score for spam detection decreased from 94% to 70%. Although its deeper semantic understanding provided a buffer against the temporal decay, this still represents a significant vulnerability.
4.2.2. The Linguistic Evolution of Spam Tactics
The underlying cause of this performance degradation is a fundamental shift in the language and tactics of spam.
Figure 10 visualizes the “signature” keywords that are most characteristic of each era, as determined by a log-odds ratio analysis. It is important to note that this method intentionally suppresses words that are common to both eras (such as ‘free’, ‘money’, ‘click’) to specifically highlight the terms that define the tactical differences between the periods.
The Classic Era (left) is defined by terms related to direct commercial offers, financial schemes, and rudimentary filter evasion tactics (e.g., eyewear, vibrator, jigget, equivalence). The spam of this period often resembles blatant product advertisements or uses “word salad” to confuse early statistical filters.
In stark contrast, the Modern Era (right) is characterized by a vocabulary centered on mobile services and psychological manipulation. Signature terms such as polyphonic, landline, and voucher highlight the shift towards mobile-centric scams and fake offers. Furthermore, words such as unredeemed and emotionally charged adjectives (unbreakable, irresistible) signify a move towards more sophisticated social engineering and phishing tactics that mimic legitimate service notifications and exploit user psychology.
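The era signatures above were obtained with a log-odds ratio analysis; a minimal sketch of that kind of computation, with add-one smoothing as an assumed choice, is shown below. Words frequent in both eras receive scores near zero, while strongly era-specific words receive large positive or negative scores.

# Sketch of an era "signature keyword" analysis via a smoothed log-odds ratio.
import math
from collections import Counter

def era_log_odds(classic_tokens, modern_tokens, alpha=1.0):
    # Smoothed log-odds of each word's frequency in the modern era versus the classic era.
    c, m = Counter(classic_tokens), Counter(modern_tokens)
    nc, nm = sum(c.values()), sum(m.values())
    vocab = set(c) | set(m)
    scores = {}
    for word in vocab:
        p_classic = (c[word] + alpha) / (nc + alpha * len(vocab))
        p_modern = (m[word] + alpha) / (nm + alpha * len(vocab))
        scores[word] = math.log(p_modern / p_classic)   # > 0: characteristic of the modern era
    return scores

scores = era_log_odds("free eyewear offer free".split(), "free voucher polyphonic voucher".split())
modern_signature = sorted(scores, key=scores.get, reverse=True)[:10]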
4.3. In-Depth Robustness and Generalization Analysis
This section advances beyond headline accuracy metrics in the range of 99–99.6% and places the results under a single, standardized evaluation protocol, emphasizing the contribution of feature fusion. We compare a broad set of traditional, neural, and Transformer baselines trained and tested with identical preprocessing and metrics. Our emphasis is on reliability and generalization across sources and on deployment-relevant behavior (e.g., stability at low false-positive operating points), rather than on marginal absolute gains that may not be practically meaningful. The multimodal variant—DistilBERT embeddings fused with a lightweight set of structural cues—serves to quantify the value of combining deep semantics with simple text-level regularities. All confidence intervals and p-values reported in this subsection are computed according to the protocol specified in Methods (Statistical Procedures).
4.3.1. Statistical Robustness and Calibration
The point estimates presented in Figure 5 are complemented by reporting statistical uncertainty, threshold-free metrics, paired significance, and probability calibration. Our goal is to characterize robustness and deployment-relevant behavior rather than repeat Accuracy/Precision/Recall/F1 numbers.
Table 5 summarizes the robustness of the models using 95% bootstrap confidence intervals (2000 stratified samples) and threshold-independent measurements, without repeating individual point metrics. Under the combined evaluation protocol and balanced test split (n = 4992), accuracy values converged at a level of ≳0.995, and due to the associated “ceiling effect,” the Accuracy/F1 intervals largely overlapped; this suggests that the observed differences remained within sampling variability and do not reflect a significant performance gap. However, threshold-independent metrics show small but consistent distinctions: the Multimodal (DistilBERT + Features) model achieved the highest values in PR-AUC and ROC-AUC (0.9998/0.9998), as well as in the Matthews Correlation Coefficient (MCC) (0.9948) and Balanced Accuracy (0.9974); RF and CNN + LSTM + RF fall within a statistically similar range on these metrics. The following sections complete this picture with paired significance (McNemar) and calibration findings.
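A sketch of the stratified bootstrap procedure behind these intervals is shown below; the metric, the 2000-resample count, and the 95% level follow the protocol described above, while the implementation details (per-class resampling with NumPy) are illustrative.

# Sketch of a 95% stratified bootstrap confidence interval for a classification metric.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=2000, alpha=0.05, seed=42):
    # y_true, y_pred: 1-D NumPy arrays of gold and predicted labels on the test split.
    rng = np.random.default_rng(seed)
    idx = np.arange(len(y_true))
    stats = []
    for _ in range(n_boot):
        # Stratified resampling: draw with replacement within each class separately.
        sample = np.concatenate([
            rng.choice(idx[y_true == c], size=int((y_true == c).sum()), replace=True)
            for c in np.unique(y_true)
        ])
        stats.append(metric(y_true[sample], y_pred[sample]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi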
Figure 11 illustrates the ROC curve (left) and the Precision-Recall curve (right) for the three models. The ROC panel emphasizes the operating points at false positive rates of 0.1%, 0.5%, and 1.0%. The PR panel shows the corresponding precision–recall operating points at these thresholds.
The McNemar test was applied to per-example correctness flags (1 = correct, 0 = incorrect) on the same samples. Here, b01 is the number of pairs where the contrast model is correct and Multimodal is incorrect, and b10 is the number of pairs where Multimodal is correct and the contrast model is incorrect. In both comparisons, the two-tailed p-values are p > 0.05, so there is no statistically significant difference on this test split (for Multimodal vs. RF, b10 − b01 = +10; although the trend favors Multimodal, p = 0.076 does not fall below the significance threshold).
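For reference, a minimal sketch of this paired comparison, using the statsmodels implementation of McNemar's test (an assumed dependency), is given below; pred_main stands for the Multimodal model and pred_contrast for the model it is compared against.

# Sketch of the paired McNemar test on per-example correctness flags.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(y_true, pred_main, pred_contrast):
    main_ok = (pred_main == y_true)
    contrast_ok = (pred_contrast == y_true)
    b01 = int(np.sum(contrast_ok & ~main_ok))    # contrast correct, Multimodal incorrect
    b10 = int(np.sum(main_ok & ~contrast_ok))    # Multimodal correct, contrast incorrect
    table = [[int(np.sum(main_ok & contrast_ok)), b10],
             [b01, int(np.sum(~main_ok & ~contrast_ok))]]
    result = mcnemar(table, exact=False, correction=True)
    return b01, b10, result.pvalue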
As illustrated in Figure 12, the reliability (calibration) curves indicate that probability estimates are well calibrated across models, with the multimodal approach showing the smallest miscalibration in the low-FPR operating region.
The findings in this subsection show that, according to the McNemar analysis in Table 6, there is no statistically significant difference between the top-performing models (two-tailed p-values > 0.05). However, in the threshold-independent measurements shown in Table 5 and Figure 11, the Multimodal (DistilBERT + Features) model consistently achieves the highest values and demonstrates superior sensitivity, particularly in the very low false-positive region (FPR = 0.1%).
In terms of probability quality, the deep models are noticeably better calibrated. Table 7 summarizes these results using the Expected Calibration Error (ECE) and Brier score. The CNN + LSTM + RF model attains the lowest ECE (0.0020), while the Multimodal model has the lowest Brier score (0.0024). In contrast, the higher ECE and Brier values for the RF model are consistent with its deviation in the reliability curve shown in Figure 12.
Although CNN + LSTM + RF attains the lowest ECE, the multimodal model delivers superior sensitivity in the very low-FPR region while maintaining competitive Brier performance. Therefore, the Multimodal approach is preferred as the primary model for field deployment and subsequent experiments, while the other two models are retained as a robust and explanatory basis for comparison.
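For completeness, the two calibration measures can be sketched as below; the Brier score comes directly from scikit-learn, while the ECE is computed with a simple equal-width binning scheme (the 10-bin choice here is an assumption, not necessarily the binning used for Table 7).

# Sketch of the calibration metrics: equal-width-bin ECE and the Brier score.
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, prob_spam, n_bins=10):
    # y_true: 1-D array of 0/1 labels; prob_spam: predicted probability of the spam class.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (prob_spam > lo) & (prob_spam <= hi)
        if in_bin.any():
            confidence = prob_spam[in_bin].mean()   # mean predicted probability in the bin
            accuracy = y_true[in_bin].mean()        # empirical spam rate in the bin
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece

# The Brier score is the companion proper scoring rule:
# brier = brier_score_loss(y_true, prob_spam)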
4.3.2. Cross-Dataset Generalization Analysis
It is critical to confirm that the high success metrics presented in this study stem from the model’s ability to generalize to data types it has never seen before, rather than from memorizing the distributions of the different corpora in the combined dataset. To this end, a two-stage robustness analysis was performed.
First, to prevent potential data leakage between different datasets, exact duplicate email texts across all corpora were identified and removed prior to merging. The separation of training, validation, and test sets was performed before any oversampling operations.
Second, and more importantly, a cross-validation strategy called “Leave-One-Corpus-Out” (LOCO) was applied. In this strategy, the models were trained on a combination of six of the seven datasets and tested on the seventh dataset, which they had never seen before. This process was repeated seven times, with each dataset being excluded in turn. This test measures how well the model can adapt to completely new emails from different sources and structures. The results of this comprehensive test are summarized in Table 8.
The best performance values are marked in bold. Dataset names have been standardized to ensure consistency with the definitions in Section 3.2.1.
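A minimal sketch of the LOCO loop is given below, using a TF-IDF + RF stand-in as the classifier and an assumed 'source' column recording each example's origin corpus; the same held-out-corpus loop applies unchanged to the Transformer-based models.

# Sketch of Leave-One-Corpus-Out evaluation: train on six corpora, test on the held-out one.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def loco_evaluation(combined_df):
    # combined_df is assumed to carry 'source', 'text', and 'label' columns.
    results = {}
    for corpus in combined_df["source"].unique():
        train = combined_df[combined_df["source"] != corpus]
        test = combined_df[combined_df["source"] == corpus]
        model = make_pipeline(TfidfVectorizer(max_features=50000),
                              RandomForestClassifier(n_estimators=200, random_state=42))
        model.fit(train["text"], train["label"])
        results[corpus] = f1_score(test["label"], model.predict(test["text"]),
                                   average="macro")
    return results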
The LOCO analysis results presented in Table 8 validate the superior generalization ability of the DistilBERT-based model across multiple metrics. The model outperforms the classical models by a wide margin in most scenarios, both in terms of accuracy and in metrics such as Macro-F1 and PR-AUC, which are more robust to class imbalance. In particular, the F1 and PR-AUC values exceeding 99% achieved on the Enron Spam dataset, which reflects the structure of corporate emails, demonstrate the model’s ability to adapt to complex content.
The analysis also highlights scenarios where the models struggle. The Ling-Spam corpus, which stands out from the others due to its linguistic and academic content, is where both models perform worst. The fact that DistilBERT performs significantly better than RF on all metrics even on this challenging dataset emphasizes the importance of deep semantic understanding.
Another noteworthy finding emerged in the SpamAssassin corpora, where RF performed better than DistilBERT. In particular, in the SpamAssassin (v2) dataset, despite DistilBERT’s high accuracy, its Macro-F1 score dropped to 49.61%, indicating that the model was largely unsuccessful in correctly predicting one of the classes (probably spam). The Macro-F1 score is the unweighted average of the per-class F1 scores; therefore, when the F1 score of one class approaches zero, Macro-F1 drops to around 50%. This indicates that the model cannot distinguish spam patterns in the SpamAssassin (v2) corpus and assigns almost all examples to a single class (likely ham). Additionally, the PR-AUC value being reported as ‘NA’ (Not Available) in the same scenario confirms this situation; this metric cannot be computed when the test set contains no examples of the positive class (spam). These findings suggest that spam patterns in the SpamAssassin datasets rely more on prominent keywords and that traditional feature extraction methods such as TF-IDF may be more stable than deep learning models in capturing such signals.
In conclusion, these comprehensive generalization tests show that the 99.62% headline performance score presented in the article not only reflects success on a mixture of datasets but also demonstrates the model’s robust generalization ability against unknown data distributions (in most cases). This analysis provides valuable insights for real-world applications by highlighting the model’s strengths and weaknesses.
4.4. Discussion on Contribution and Practical Relevance
Although the highest accuracy rate obtained, 99.62%, is numerically comparable to the highest scores reported in the literature, its significance lies in the specific context in which it was achieved. Unlike studies based on a single, homogeneous dataset, our model was tested on a large-scale, heterogeneous corpus. The contribution of this work is thus a demonstration of model resilience in a challenging, real-world-like scenario.
To better contextualize these findings for practical application, the models are evaluated under the operational constraints encountered by large real-world systems.
4.4.1. Performance in Deployment-Oriented Scenarios
A direct comparison with proprietary systems such as Gmail is not feasible due to their closed-source nature. However, the results can be contextualized by evaluating the models under the operational constraints faced by real-world systems, particularly regarding the prioritization of extremely low false-positive rates (FPR).
To achieve this, a dedicated threshold tuning process was conducted, as detailed in the methodology (Section 3.4.4), to identify the specific operating points relevant to the analysis. Reporting performance at fixed FPR points is more meaningful for practical deployment than relying on overall accuracy alone. Accuracy can be misleading under imbalanced data conditions; it is therefore recommended to present ROC-AUC and PR-AUC together and to focus on low-FPR regions [46,47]. Recent studies emphasize that the common claim that ROC-AUC is “inflated under imbalance” holds only conditionally, and that PR-AUC, owing to its prevalence sensitivity, should be interpreted in conjunction with threshold selection and operating points in practice [46,48]. In the security literature, low false-positive budgets (e.g., 0.1–1.0% FPR) are also highlighted as operational targets [49].
As shown in Table 9, fixing FPR = 0.1%/0.5%/1.0% corresponds to an expected rate of 1000/5000/10,000 false positives per 1 million legitimate messages, respectively. This conversion directly quantifies threshold selection under low-FPR constraints and aligns evaluation with operational error budgets. Since accuracy alone can be misleading, especially on imbalanced datasets, the results should be interpreted in conjunction with ROC-AUC and PR-AUC. Strict false-positive budgets are common in large-scale security deployments; therefore, low-FPR-focused reporting is preferred [49].
The results in Table 10 highlight the practical superiority of the proposed approach under these tuned, deployment-oriented conditions. At a stringent 0.1% FPR constraint, the Multimodal model achieves the highest sensitivity (TPR = 0.9848), outperforming both the classical RF (0.9692) and the deep learning hybrid (0.9351). In the 0.5–1.0% FPR range, the sensitivity of all three models is ≥0.995. This demonstrates the ability of the model developed in this study to maintain a high spam capture rate while minimizing the critical risk of misclassifying legitimate emails, a crucial capability for real-world deployment.
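Operationally, these fixed-FPR operating points can be read directly off the ROC curve; the sketch below shows the kind of computation involved, using scikit-learn's roc_curve and the three FPR budgets discussed above (implementation details are illustrative).

# Sketch of reading TPR and decision thresholds at fixed false-positive budgets.
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, spam_scores, budgets=(0.001, 0.005, 0.01)):
    # y_true: 0/1 labels; spam_scores: continuous spam scores or probabilities.
    fpr, tpr, thresholds = roc_curve(y_true, spam_scores)
    operating_points = {}
    for budget in budgets:
        ok = np.where(fpr <= budget)[0]
        i = ok[-1] if len(ok) else 0   # most permissive operating point within the budget
        operating_points[budget] = {"tpr": float(tpr[i]), "threshold": float(thresholds[i])}
    return operating_points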
4.4.2. Temporal Robustness: A Critical Dimension of Practical Relevance
While optimizing for low false-positive rates on a static dataset is crucial, the temporal analysis results reveal a deeper, more challenging aspect of practical relevance: the decay of model performance over time due to concept drift.
The results presented in Section 4.2 provide a stark warning. The catastrophic drop in precision to ~50–56% for models trained on historical data demonstrates that even a perfectly tuned static model can become unacceptably aggressive in the future, leading to a high volume of false positives. This performance degradation is a direct consequence of the tactical and linguistic evolution of spam, as illustrated by the shift in “signature” keywords (Figure 10). The vocabulary of spam has evolved from direct commercial offers (eyewear, vibrator) to mimicking legitimate services with terms such as polyphonic, unredeemed, and voucher.
This temporal vulnerability underscores a fundamental limitation of static evaluation. Although both models exhibit performance degradation, a key finding is the relative robustness of the DistilBERT + Features model compared to RF. The RF model, which relies on explicit keywords (TF-IDF features), demonstrates brittleness when the spam vocabulary shifts dramatically (Figure 10). The keywords associated with “Classic Era” spam (e.g., ‘vibrator’, ‘eyewear’) are largely absent in “Modern Era” spam, which employs a more subtle vocabulary mimicking legitimate services (e.g., ‘voucher’, ‘unredeemed’, ‘polyphonic’).
In contrast, the Transformer-based model’s superior resilience can be attributed to its ability to learn abstract, contextual patterns beyond specific keywords. The model learns not only which words are indicative of spam but also how they are used in malicious contexts (e.g., deceptive sentence structures, false urgency). Consequently, even when the specific lexicon of spam evolves, some underlying malicious patterns persist, providing the model with a discernible signal. This distinction—keyword dependency versus contextual pattern recognition—provides a critical insight into the mechanisms by which modern architectures can offer an inherent, albeit incomplete, defense against concept drift. Nevertheless, the significant performance drop observed in both models confirms that no static architecture is a substitute for a dynamic, continually learning system.
4.5. Content and Practicality Assessment
In addition to quantitative model evaluations, a content-based analysis was performed to better understand the linguistic patterns in the dataset.
As shown in Figure 13, the most frequent words in spam and ham (legitimate) emails differ significantly. To ensure that these word clouds genuinely reflect class-specific linguistic patterns rather than artifacts of dataset sizes, the word frequencies were normalized to account for the varying number of samples and text lengths across the different source corpora. This reveals that spam emails often feature persuasive and commercial terms such as “free,” “company,” and “click,” while ham emails include more neutral, professional, or communication-oriented vocabulary such as “group,” “university,” and “thanks.” These distinct lexical patterns clearly demonstrate the types of cues leveraged by the models for classification.
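The normalization step can be sketched simply: token counts are converted to relative frequencies per class (and, analogously, per source corpus) before the clouds are drawn, so corpus size and text length do not dominate the picture. The tokenization and toy examples below are illustrative.

# Sketch of per-class relative frequencies used to normalize word-cloud counts.
from collections import Counter

def normalized_frequencies(texts):
    # Relative token frequencies for one class (or one source corpus).
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

spam_freq = normalized_frequencies(["Free voucher click now", "Claim your free prize"])
ham_freq = normalized_frequencies(["Thanks, see the group agenda", "University meeting at noon"])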
However, achieving high accuracy by learning these patterns comes with a practical trade-off: computational cost. As detailed in Section 4.6, while a classical model such as RF can be trained in minutes, fine-tuning Transformer-based models requires significantly more computational power and time. This highlights the critical need to strike a balance between the highest predictive accuracy and the operational efficiency required for systems managing real-time, high-volume email traffic.
4.6. Computational Cost and Practicality Analysis
While predictive performance is a primary goal, a model’s practical utility is equally dependent on its computational efficiency and resource requirements. To provide a holistic view, this section analyzes the computational costs of the key models, emphasizing the trade-offs between accuracy, training time, and inference speed.
Table 11 provides a detailed comparative summary of these metrics.
The analysis in Table 11 reveals a clear performance-cost hierarchy across the model families. Classical machine-learning models (LR, RF, SVM) are highly efficient, training on a CPU within 3–9 min and using about 0.5–2.0 GB of RAM, while already achieving strong accuracy.
The Deep Learning hybrid (CNN + LSTM + RF) increases both performance and cost, requiring approximately 850 K trainable parameters and 55 min of training on a T4 GPU.
Transformer-based models represent the top tier of performance but at the highest computational cost. The RoBERTa model, with its 125 M parameters, is the most resource-intensive, requiring 85 min of training and 7.2 GB of VRAM. In contrast, the proposed DistilBERT + Features two-stage workflow achieves the best scores (Accuracy 0.9962, F1-Score 0.9961) with more moderate resources (4.0 GB VRAM) and a competitive end-to-end time of 48 min.
For real-world deployment, inference latency is a critical factor. While a baseline Transformer can classify an email in 74 ms on a GPU, this figure would be substantially higher on CPU-only infrastructure. The key advantage of the proposed model lies in its efficiency: once embeddings are pre-computed, the lightweight head (only 107,649 parameters) achieves an inference latency of less than 1 ms. This makes it exceptionally suitable for scalable, low-latency applications, offering a compelling balance of accuracy and efficiency. Given that false positives are highly costly in this setting, performance at very low FPR (≤0.1%) is prioritized. The multimodal model is therefore selected for deployment, as it combines low-FPR sensitivity with competitive calibration and efficiency.
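The deployment path implied by this two-stage design can be sketched as follows: the DistilBERT encoder pass is performed once per message (or in batches), and only the small fusion head runs at decision time. The head architecture shown mirrors the illustrative multimodal sketch earlier, not an exact replica of the deployed model.

# Sketch of the low-latency second stage: classifying from cached embeddings.
import torch

# Small fusion head; sizes mirror the illustrative multimodal sketch above.
head = torch.nn.Sequential(torch.nn.Linear(768 + 3, 128), torch.nn.ReLU(),
                           torch.nn.Linear(128, 2))

@torch.no_grad()
def classify_precomputed(cls_embedding, numeric_features):
    # cls_embedding: (batch, 768) tensor cached from the DistilBERT encoder pass
    # numeric_features: (batch, 3) structural features (length, word count, caps rate)
    logits = head(torch.cat([cls_embedding, numeric_features], dim=1))
    return logits.argmax(dim=1)                      # 1 = spam, 0 = ham

preds = classify_precomputed(torch.randn(4, 768), torch.rand(4, 3))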
4.7. Error Analysis and Model Limitations
Even the most successful model cannot achieve 100% performance, and its errors provide an opportunity to understand its weaknesses. The error analysis revealed the types of emails the model struggled with:
False Negatives (Spam that Escapes the Filter): Emails classified as “ham” by the model despite being spam are often sophisticated and personalized phishing texts that do not contain traditional spam keywords. For example, emails that mimic normal business correspondence but contain malicious links have been difficult to detect despite the model’s semantic analysis capabilities.
False Positives (Incorrectly Blocked Ham): Emails that are labeled as “spam” despite being ham are usually legitimate marketing newsletters or notifications. These emails, which contain phrases such as “offer,” “opportunity,” and “click now,” may be misclassified by the model because they structurally resemble spam. This situation carries the risk of important emails being lost for users.
These findings show that the model successfully analyzes both the semantic and structural features of the text, but there is still room for improvement in the most challenging and ambiguous cases (especially attacks targeting human psychology and legitimate marketing language). In addition to these intrinsic limitations, it is also important to acknowledge the scope of this study. A key limitation is that all datasets used were in English. Since real-world email communication often involves multiple languages, this restricts the immediate applicability of the framework in multilingual environments. Another limitation not addressed in the experiments is the threat of adversarial attacks, in which spam messages are intentionally manipulated to evade detection. Although adversarial training is considered an effective defense against such attacks, it was not used in this study for two main reasons. First, the primary objective of this research was a comprehensive comparative evaluation, rather than the development of new defense mechanisms. Second, adversarial training requires substantial computational resources and time that were beyond the scope of this study. Consequently, adversarial robustness is identified as a critical direction for future work. In conclusion, in text classification problems such as spam email detection, data preprocessing quality, feature diversity, and correct model selection directly affect success, but being aware of a model’s limitations is also of critical importance.
5. Results and Recommendations
This study addresses key limitations in spam detection research by providing a rigorous, multi-faceted evaluation framework. The primary contribution is methodological, focusing on robustness and generalizability by establishing a large-scale, heterogeneous benchmark from seven public corpora. This approach enables a realistic and fair comparison of classical, deep learning, and Transformer-based models, culminating in a lightweight multimodal classifier that achieves state-of-the-art performance with high consistency.
Furthermore, this work extends beyond a static evaluation to investigate the critical challenge of temporal concept drift. Through a protocol of training models on historical data (Classic Era) and testing them on contemporary threats (Modern Era), a significant performance degradation was quantified across all model types. The analysis suggests that the superior resilience of Transformer-based models stems from their ability to capture abstract contextual patterns, in contrast to classical models that exhibit brittleness due to their reliance on a rapidly evolving spam vocabulary. This additional investigation provides crucial empirical evidence on the necessity of adaptive systems, adding a critical dimension of practical relevance to the study’s findings.
The evaluation extends beyond single accuracy metrics. Crucially, by performing dedicated decision threshold tuning, the evaluation is shifted to focus on deployment-oriented metrics, specifically assessing performance at pre-defined low false-positive rate (FPR) operating points. This is complemented by robust statistical validation (significance tests, ROC/PR-AUC, MCC, and calibration). By grounding performance in these practical terms, this work provides a clearer and more reliable picture of what constitutes an effective spam detection system for large-scale, high-volume email environments.
In light of the findings of this study, several key areas are recommended for improvement in future research. Firstly, in terms of model reliability and interpretability, incorporating Explainable Artificial Intelligence (XAI) techniques such as LIME or SHAP is of critical importance. Although the DistilBERT-based model demonstrated high accuracy, its internal reasoning remains largely opaque, making it challenging to interpret how specific classification decisions are made. The application of XAI tools can improve transparency by clarifying why certain emails—particularly legitimate ones—are erroneously flagged as spam, thereby enhancing user trust and system usability. This necessity is also emphasized by Mun et al. (2025) [50], who showed through the PhiShield framework that XAI integration significantly advances model transparency and user confidence. Furthermore, to strengthen the model’s robustness against sophisticated phishing strategies, adversarial training techniques should be adopted. Such methods would enable the system not only to recognize known spam patterns but also to adapt to maliciously engineered content designed to evade detection. In future work, we plan to integrate adversarial training methods in order to increase the robustness of the proposed framework against deliberately manipulated spam content. In this context, adversarial examples will be generated using current text attack libraries (e.g., TextAttack, OpenAttack), and the model will be retrained with these enriched datasets and evaluated under real-world evasion scenarios.
Secondly, addressing efficiency and real-time applicability is crucial. Transformer-based models, despite their success, impose high computational costs that may hinder deployment in resource-limited environments. Optimization strategies such as model distillation, quantization, and pruning should be explored to reduce model complexity while maintaining performance. This will allow efficient operation on mobile devices or low-power servers, making the system more practical for real-time use.
Most importantly, this study moves beyond proposing time-aware evaluation as a future endeavor by providing a direct empirical investigation into this critical issue. A temporal analysis, representing a key contribution of this work, confirms that spam tactics are continuously evolving. Through a protocol of training models on historical data and testing them against modern threats, it was demonstrated that static training data indeed becomes obsolete, leading to significant performance degradation—a phenomenon known as concept drift.
The findings provide concrete evidence for the necessity of dynamic adaptation. The catastrophic rise in false positives when static models face contemporary spam highlights the limitations of traditional training paradigms. The results strongly support the conclusion that future systems must incorporate the following principles:
Continuous Learning and Adaptation: The empirical evidence presented herein justifies the need to move beyond static training. The demonstrated performance decay necessitates a shift towards adaptive systems capable of evolving over time. Future work should focus on developing and validating continuous learning pipelines that can maintain high performance against dynamically evolving spam tactics.
Longitudinal Evaluation: This study serves as a foundational longitudinal analysis. It has been shown that chronological data splits are not just a theoretical consideration but a practical necessity for revealing temporal drift. Future benchmarks and model evaluations should adopt this temporal perspective to gain deeper insights into the true long-term robustness of detection models.
Through these enhancements—integrating explainability, optimizing for efficiency, expanding to new modalities and languages, and, most critically, embracing dynamic and time-aware learning—spam detection models can become more interpretable, robust, and scalable in real-world environments.