You are currently viewing a new version of our website. To view the old version click .
Applied Sciences
  • Article
  • Open Access

3 February 2023

Hyperparameter Optimization of Ensemble Models for Spam Email Detection

and
1
Department of Mathematics and Computer Science, Federal University of Health Sciences, Otukpo P.M.B. 145, Nigeria
2
Department of Mathematics and Statistics, Federal University Kashere, Gombe P.M.B. 0182, Nigeria
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Application of Machine Learning in Text Mining

Abstract

Unsolicited emails, popularly referred to as spam, have remained one of the biggest threats to cybersecurity globally. More than half of the emails sent in 2021 were spam, resulting in huge financial losses. The tenacity and perpetual presence of the adversary, the spammer, has necessitated the need for improved efforts at filtering spam. This study, therefore, developed baseline models of random forest and extreme gradient boost (XGBoost) ensemble algorithms for the detection and classification of spam emails using the Enron1 dataset. The developed ensemble models were then optimized using the grid-search cross-validation technique to search the hyperparameter space for optimal hyperparameter values. The performance of the baseline (un-tuned) and the tuned models of both algorithms were evaluated and compared. The impact of hyperparameter tuning on both models was also examined. The findings of the experimental study revealed that the hyperparameter tuning improved the performance of both models when compared with the baseline models. The tuned RF and XGBoost models achieved an accuracy of 97.78% and 98.09%, a sensitivity of 98.44% and 98.84%, and an F1 score of 97.85% and 98.16%, respectively. The XGBoost model outperformed the random forest model. The developed XGBoost model is effective and efficient for spam email detection.

1. Introduction

Unsolicited emails, popularly referred to as spam, have remained one of the biggest threats to cybersecurity globally. Between October 2020 and September 2021, a total of 336.41 billion emails were sent globally, and about 84% (more than half) of these emails were spam [1]. The huge financial loss resulting from email fraud is quite enormous and increasing. According to the FBI center for crime complaint reports [2], in 2021 about USD2.4 billion was lost as a result of scams associated with business and email account compromises. In the same year, the bureau received 19,954 scam email complaints. The IC3 data also showed that 3729 ransomware incidents were reported with an associated financial loss of over USD49 million. According to the spam and phishing report by Kaspersky on Securelist [3], between February and June 2022, 1.8 million 419 scam emails were detected. These statistics imply that spammers are relentless. Researchers have continued to propose different techniques to combat the spam menace [4,5,6,7,8,9,10]. However, the tenacity and perpetual presence of the adversary, the spammer, has necessitated the need for improved efforts at filtering spam. A spam-filtering model with improved accuracy will help in the fight against spam-based fraud. Many current spam-email-detection techniques rely on a single model, which can be prone to errors and overfitting [10,11,12,13,14]. Ensemble models, which combine the predictions of multiple models, have the potential to improve the accuracy and robustness of spam detection. While ensemble models have been widely used in other areas of machine learning, they have not been widely applied to spam email detection. Hyperparameters, such as the number of decision trees in a random forest or the regularization parameter in an extreme gradient boost algorithm, can greatly affect the performance of a model. However, finding the optimal hyperparameters are often ignored because it is a time-consuming and computationally expensive task. Therefore, this study is aimed at the hyperparameter optimization of the random forest (RF) and extreme gradient boosting (XGBoost) ensemble algorithms. This is in a bid to enhance the predictive accuracy of the two ensemble models and to determine the best-performing model, robust enough for efficient spam email detection. Ensemble algorithms rely on a combination of predictions from two or more base models to obtain an improved prediction performance on a dataset [15]. In this study:
Spam-email-detection models based on the random forest and XGBoost machine-learning algorithms were developed.
The performances of the ensemble models were optimized through hyperparameter tuning.
The performances of the ensemble models were evaluated and compared before and after hyperparameter tuning.
The convergence time of the models were also established.
The other sections of this study are presented thus: In the second section, a brief highlight of related research on spam email classification and detection was presented. The third section described in detail the dataset and preprocessing techniques, methods, and performance evaluation metrics. The results of the experiments are presented in the fourth section. Finally, a conclusion was drawn with a perspective for further studies in the last section.

3. Methodology

3.1. Dataset

The Enron dataset was used in this research because it is the only substantial collection of an actual email that is public and also because of its high level of usage among researchers. The Enron dataset is made up of 6 main directories, each directory has several subdirectories, each containing emails as a single text file [28]. In this study, the Enron1 dataset was used. These emails were converted to a single CSV file by Marcel Wiechmann [29]. The CSV file contained about 33,000 emails. However, during conversion, some of the email messages were not correctly aligned with their labels. The non-aligning messages were removed alongside the orphaned labels through Microsoft Excel. On completion of the removal, the CSV file contained a total of 32,860 emails. Of the total emails, 16,026 (49%) are legitimate emails (ham) and 16,834 (51%) are spam. The CSV file contained five columns labeled message ID, subject, spam/ham, and date. The subject and date columns were not used in this study. The needed columns were readjusted as serial numbers. Column two contains the class label of each of the emails and column three contains the text of each email.

3.2. Dataset Cleaning

To improve the quality of classification, it is important to get rid of unwanted characters or features that constitute noise from the data. The cleaning activities and functions used are presented in Table 1.
Table 1. Collection of functions used in data cleaning.
The noise constituent of the dataset such as non-ASCII characters, HTML tags, extra white spaces, URLs, punctuations, numbers, and stop words were removed in steps 1–7 (Table 1). Stop words are a collection of words in any language that occur with a high frequency but convey considerably less meaningful information about the significance of an expression. The removal of stop words and other noise constituents shrinks the size of the data and reduces the burden of computational expenses in model training, with the potential of improving model performance since there are only meaningful words left to learn from. The essence of stemming is also to reduce the data size by reducing words to their root word. The text-mining map function, defined in the text-mining (tm) package [30], received the parameters specified in steps 5–8 to execute each cleaning activity. The DocumentTermMatrix() and removeSparseTerms() functions are also defined in the tm package in R.
On completion of the data cleaning process, the data was divided into two; the train set and the test set. The train set, which was composed of 70% (23,107) of the original dataset, has 11,282 legitimate emails and 11,825 spam emails (Figure 1). The test set, which is composed of 30% (9753) of the original dataset, has 4744 legitimate emails and 5009 spam emails.
Figure 1. Distribution of ham and spam emails in train and test dataset.
Baseline models of random forest and extreme gradient boost (XGBoost) were developed by training and testing each model independently with 70% and 30% of the preprocessed dataset (Figure 2). All the parameters were set to their default values during the training and testing of the baseline models. The performance of these models on the test data was recorded as the baseline performance to be improved via hyperparameter tuning. To reduce the computation time, only the important predictors were passed to the random forest. The random forest has an inbuilt feature for ranking variables or features based on their importance in arriving at a prediction [31].
Figure 2. The proposed model workflow.

3.3. Methods and Machine Learning Classifiers

3.3.1. Random Forest

Random forest is a supervised ensemble classifier that is used for classification and regression. It is an ensemble learning method that creates a set of decision trees and combines them to make a final prediction. To arrive at a prediction, the random forest follows these steps:
Step 1: Select a random sample of data from the dataset.
Step 2: Build a decision tree using the sample data.
Step 3: Repeat the process a certain number of times, creating a new decision tree each time.
Step 4: Combine the decision trees by taking the average of their predictions.
Each decision tree in the random forest makes a prediction, and the final prediction is made by taking the average of all the predictions made by the individual decision trees. This helps to reduce overfitting and improve the overall accuracy of the model. The two important hyperparameters that must be defined by the user when generating a random forest are mtry and ntree [32]. In a random forest, mtry is the number of features that are randomly sampled as candidates for splitting at each decision tree node and the ntree is the number of decision trees in the random forest [32]. The mtry parameter determines how much randomness is injected into the model. A smaller mtry value will make the model more deterministic and potentially more accurate but at the cost of a more complex model that is more prone to overfitting. A larger mtry value will make the model more robust to noise in the data, but at the cost of accuracy. The ntree parameter determines the overall complexity of the model. A larger value of ntree will make the model more accurate but at the cost of increased computational resources and longer training time. The number of trees in a forest (ntree) is not limited by computational resources, but the performance improvement from having a large number of trees is minimal, according to [33]. However, [34] states that computational resources are the limiting factor for the number of trees in a forest.

3.3.2. XGBoost

Extreme gradient boost (XGBoost) is a supervised ensemble machine-learning algorithm. The ensemble technique used by the algorithm is the boosting technique. The algorithm trains Classification and Regression Tree (CART) learners sequentially. Models are trained one after another, and the succeeding models are targeted at optimizing the loss function or residuals of the previous models in a bid to improve predictability. The algorithm works by calculating the similarity scores (Equation (1)) of the residuals of the independent email features at each node of the base CART [35].
S S = i = 1 n ( y i f ( x i ) ) 2 n λ
S S = S i m i l a r i t y   S c o r e
y i = t h e   v a l u e   o f   t h e   i t h   v a r i a b l e   t o   b e   p r e d i c t e d  
f ( x i ) = t h e   c u r r e n t   p r e d i c t i o n   f o r   i n d e p e n d e n t   f e a t u r e s  
n = t h e   n u m b e r   o f   r e s i d u a l s
λ = r e g u l a r i z a t i o n   p a r a m e t e r
The similarity score of each node within the tree is used to determine the information gain (Equation (2)).
G = S S a S S b
S S a = S i m i l a r i t y   S c o r e   a f t e r   s p l i t  
S S b = S i m i l a r i t y   S c o r e   b e f o r e   s p l i t  
The regularization parameter λ , reduces the effect of outliers and controls the pruning of the trees within the model. A higher value of λ will reduce the value of the similarity score which results in a smaller information gain (Equation (2)). The gain is compared with gamma γ , another important parameter used in the XGBoost algorithm. Gamma determines if there will be a split at a node. If gamma ( γ ) is less than gain (G), there will be a split at the node else, there will be no split. This serves as a mechanism to control overfitting. A high gamma would result in an extremely pruned tree. In this study, R default values of 0 and 1 were retained for γ and λ , respectively.
The prediction of the base CART model based on the selected leaf node is determined as:
O u t p u t = i = 1 n ( y i f ( x i ) ) n   λ
The sum of residuals is divided by the product of the number of residuals and the regularization parameter, λ . The predictions of successive CART models are calculated as in Equation (4).
N e w p = P r e v i o u s p + ( e t a O u t p u t )
  N e w p = t h e   p r e d i c t i o n   o f   t h e   c u r r e n t   C A R T   l e a r n e r  
P r e v i o u s p = t h e   p r e d i c t i o n   o f   t h e   P r e v i o u s   C A R T   l e a r n e r  
e t a = t h i s   i s   t h e   l e a r n i n g   r a t e
The output (Equation (3)) is the prediction of the base CART learner. The learning rate or eta is another important parameter that reduces the feature weight to prevent overfitting. It also determines how quickly a model will converge. The default value for this parameter in R is 0.3.
The workflow of the XGBoost algorithms discussed in Equations (1)–(4) can be summarized in seven steps as thus:
Step 1: Initialization—A set of decision-tree models is trained to classify a small set of emails as spam or not spam.
Step 2: Boosting—New models are added to the ensemble and trained to correct mistakes made by previous models by focusing on misclassified emails.
Step 3: Gradient Descent—Parameters of new models are optimized by minimizing the ensemble’s loss function using gradient descent.
Step 4: Regularization—Regularization is used to prevent overfitting by penalizing models for having too many parameters or complex decision boundaries.
Step 5: Pruning—The decision tree is pruned by removing leaves with low weight to prevent overfitting and improve generalization.
Step 6: Repeat—Steps 2 through 5 are repeated for a fixed number of iterations or until a stopping criterion is met, such as reaching a certain level of accuracy or the number of boosting rounds.
Step 7: Return—The final ensemble of base models is returned as the final model and can be used to classify new emails as spam or not spam by taking a majority vote of the base models.

3.3.3. Hyperparameter Tuning Technique–Grid Search Method

The grid-search technique is an exhaustive search technique for finding optimal hyperparameters over a manually specified range of subsets. The choice of the grid-search technique in this study, among other hyperparameter-tuning techniques such as random search, gradient search, and so on, was informed by the less complex implementation and the fact that its execution can be parallelized. In this study, the grid search was used to determine the optimal values for hyperparameters such as the number of trees (max_depth), mtry, number of iterations (nrounds), and the learning rate (eta). The experimental setup used in searching the hyperparameter space of both random forest and XGBoost is presented in Table 2.
Table 2. Hyperparameter settings for the algorithms.

3.3.4. Cross Validation

The efficiency of both RF and XGBoost was validated through K-fold cross-validation. This technique involves splitting the training set into K numbers of a subset. The K-1 subset of the entire dataset is trained and evaluated on a subset (i.e., 1-fold). This is repeated K number of times with a different subset of K used for evaluation at each iteration [36]. The essence is to rid the models of overfitting and sample bias. In this study K = 10. The 10-fold cross-validation was applied to the tuned models whereas the train/test split was applied to the baseline models.
The choice of random forest and XGBoost for this experimental study was informed by their high performance on similar tasks in other studies. Furthermore, the random forest model uses the bagging technique which reduces variance and avoids overfitting through inbuilt out-of-bag error estimates. The outcome of several predictions is aggregated to obtain a final result. The XGBoost model uses the boosting technique which sequentially tries to optimize the loss function of a preceding CART learner through successive ones.

4. Results and Discussion

4.1. Experimental Testbed and Settings

The computation was carried out on a Windows 10 Pro 64bit Operating system with 16 GB RAM, 500 GB SSD, and Corei-7-1165G7@2.8GHz processor. The algorithms were implemented in R programming using R version 4.2.1 via Rstudio Integrated Development Environment (IDE). The major libraries used are randomForest, xgboost, tm, Caret, ggplot2, and ROCR. The hyperparameter values were set as presented in Table 2.

4.2. Results and Discussion

To optimize the performance of the baseline random forest model, efforts were targeted at determining the best ntree and mtry by tuning these parameters through grid search. The mtry of the baseline random forest model is 23; hence, mtry values between 6 and 41 were examined during the first run (Table 2). After 10-fold cross-validation, accuracy was used to determine the optimal ntree and mtry which were 1500 and 7, respectively, as shown in Figure 3. At ntree = 1500 and mtry = 7, the highest training accuracy of 0.9750725 was obtained. It took about a week for the model to converge.
Figure 3. First run: model plot of the 10-fold cross-validation for the determination of best hyperparameter values for random forest.
From Figure 3, it can also be observed that as the value of mtry increases the predictive accuracy of the model was decreasing. This necessitated the need for a second run to examine mtry values between 1 and 7 given ntree values of 1000 and 1500. After about four days, the 10-fold cross-validation training converged, at the accuracy of 0.9752454, the optimal ntree = 1500, and mtry = 6 (Figure 4).
Figure 4. Second run: model plot of the 10-fold cross-validation for the determination of best hyperparameter values for random forest.
The grid search with 10-fold cross-validation based on accuracy returned the optimal values of 7560, and 0.25 for max.depth, nrounds, and eta, respectively, for the XGBoost model. From Figure 5, it can be observed that the accuracy improved progressively as the nrounds increased for each of the max.depths. However, as the number of boosting iterations (nrounds) tends toward 700, the improvement in accuracy became marginal. With a boosting iteration of 560, the accuracy peaked at 0.9781017 on the third face grid where max.depth is 7 and eta is 0.25.
Figure 5. The model plot of the 10-fold cross-validation for the determination of best hyperparameter values for XGBoost. The face grid shows the performance of the model in terms of accuracy at max.depth of 5, 6, and 7 with varying nrounds and learning rates (eta). The colsample_bytree was held constant at 0.6.

4.2.1. Performance Evaluation Metrics

The metrics that were used to evaluate the performance of the ensemble algorithms are accuracy, sensitivity (recall), precision, F1-score, specificity, and the Area under the Receiver Operating Curve (ROC) curve. The true positive (TP), true negative (TN), false positive (FP), and false negative (FN), which are components of the confusion matrix (Table 3), were used to compute the performance metrics. The confusion matrixes for the random forest baseline model, XGBoost baseline model, random forest tuned model, and the XGBoost tuned model are presented in Table 4.
Table 3. A confusion matrix and its constituents.
Table 4. Confusion matrix of the baseline and tuned spam detection models of random forest (RF) and extreme gradient boost (XGBoost).
where
Positive (P): Absolute number of spam emails.
Negative (N): Absolute number of ham emails.
True positive (TP): Absolute number of e-mails correctly classified as spam.
True negative (TN): Absolute number of e-mails classified as ham (not spam) and are truly ham.
False positive (FP): Absolute number of e-mails classified as spam, but they are not spam.
False negative (FN): Absolute number of e-mails classified as ham, but they are spam.

Accuracy

Accuracy measures the number of emails correctly predicted against the total number of predictions (Equation (5)). The accuracy metric is a reliable metric for performance evaluation, considering that there is no significant imbalance in the class distribution in the dataset used.
A c c u r a c y = ( T P + T N ) ( P + N )
The random forest baseline model, XGBoost baseline model, random forest tuned model, and XGBoost tuned model achieved an accuracy of 0.9752, 0.9748, 0.9778, and 0.9809, respectively. This implies that for all classes 98.09% of the emails were correctly predicted by the tuned XGBoost model which achieved the highest accuracy as shown in Figure 6.
Figure 6. Performance evaluation metric comparison of the baseline and tuned random forest and XGBoost models.

Sensitivity (Recall)

Sensitivity measures the number of positives that are accurately predicted as a positive class for all positive data points. This metric is determined as presented in Equation (6).
R e c a l l = T P ( T P + F N )
The random forest baseline model, XGBoost baseline model, random forest tuned model, and XGBoost tuned model achieved a sensitivity of 0.9796, 0.9896, 0.9844, and 0.9884, respectively. As observed in Figure 6, the XGBoost baseline model has the highest (sensitivity) spam email detection rate. However, it was outperformed by the other models in terms of accuracy, precision, specificity, and F1 score.

Precision

Precision measures the amount of accurately predicted spam emails against the total number of positive predictions. In other words, it measures the performance of the model with respect to the false positive. It is determined as thus:
P r e c i s i o n = T P ( T P + F P )
Figure 6 revealed that the tuned XGBoost model has the highest precision of 0.9748. This implies that for every email that was predicted spam, 97% of those emails are spam. The random forest base model, XGBoost-baseline model, and the random forest tuned model achieved a precision of 0.9723, 0.9623, and 0.9726, respectively.

F1-Score

The F1-score is a weighted mean of precision and sensitivity. It is determined as presented in Equation (8). The F1 score helps to measure the impact of false positives and false negatives in the models. In other words, it balances the impact of low or high precision and recall (sensitivity) on the model.
F 1 S c o r e = 2 P r e c i s i o n S e n s i t i v i t y P r e c i s i o n + S e n s i t i v i t y
As seen in Figure 6, the tuned XGBoost model achieved the highest F1 score of 0.9816. The random forest baseline model, XGBoost baseline model, and the random forest tuned model achieved an F1-score of 0.9759, 0.9758, and 0.9785, respectively.

Specificity

Specificity measures the ability of the model to classify hams correctly. It is the percentage of negatives that are correctly detected as a negative class. It is determined as presented in Equation (9). The tuned random forest and XGBoost models were better at detecting legitimate emails, with the XGBoost tuned model achieving the highest specificity of 0.9730.
S p e c i f i c i t y = T N ( T N + F P )

Receiver Operating Curve (ROC)

The ROC is a plot of the number of accurate predictions in the prediction of the positive class against the number of negative samples wrongly predicted as being in a positive class. It ranks the ability of the model to distinguish between the classes of emails. It measures the quality of predictions by the model. The Area Under the ROC Curve (AUC) measures the space under the ROC curve. The AUC ranged from 0 to 1. As the value of the AUC tends to 1, the better its distinguishing ability. The ROC curves for the random forest baseline model, XGBoost baseline model, random forest tuned model, and XGBoost tuned model are shown in Figure 7. The models’ Areas Under the Curve (AUC) are 0.9894, 0.9892, 0.9966, and 0.9978, as shown in Figure 7. The AUC for the random forest baseline model, with the highest AUC value of 0.9978, indicates that the tuned XGBoost model distinguished between the spam and ham emails better than the other models.
Figure 7. Area Under the Receiver Operating Curves of random forest and XGBoost models. The yellow plot line is the TPR against the FPR plot of the random forest-based model while the red, blue, and green are for the tuned random forest model, XGBoost baseline model, and XGBoost tuned model, respectively.
The hyperparameter tuning with 10-fold cross-validation significantly improved the performance of the ensemble models. Hyperparameter tuning does not improve model performance in all cases [37]. The tuned RF and XGBoost model performed better than their un-tuned baseline models for all metrics except for the XGBoost baseline model with the highest sensitivity. The tuned XGBoost model outperformed the tuned random forest model for all measures evaluated as shown in Figure 6 and Figure 7. It is also important to note that the XGBoost model is a faster algorithm (the XGBoost algorithm supports parallel computation) when compared with random forest. This is also established in the study by [38]. The random forest baseline model took 3762.38 s to train while it took 55.45 s to train the baseline XGBoost model. The hyperparameter tuning of the XGboost model took 23,925.48 s. The first search of the hyperparameter space for random forest took 585,857.11 s while the second run took 317,912.14 s.

5. Conclusions

This study evaluated and compared the performance of two ensemble models based on the random forest and extreme gradient boost ensemble algorithms. Baseline random forest and XGBoost spam detection models were developed based on the train/test split technique using the default parameters. The grid-search technique with 10-fold cross-validation was applied to search the hyperparameter space to determine the optimal hyperparameter values that optimized the performance of the random forest and XGBoost models. The performance of the baseline models was evaluated and compared with that of the tuned random forest and XGBoost models to examine the impact of hyperparameter tuning. The findings revealed that hyperparameter tuning improved the performance of the random forest and XGBoost models. The results also showed that the tuned XGBoost model outperformed the tuned random forest model for all metrics evaluated. The effectiveness of these ensemble models in spam email detection and classification was demonstrated.
It will be interesting to compare the XGBoost model and deep-learning models for spam detection in a future study in a bid to gain further insight into the development of efficient and effective spam email detection systems.
It is important to note that the distribution of classes in the dataset used in this study is complementarily balanced. The behavior of these models will be different with a significantly imbalanced dataset. Future studies will look at the performance of these models on an imbalanced dataset.

Author Contributions

Conceptualization, T.O.O.; methodology, T.O.O. and D.O.O.; software, T.O.O.; validation, T.O.O. and D.O.O.; formal analysis, T.O.O. and D.O.O.; investigation, T.O.O.; resources, T.O.O. and D.O.O.; data curation T.O.O. and D.O.O.; writing—original draft preparation, T.O.O. and D.O.O.; writing—review and editing, T.O.O. and D.O.O.; visualization, T.O.O. and D.O.O.; supervision, T.O.O.; project administration, T.O.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available in GITHUB at https://github.com/MWiechmann/enron_spam_data (accessed on 17 August 2022). Application for consent to use the dataset is not required.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dixon, S. Global Average Daily Spam Volume 2021. Available online: https://www.statista.com/statistics/1270424/daily-spam-volume-global/ (accessed on 18 July 2022).
  2. FBI. Federal Bureau of Investigation: Internet Crime Report 2021. Available online: https://www.ic3.gov/Media/PDF/AnnualReport/2021_IC3Report.pdf (accessed on 6 August 2022).
  3. Securelist Types of Text-Based Fraud. Available online: https://securelist.com/mail-text-scam/106926/ (accessed on 4 August 2022).
  4. Onova, C.U.; Omotehinwa, T.O. Development of a Machine Learning Model for Image-Based Email Spam Detection. FUOYE J. Eng. Technol. 2021, 6, 336–340. [Google Scholar] [CrossRef]
  5. Bindu, V.; Thomas, C. Knowledge Base Representation of Emails Using Ontology for Spam Filtering. Adv. Intell. Syst. Comput. 2021, 1133, 723–735. [Google Scholar] [CrossRef]
  6. Kaddoura, S.; Chandrasekaran, G.; Popescu, D.E.; Duraisamy, J.H. A Systematic Literature Review on Spam Content Detection and Classification. PeerJ Comput. Sci. 2022, 8, e830. [Google Scholar] [CrossRef] [PubMed]
  7. Méndez, J.R.; Cotos-Yañez, T.R.; Ruano-Ordás, D. A New Semantic-Based Feature Selection Method for Spam Filtering. Appl. Soft Comput. 2019, 76, 89–104. [Google Scholar] [CrossRef]
  8. Ahmed, N.; Amin, R.; Aldabbas, H.; Koundal, D.; Alouffi, B.; Shah, T. Machine Learning Techniques for Spam Detection in Email and IoT Platforms: Analysis and Research Challenges. Secur. Commun. Networks 2022, 2022, 1862888. [Google Scholar] [CrossRef]
  9. Hosseinalipour, A.; Ghanbarzadeh, R. A Novel Approach for Spam Detection Using Horse Herd Optimization Algorithm. Neural Comput. Appl. 2022, 34, 13091–13105. [Google Scholar] [CrossRef]
  10. Ismail, S.S.I.; Mansour, R.F.; Abd El-Aziz, R.M.; Taloba, A.I. Efficient E-Mail Spam Detection Strategy Using Genetic Decision Tree Processing with NLP Features. Comput. Intell. Neurosci. 2022, 2022, 7710005. [Google Scholar] [CrossRef]
  11. Ravi Kumar, G.; Murthuja, P.; Anjan Babu, G.; Nagamani, K. An Efficient Email Spam Detection Utilizing Machine Learning Approaches. Proc. Lect. Notes Data Eng. Commun. Technol. 2022, 96, 141–151. [Google Scholar]
  12. Kontsewaya, Y.; Antonov, E.; Artamonov, A. Evaluating the Effectiveness of Machine Learning Methods for Spam Detection. Procedia Comput. Sci. 2021, 190, 479–486. [Google Scholar] [CrossRef]
  13. Batra, J.; Jain, R.; Tikkiwal, V.A.; Chakraborty, A. A Comprehensive Study of Spam Detection in E-Mails Using Bio-Inspired Optimization Techniques. Int. J. Inf. Manag. Data Insights 2021, 1, 100006. [Google Scholar] [CrossRef]
  14. Dedeturk, B.K.; Akay, B. Spam Filtering Using a Logistic Regression Model Trained by an Artificial Bee Colony Algorithm. Appl. Soft Comput. J. 2020, 91, 106229. [Google Scholar] [CrossRef]
  15. Sagi, O.; Rokach, L. Ensemble Learning: A Survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249. [Google Scholar] [CrossRef]
  16. Sheu, J.J.; Chu, K.T.; Li, N.F.; Lee, C.C. An Efficient Incremental Learning Mechanism for Tracking Concept Drift in Spam Filtering. PLoS ONE 2017, 12, e0171518. [Google Scholar] [CrossRef]
  17. Liu, X.; Zou, P.; Zhang, W.; Zhou, J.; Dai, C.; Wang, F.; Zhang, X. CPSFS: A Credible Personalized Spam Filtering Scheme by Crowdsourcing. Wirel. Commun. Mob. Comput. 2017, 2017, 1457870. [Google Scholar] [CrossRef]
  18. Bahgat, E.M.; Rady, S.; Gad, W.; Moawad, I.F. Efficient Email Classification Approach Based on Semantic Methods. Ain Shams Eng. J. 2018, 9, 3259–3269. [Google Scholar] [CrossRef]
  19. Agarwal, K.; Kumar, T. Email Spam Detection Using Integrated Approach of Naïve Bayes and Particle Swarm Optimization. In Proceedings of the 2nd International Conference on Intelligent Computing and Control Systems, ICICCS 2018, Madurai, India, 14–15 June 2018; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 685–690. [Google Scholar]
  20. Dada, E.G.; Bassi, J.S.; Chiroma, H.; Abdulhamid, S.M.; Adetunmbi, A.O.; Ajibuwa, O.E. Machine Learning for Email Spam Filtering: Review, Approaches and Open Research Problems. Heliyon 2019, 5, e01802. [Google Scholar] [CrossRef] [PubMed]
  21. Saha, S.; DasGupta, S.; Das, S.K. Spam Mail Detection Using Data Mining: A Comparative Analysis. Smart Innov. Syst. Technol. 2019, 104, 571–580. [Google Scholar] [CrossRef]
  22. Nandhini, S.; Marseline, D.J. Performance Evaluation of Machine Learning Algorithms for Email Spam Detection. In Proceedings of the International Conference on Emerging Trends in Information Technology and Engineering, ic-ETITE 2020, Vellore, India, 24–25 February 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020. [Google Scholar]
  23. Guangjun, L.; Nazir, S.; Khan, H.U.; Haq, A.U. Spam Detection Approach for Secure Mobile Message Communication Using Machine Learning Algorithms. Secur. Commun. Networks 2020, 2020, 8873639. [Google Scholar] [CrossRef]
  24. Jancy Sickory Daisy, S.; Rijuvana Begum, A. Smart Material to Build Mail Spam Filtering Technique Using Naive Bayes and MRF Methodologies. Proc. Mater. Today 2021, 47, 446–452. [Google Scholar] [CrossRef]
  25. Xia, T.; Chen, X. A Weighted Feature Enhanced Hidden Markov Model for Spam SMS Filtering. Neurocomputing 2021, 444, 48–58. [Google Scholar] [CrossRef]
  26. Şimşek, H.; Aydemir, E. Classification of Unwanted E-Mails (Spam) with Turkish Text by Different Algorithms in Weka Program. J. Soft Comput. Artif. Intell. 2022, 3, 1–10. [Google Scholar] [CrossRef]
  27. Xia, T.; Chen, X. Category-Learning Attention Mechanism for Short Text Filtering. Neurocomputing 2022, 510, 15–23. [Google Scholar] [CrossRef]
  28. ENRON. The Enron-Spam Datasets. Available online: https://www2.aueb.gr/users/ion/data/enron-spam/ (accessed on 16 August 2022).
  29. Wiechmann, M. GitHub—MWiechmann/Enron_spam_data: The Enron-Spam Dataset Preprocessed in a Single, Clean Csv File. Available online: https://github.com/MWiechmann/enron_spam_data (accessed on 17 August 2022).
  30. Feinerer, I. Introduction to the Tm Package Text Mining in R. Available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf (accessed on 16 August 2022).
  31. Kolog, E.A.; Balogun, O.S.; Adjei, R.O.; Devine, S.N.O.; Atsa’am, D.D.; Dada, O.A.; Omotehinwa, T.O. Predictive Model for Early Detection of Mother’s Mode of Delivery with Feature Selection. In Delivering Distinctive Value in Emerging Economies; Anning-Dorson, T., Boateng, S.L., Boateng, R., Eds.; Productivity Press: New York, NY, USA, 2022; pp. 241–264. ISBN 9781003152217. [Google Scholar]
  32. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  33. Oshiro, T.M.; Perez, P.S.; Baranauskas, J.A. How Many Trees in a Random Forest? Proc. Lect. Notes Comput. Sci. 2012, 7376, 154–168. [Google Scholar]
  34. Guan, H.; Li, J.; Chapman, M.; Deng, F.; Ji, Z.; Yang, X. Integration of Orthoimagery and Lidar Data for Object-Based Urban Thematic Mapping Using Random Forests. Int. J. Remote Sens. 2013, 34, 5166–5186. [Google Scholar] [CrossRef]
  35. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  36. Oyewola, D.O.; Dada, E.G.; Omotehinwa, T.O.; Emebo, O.; Oluwagbemi, O.O. Application of Deep Learning Techniques and Bayesian Optimization with Tree Parzen Estimator in the Classification of Supply Chain Pricing Datasets of Health Medications. Appl. Sci. 2022, 12, 10166. [Google Scholar] [CrossRef]
  37. Hoque, K.E.; Aljamaan, H. Impact of Hyperparameter Tuning on Machine Learning Models in Stock Price Forecasting. IEEE Access 2021, 9, 163815–163830. [Google Scholar] [CrossRef]
  38. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting Algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.