Evaluating the Effectiveness of Large Language Models (LLMs) Versus Machine Learning (ML) in Identifying and Detecting Phishing Email Attempts
Abstract
1. Introduction
2. Research Objectives
- To compare the classification performance of large language models (LLMs) and traditional machine learning (ML) models in phishing email detection, using metrics such as precision, recall, F1 score, accuracy, and balanced accuracy.
- To assess the computational efficiency of both LLMs and ML models during training and fine-tuning, including processor usage, memory consumption, and overall resource requirements.
- To evaluate the capability of LLMs to handle complex and nuanced phishing content, highlighting their advancements in text classification.
- To analyze misclassification cases from both LLMs and ML models to identify potential areas for improvement in future phishing detection systems.
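The comparison metrics named in the first objective all derive from the four confusion-matrix counts; a minimal stdlib sketch for binary labels (1 = phishing, 0 = legitimate), independent of any particular model:

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and balanced accuracy
    from binary labels (1 = phishing, 0 = legitimate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # true-positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true-negative rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
        # Balanced accuracy averages the per-class recalls, so it is
        # informative on the imbalanced datasets compared later.
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Toy example: 4 phishing and 4 legitimate emails
m = metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
```

Balanced accuracy coincides with plain accuracy only when the two classes are equally represented, which is why both are reported for the imbalanced and balanced dataset variants.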
3. Materials and Methods
3.1. Overview of Dataset Collection
3.2. Data Processing
3.2.1. Text Cleaning and Normalization
3.2.2. Data Preparation for ML
3.2.3. Data Preparation for LLMs
3.3. Model Selection for ML and LLMs
3.4. Data Splitting
3.5. Evaluation Metrics
3.6. Experimental Configuration
3.6.1. ML Model Training Configuration
- max_features = 20,000
- ngram_range = (1, 2)
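With ngram_range = (1, 2), the vectorizer indexes both single words and adjacent word pairs, and max_features = 20,000 caps the vocabulary at the most frequent terms. A stdlib sketch of what the n-gram extraction step produces (the TF-IDF weighting itself is omitted):

```python
def ngrams(text, ngram_range=(1, 2)):
    """Return the word n-grams a vectorizer with this ngram_range would index."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

feats = ngrams("verify your account now")
# unigrams: 'verify', 'your', 'account', 'now'
# bigrams:  'verify your', 'your account', 'account now'
```

Bigrams such as "verify your" capture short phishing phrasings that individual words miss, at the cost of a much larger vocabulary, hence the max_features cap.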
3.6.2. LLM Training Configuration
3.7. A Pipeline for Phishing Email Detection Using Vectorization and Tokenization
4. Results
4.1. Results for ML Approaches
4.1.1. Email Content Analysis Using ML
4.1.2. URL Content Model Analysis Using ML
4.1.3. ROC Curves of ML Models on Email and URL Data
4.2. Results for LLM Approaches
4.2.1. Email Content Analysis Using LLMs
4.2.2. URL Content Analysis Using LLMs
4.2.3. ROC Curves of LLMs on Email and URL Data
4.3. Analysis of Misclassifications
4.4. Prediction Error Analysis
4.4.1. Prediction Errors in ML Models
4.4.2. Prediction Errors in LLMs
4.5. Impact of Dataset Balancing on Accuracy
4.6. Impact of Dataset Size on Real-Time Performance
5. Discussion
5.1. Comparing the Effectiveness of ML and LLMs
5.2. Computational Training Demands: ML vs. LLMs
5.3. Handling Complex Content
5.4. Error Distribution
6. Conclusions and Future Research Suggestions
6.1. Conclusions
6.2. Future Research Suggestions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
AI | Artificial Intelligence |
AOL | America Online |
AUC | Area Under the Curve |
APWG | Anti-Phishing Working Group |
CyBOK | Cyber Security Body of Knowledge |
DL | Deep Learning |
DT | Decision Trees |
FBI | Federal Bureau of Investigation |
HTML | HyperText Markup Language |
IC3 | Internet Crime Complaint Center |
GPU | Graphics Processing Unit |
GUI | Graphical User Interface |
K-NN | K-Nearest Neighbors |
LLMs | Large Language Models |
LR | Logistic Regression |
ML | Machine Learning |
NLP | Natural Language Processing |
RF | Random Forest |
ROC | Receiver Operating Characteristic |
Appendix A
No. | Dataset Name | Phishing | Legitimate | Total | File Size |
---|---|---|---|---|---|
Zenodo | |||||
1 | Enron | 17,171 | 16,545 | 33,716 | 44.8 MB |
2 | Nigerian_Fraud | 5186 | 6742 | 11,928 | 9.2 MB |
3 | LingSpam | 481 | 2412 | 2893 | 9.3 MB |
4 | Nazario | 1561 | 1454 | 3015 | 7.8 MB |
5 | SpamAssassin | 1662 | 1135 | 2797 | 14.9 MB |
6 | TREC-06 | 3988 | 12,393 | 16,381 | 41.9 MB |
Hugging Face | |||||
1 | Phishing-huggingface | 32,702 | 44,975 | 77,677 | 5.71 MB |
Kaggle | |||||
1 | LLMs-Generated Emails | 1000 | 1000 | 2000 | 1.28 MB |
No. | Dataset Name | Phishing | Legitimate | Total | File Size |
---|---|---|---|---|---|
Mendeley Repository | |||||
1 | URL Dataset | 104,438 | 345,738 | 450,176 | 8.79 MB |
Appendix B
No. | Feature | Description |
---|---|---|
1 | Length of URL | Total length of the URL string. |
2 | Number of dots in the URL | Counts the number of periods in the URL that might indicate subdomains or unusual domain names. |
3 | Number of slashes in the URL | The count of slashes in the URL, excluding the protocol (http:// or https://). |
4 | Presence of ‘www’ | Checks whether “www” is present in the domain. |
5 | Presence of HTTP/HTTPS | Identifies if the URL starts with “http” or “https”. |
6 | Presence of a query string (?) | Checks for a query string in the URL, which may carry parameters used in phishing attempts. |
7 | Presence of a fragment (#) | Indicates if the URL contains a fragment, which is often used in phishing attempts to confuse users. |
8 | Number of query parameters | Counts the number of query parameters in the URL (indicated by ? and &). |
9 | Domain name length | Length of the domain name (excluding protocol and path). |
10 | Number of subdomains | Counts the number of subdomains in the URL’s domain. More subdomains might suggest a suspicious URL. |
11 | Presence of a port number | Indicates if the domain contains a port number (e.g., example.com:8080). |
12 | Presence of an IP address | Detects if the domain is an IP address rather than a domain name, which is often used in phishing. |
13 | Number of uppercase letters | Counts the uppercase letters in the URL, as phishing URLs sometimes use unusual capitalization. |
14 | Number of digits | Counts the number of digits in the URL, often used in phishing attempts to mimic legitimate URLs. |
15 | Presence of special characters | Counts the special characters, which could be used to mislead or confuse users. |
16 | Top-level domain (TLD) length | The length of the top-level domain (e.g., .com, .org) can be indicative of domain legitimacy. |
17 | Length of the domain name | Length of the domain name without subdomains. |
18 | Length of the path | The length of the URL path after the domain. |
19 | Number of parameters in the path | Counts how many parameters are in the URL path, often used in phishing to mask malicious content. |
20 | Number of subdirectories in the path | Identifies how many subdirectories are in the URL path, often a characteristic of phishing sites. |
21 | Presence of a secure connection | Checks if the URL uses HTTPS, indicating a secure connection (or lack thereof). |
22 | Presence of login/register keyword | Detects if the URL contains “login” or “register”, which might be used for phishing login pages. |
23 | Number of underscores | Counts the number of underscores in the URL, as phishing URLs may contain underscores to imitate legitimate domains. |
24 | Presence of specific keywords (login, admin, secure) | Flags URLs containing specific keywords often found in phishing attempts, like “admin” or “login”. |
25 | Presence of file extensions | Checks if the URL ends with specific file extensions (e.g., .php, .html), common in phishing pages. |
26 | Presence of a session ID | Detects the presence of session identifiers in the URL, which could be used in phishing attacks. |
27 | Ratio of digits to characters | Measures the ratio of digits to other characters in the URL, which can help identify irregular or suspicious URLs. |
28 | Ratio of uppercase to lowercase letters | Measures the ratio of uppercase to lowercase letters, as phishing URLs often use odd capitalization patterns. |
29 | URL entropy (complexity) | Measures the entropy (complexity) of the URL, which can indicate whether the URL is randomly generated or suspiciously complex. |
30 | Presence of an email address | Checks if the URL contains an email address, which is often seen in phishing URLs to capture user data. |
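Most of the features above reduce to simple string computations over the raw URL. A stdlib sketch of a handful of them, including the Shannon-entropy measure of feature 29, which rises for randomly generated URLs:

```python
import math
import re
from urllib.parse import urlparse

def url_features(url):
    """Extract a few of the Appendix B features from a raw URL string."""
    host = urlparse(url).netloc.split(":")[0]  # strip any port number
    counts = {c: url.count(c) for c in set(url)}
    # Shannon entropy over the URL's characters (feature 29)
    entropy = -sum((n / len(url)) * math.log2(n / len(url))
                   for n in counts.values())
    return {
        "url_length": len(url),                         # feature 1
        "num_dots": url.count("."),                     # feature 2
        "has_ip_host": bool(                            # feature 12
            re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "num_digits": sum(ch.isdigit() for ch in url),  # feature 14
        "entropy": entropy,                             # feature 29
    }

f = url_features("http://192.168.0.1/login?id=42")
```

The remaining features follow the same pattern (substring checks, counts, and ratios), which is why this representation trains quickly with the classical ML models.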
Appendix C
ML Models | Description |
---|---|
Decision Tree (DT) | A nonparametric model that classifies data by recursively splitting it into nodes based on impurity measures, ending with tree building and pruning phases. |
Logistic Regression (LR) | A simple, widely used model for binary classification based on the logit function. |
Random Forest (RF) | An ensemble model of multiple decision trees, where each tree votes for the most common class using randomly selected data and attributes. |
Naïve Bayes (NB) | A probabilistic classifier based on Bayes’ theorem, assuming that all features are independent. It is efficient, especially with large datasets. |
Gradient Boosting (GB) | An ensemble model that builds predictors sequentially, with each new model correcting errors from the previous ones. It is powerful but sensitive to overfitting. |
K-Nearest Neighbors (KNN) | A distance-based classifier that labels data using the majority vote of the K nearest neighbors, known for its simplicity and efficiency. |
Support Vector Machine (SVM) | A supervised ML algorithm that finds the optimal hyperplane to classify data points by maximizing the margin between different classes. |
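The majority-vote principle behind KNN (and, per tree, behind RF) is compact enough to show directly. A stdlib sketch over numeric feature vectors, with K = 5 as in the configuration used in the experiments:

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Label `query` by the majority vote of its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda fv: math.dist(fv[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D example: points near 0 are legitimate, points near 10 phishing
train = ([([x], "legitimate") for x in (0, 1, 2, 3, 4)]
         + [([x], "phishing") for x in (9, 10, 11, 12, 13)])
label = knn_predict(train, [10.5])
```

In the actual experiments the feature vectors are the TF-IDF or URL-feature representations described in Section 3; the toy 1-D data here is purely illustrative.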
Appendix D
Model Name | Size | Architecture Details | Hugging Face References |
---|---|---|---|
ALBERT | ~12 M parameters | Shares parameters across all 12 layers and uses factorized embeddings (128 embedding size, 768 hidden size). Reduces redundancy and model size. | albert/albert-base-v2 (https://huggingface.co/albert/albert-base-v2) |
BERT-Tiny | ~4 M parameters | Minimal BERT variant with only 2 layers, 128 hidden size, and 2 attention heads. Designed for extremely lightweight tasks. | prajjwal1/bert-tiny (https://huggingface.co/prajjwal1/bert-tiny) |
DistilBERT | ~66 M parameters | 6-layer transformer distilled from BERT base (12 layers), with 768 hidden size. Offers ~97% of BERT’s performance with a smaller size and faster inference. | distilbert/distilbert-base-uncased (https://huggingface.co/distilbert/distilbert-base-uncased) |
ELECTRA | ~14 M parameters | 12-layer transformer with 256 hidden size. Uses replaced token detection (generator/discriminator setup) for more sample-efficient training than BERT. | google/electra-small-generator (https://huggingface.co/google/electra-small-generator) |
MiniLM | ~33 M parameters | 12-layer transformer with a smaller hidden size (384). Trained using knowledge distillation from larger models. Balances speed and performance well. | microsoft/MiniLM-L12-H384-uncased (https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) |
RoBERTa | ~4 M parameters | A tiny version of RoBERTa trained on the IMDB dataset. Consists of 2 transformer layers and 128 hidden size; designed for fast and lightweight tasks. | AntoineB/roberta-tiny-imdb (https://huggingface.co/AntoineB/roberta-tiny-imdb) |
Model Name | Configuration |
---|---|
Logistic Regression (LR) | max_iter = 2000, random_state = 42, class_weight = ‘balanced’ |
Random Forest (RF) | n_estimators = 100, random_state = 42, max_features = ‘sqrt’ |
Gradient Boosting (GB) | n_estimators = 150, learning_rate = 0.1, max_depth = 3 |
Support Vector Machine (SVM) | kernel = ‘linear’, C = 10, Gamma = 0.01, Probability = True, random_state = 42 |
K-Nearest Neighbors (KNN) | n_neighbors = 5 |
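The settings in the table above can be kept as keyword-argument dictionaries and unpacked into the corresponding scikit-learn constructors, e.g. LogisticRegression(**ML_CONFIGS["LR"]); a sketch, assuming scikit-learn's lowercase spellings (gamma, probability) for the SVM parameters:

```python
# Hyperparameters from the ML training configuration table,
# keyed by model abbreviation.
ML_CONFIGS = {
    "LR":  {"max_iter": 2000, "random_state": 42, "class_weight": "balanced"},
    "RF":  {"n_estimators": 100, "random_state": 42, "max_features": "sqrt"},
    "GB":  {"n_estimators": 150, "learning_rate": 0.1, "max_depth": 3},
    "SVM": {"kernel": "linear", "C": 10, "gamma": 0.01, "probability": True,
            "random_state": 42},
    "KNN": {"n_neighbors": 5},
}
```

Keeping the configurations in one structure makes the shared choices explicit: a fixed random_state = 42 for reproducibility, and class_weight = 'balanced' so LR compensates for the imbalanced datasets.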
Dataset | Configuration |
---|---|
Email Dataset | Num_train_epochs = 3, Weight_decay = 0.01, Learning_rate = 2 × 10⁻⁵, Batch_size = 8, Max_seq_length = 256 |
URL Dataset | Num_train_epochs = 2, Weight_decay = 0.01, Learning_rate = 2 × 10⁻⁵, Batch_size = 16, Max_seq_length = 128 |
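The two fine-tuning configurations map onto Hugging Face TrainingArguments fields (num_train_epochs, weight_decay, learning_rate, per_device_train_batch_size), while the sequence-length cap is applied at tokenization time; sketched here as plain dictionaries so the mapping is explicit:

```python
# Fine-tuning hyperparameters from the LLM training configuration table.
# max_seq_length is not a TrainingArguments field: it is passed to the
# tokenizer (truncation=True, max_length=...) when encoding the inputs.
LLM_CONFIGS = {
    "email": {"num_train_epochs": 3, "weight_decay": 0.01,
              "learning_rate": 2e-5, "batch_size": 8,  "max_seq_length": 256},
    "url":   {"num_train_epochs": 2, "weight_decay": 0.01,
              "learning_rate": 2e-5, "batch_size": 16, "max_seq_length": 128},
}
```

The URL configuration uses a shorter sequence length and a larger batch, reflecting that URLs are far shorter than email bodies, which also contributes to the training-time differences reported later.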
Imbalanced Email Dataset | |||||
---|---|---|---|---|---|
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
Decision Tree | 0.9884 | 0.9749 | 0.9850 | 0.9799 | 0.9874 |
Gradient Boosting | 0.9498 | 0.9667 | 0.8554 | 0.9077 | 0.9217 |
K-Nearest Neighbors | 0.9722 | 0.9581 | 0.9448 | 0.9514 | 0.9640 |
Logistic Regression | 0.9747 | 0.9802 | 0.9311 | 0.9550 | 0.9617 |
Naïve Bayes | 0.9521 | 0.9631 | 0.8672 | 0.9126 | 0.9269 |
Random Forest | 0.9947 | 0.9926 | 0.9890 | 0.9908 | 0.9930 |
Support Vector Machine | 0.9935 | 0.9842 | 0.9934 | 0.9888 | 0.9935 |
Balanced Email Dataset | |||||
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
Decision Tree | 0.9896 | 0.9848 | 0.9946 | 0.9897 | 0.9896 |
Gradient Boosting | 0.9545 | 0.9333 | 0.9875 | 0.9596 | 0.9545 |
K-Nearest Neighbors | 0.9755 | 0.9743 | 0.9769 | 0.9756 | 0.9755 |
Logistic Regression | 0.9846 | 0.9774 | 0.9921 | 0.9847 | 0.9846 |
Naïve Bayes | 0.9584 | 0.9823 | 0.9257 | 0.9531 | 0.9584 |
Random Forest | 0.9959 | 0.9943 | 0.9976 | 0.9959 | 0.9959 |
Support Vector Machine | 0.9954 | 0.9943 | 0.9964 | 0.9954 | 0.9954 |
Imbalanced URL Dataset | |||||
---|---|---|---|---|---|
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
Decision Tree | 0.9837 | 0.9606 | 0.9690 | 0.9648 | 0.9785 |
Gradient Boosting | 0.9738 | 0.9338 | 0.9537 | 0.9437 | 0.9667 |
K-Nearest Neighbors | 0.9761 | 0.9387 | 0.9591 | 0.9488 | 0.9702 |
Logistic Regression | 0.8147 | 0.5619 | 0.8921 | 0.6895 | 0.8418 |
Naïve Bayes | 0.8726 | 0.6817 | 0.8396 | 0.7525 | 0.8611 |
Random Forest | 0.9881 | 0.9668 | 0.9822 | 0.9745 | 0.9861 |
Support Vector Machine | 0.9601 | 0.8895 | 0.9442 | 0.9160 | 0.9545 |
Balanced URL Dataset | |||||
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
Decision Tree | 0.9843 | 0.9829 | 0.9857 | 0.9843 | 0.9843 |
Gradient Boosting | 0.9713 | 0.9662 | 0.9767 | 0.9741 | 0.9713 |
K-Nearest Neighbors | 0.9748 | 0.9692 | 0.9807 | 0.9749 | 0.9748 |
Logistic Regression | 0.8415 | 0.8111 | 0.8905 | 0.8489 | 0.8415 |
Naïve Bayes | 0.8835 | 0.8767 | 0.8925 | 0.8846 | 0.8835 |
Random Forest | 0.9984 | 0.9855 | 0.9915 | 0.9885 | 0.9884 |
Support Vector Machine | 0.9619 | 0.9605 | 0.9634 | 0.9620 | 0.9619 |
Imbalanced Email Dataset | |||||
---|---|---|---|---|---|
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
ALBERT | 0.9797 | 0.9621 | 0.9678 | 0.9650 | 0.9762 |
BERT-Tiny | 0.9203 | 0.8530 | 0.8719 | 0.8623 | 0.9055 |
DistilBERT | 0.9844 | 0.9771 | 0.9686 | 0.9728 | 0.9797 |
ELECTRA-Tiny | 0.9704 | 0.9632 | 0.9331 | 0.9480 | 0.9593 |
MiniLM | 0.9769 | 0.9750 | 0.9442 | 0.9594 | 0.9672 |
RoBERTa | 0.9743 | 0.9657 | 0.9446 | 0.9550 | 0.9655 |
Balanced Email Dataset | |||||
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
ALBERT | 0.9782 | 0.9775 | 0.9789 | 0.9782 | 0.9782 |
BERT-Tiny | 0.9198 | 0.9143 | 0.9264 | 0.9203 | 0.9198 |
DistilBERT | 0.9835 | 0.9846 | 0.9823 | 0.9835 | 0.9835 |
ELECTRA-Tiny | 0.9632 | 0.9614 | 0.9653 | 0.9653 | 0.9632 |
MiniLM | 0.9741 | 0.9743 | 0.9738 | 0.9740 | 0.9741 |
RoBERTa | 0.9708 | 0.9809 | 0.9604 | 0.9705 | 0.9708 |
Imbalanced URL Dataset | |||||
---|---|---|---|---|---|
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
ALBERT | 0.9974 | 0.9967 | 0.9922 | 0.9945 | 0.9956 |
BERT-Tiny | 0.9965 | 0.9916 | 0.9932 | 0.9924 | 0.9953 |
DistilBERT | 0.9979 | 0.9970 | 0.9938 | 0.9954 | 0.9965 |
ELECTRA-Tiny | 0.9975 | 0.9967 | 0.9926 | 0.9947 | 0.9958 |
MiniLM | 0.9975 | 0.9961 | 0.9933 | 0.9947 | 0.9960 |
RoBERTa | 0.9977 | 0.9977 | 0.9926 | 0.9952 | 0.9960 |
Balanced URL Dataset | |||||
Model Name | Accuracy | Precision | Recall | F1 Score | Balanced-Accuracy |
ALBERT | 0.9961 | 0.9964 | 0.9958 | 0.9961 | 0.9961 |
BERT-Tiny | 0.9947 | 0.9966 | 0.9927 | 0.9947 | 0.9947 |
DistilBERT | 0.9971 | 0.9979 | 0.9963 | 0.9971 | 0.9971 |
ELECTRA-Tiny | 0.9963 | 0.9967 | 0.9959 | 0.9963 | 0.9993 |
MiniLM | 0.9963 | 0.9981 | 0.9946 | 0.9963 | 0.9963
RoBERTa | 0.9968 | 0.9979 | 0.9957 | 0.9968 | 0.9968 |
Text | Actual Label | Predicted Label | Model |
---|---|---|---|
Grammatical Errors | |||
Dear user, ur accnt info is missing plz verify fast. | legitimate | phishing | DT, LG, RF, SVC, KNN |
HTML-format | |||
You are a winner, your phone is not among the <200> lucky winners’ code Call Michael JOHN on: <08167566152> for a claim | phishing | legitimate | LG, GB |
Multilingual Text | |||
Hello, Por favor, update your password to keep your account secure. Gracias. | phishing | legitimate | DT, NB, GB
Text | Actual Label | Predicted Label | Model |
---|---|---|---|
Common words | |||
Subject: Password Reset Request—University Portal Dear Student, We received your request to reset the password for your University Portal login account. To proceed, please click the link below to create a new password: https://example.com/university/reset-password?token=SIM-2025-EMAIL If you did not request this change, please ignore this email or contact IT Support immediately. Sincerely, University IT Support Team it-support@university.edu | legitimate | phishing | ALBERT, ELECTRA, MiniLM, RoBERTa
Punctuations Marks | |||
Welcome! Ready to code with fresh updates from Tech Insight? Doesn’t look right? Just click here! Courses, Tools, Tutorials… | legitimate | phishing | DistilBERT, ELECTRA, MiniLM
Formal or Lengthy Formats | |||
I saw your advertisement and I must say, the item looks exactly like what I’ve been looking for. The pictures are clear and the description is satisfactory. Please send me the exact current condition, any issues I should be aware of, and the final asking price. Regarding payment, I would prefer to use PayPal, as it is quick and secure for both of us. Once I make the payment, I will arrange for a private courier service to come to your location for pickup. They will handle everything signing any documents and collecting the item. There’s no need for you to worry about shipping or extra costs. Just let me know your PayPal email address so I can proceed immediately. If you’re not already using PayPal, you can easily register at www.paypal.com it takes a minute. Please also include your full name and pickup address in your reply so my courier can coordinate properly. Looking forward to your response. Kind regards, Derek Mason | phishing | legitimate | BERT-Tiny, RoBERTa |
Model Name | Training Time (hh:mm:ss) | |
---|---|---|
Imbalanced Set | Balanced Set | |
ML models | ||
LR, RF, SVM, DT, NB, GB, K-NN | 00:09:33 | 00:05:19
LLMs/Email Dataset | ||
ALBERT | 36:07:01 | 21:35:14 |
BERT-Tiny | 00:18:11 | 00:22:13 |
DistilBERT | 26:54:40 | 19:10:18 |
ELECTRA-Tiny | 07:41:02 | 14:38:43 |
MiniLM | 13:35:01 | 12:47:20 |
RoBERTa | 08:37:39 | 09:25:17 |
Total Time required | 3 d 21 h 13 m 34 s | 3 d 5 h 59 m 5 s |
LLMs/URL Dataset | ||
ALBERT | 36:35:29 | 18:32:08 |
BERT-Tiny | 00:14:58 | 00:48:46 |
DistilBERT | 10:04:18 | 14:56:19 |
ELECTRA-Tiny | 05:08:03 | 04:32:46 |
MiniLM | 22:49:26 | 06:42:32 |
RoBERTa | 06:31:53 | 07:17:41 |
Total Time required | 3 d 9 h 24 m 7 s | 2 d 4 h 50 m 12 s |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tarapiah, S.; Abbas, L.; Mardawi, O.; Atalla, S.; Himeur, Y.; Mansoor, W. Evaluating the Effectiveness of Large Language Models (LLMs) Versus Machine Learning (ML) in Identifying and Detecting Phishing Email Attempts. Algorithms 2025, 18, 599. https://doi.org/10.3390/a18100599