Hybridizing Explainable AI (XAI) for Intelligent Feature Extraction in Phishing Website Detection
Abstract
1. Introduction
- Hybrid XAI-driven feature selection: A unique integration of SHAP, LIME, PDP, and PI is employed to extract and rank the most relevant features, combining multiple interpretability perspectives to improve the robustness of the final feature set.
- Enhanced detection accuracy and transparency: By leveraging the strengths of multiple XAI methods alongside a Random Forest classifier, HXRF achieves high classification accuracy while maintaining model interpretability, enabling cybersecurity experts to understand the rationale behind each prediction.
- Flexible architecture: Although Random Forest is employed as the primary classifier, the framework allows the substitution of other ML or DL models without modifying the XAI-driven feature selection layer, making the system adaptable to future phishing detection advancements.
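
How the four XAI rankings are merged is not spelled out in this excerpt. As a rough sketch under that caveat, the snippet below shows three generic ways of combining per-method feature rankings: union, intersection, and a Borda-style rank aggregation. The feature names and rankings are hypothetical, and Borda counting is an illustrative stand-in, not necessarily the rule used by HXRF.

```python
from collections import defaultdict


def top_k(ranking, k):
    """Return the first k feature names of a ranking (best first)."""
    return ranking[:k]


def union_features(rankings, k):
    """Features that appear in the top-k list of at least one method."""
    selected = set()
    for ranking in rankings.values():
        selected.update(top_k(ranking, k))
    return selected


def intersect_features(rankings, k):
    """Features that appear in the top-k list of every method."""
    sets = [set(top_k(ranking, k)) for ranking in rankings.values()]
    return set.intersection(*sets)


def borda_ranking(rankings):
    """Borda-style aggregation: position i in a list of n earns n - i points."""
    scores = defaultdict(int)
    for ranking in rankings.values():
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical per-method rankings (best feature first), for illustration only.
rankings = {
    "SHAP": ["url_length", "domain_age", "num_dots", "has_https"],
    "LIME": ["domain_age", "url_length", "has_https", "num_dots"],
    "PDP": ["num_dots", "url_length", "domain_age", "has_https"],
    "PI": ["url_length", "num_dots", "domain_age", "has_https"],
}

union_set = union_features(rankings, k=2)
intersect_set = intersect_features(rankings, k=2)
consensus = borda_ranking(rankings)
```

Union tends to keep a large feature set, intersection can become very small when the methods disagree, and a rank aggregation sits between the two.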
2. Background
2.1. Features Utilized in Phishing Detection
- URL-Based Features
- HTML/DOM Features
- Content-Based Features
- Network-Based Features
- Visual Features
- Behavioral Features
- Hybrid Features
2.2. Explainable Artificial Intelligence (XAI) Algorithms
- Model specificity: whether the explanation is tied to a particular class of models, such as decision trees or linear models, or is model-agnostic and applicable to any black-box function.
- Scope of explanation: whether the explanation targets the global behavior of the model (the approximate functional form of f across the input space) or provides a local explanation of a single instance x by approximating f in a neighborhood around x. Many XAI approaches can be viewed as constructing an interpretable surrogate function g, such as a linear or rule-based function, that satisfies a fidelity constraint as in Equation (1).
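
To make the idea of a local surrogate concrete, the following minimal sketch fits a linear function g to a black-box function f in a neighborhood of one point, using kernel-weighted least squares, and reports the weighted squared error as a simple fidelity measure. The black-box function, the Gaussian kernel, and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def black_box(x):
    """Stand-in for an opaque model f; any one-dimensional callable would do."""
    return 1.0 / (1.0 + math.exp(-3.0 * x))


def local_linear_surrogate(f, x0, width=0.5, n=41):
    """Fit g(z) = a + b * z to f near x0 by kernel-weighted least squares."""
    # Evenly spaced perturbations around x0 (deterministic for reproducibility).
    zs = [x0 - 2 * width + 4 * width * i / (n - 1) for i in range(n)]
    ys = [f(z) for z in zs]
    # Gaussian proximity kernel: samples close to x0 receive larger weights.
    ws = [math.exp(-((z - x0) ** 2) / (2 * width ** 2)) for z in zs]
    total_w = sum(ws)
    z_mean = sum(w * z for w, z in zip(ws, zs)) / total_w
    y_mean = sum(w * y for w, y in zip(ws, ys)) / total_w
    cov = sum(w * (z - z_mean) * (y - y_mean) for w, z, y in zip(ws, zs, ys))
    var = sum(w * (z - z_mean) ** 2 for w, z in zip(ws, zs))
    b = cov / var
    a = y_mean - b * z_mean
    # Weighted mean squared error between f and g: lower means higher fidelity.
    loss = sum(w * (y - (a + b * z)) ** 2 for w, z, y in zip(ws, zs, ys)) / total_w
    return a, b, loss


a, b, loss = local_linear_surrogate(black_box, x0=0.0)
print(f"g(z) = {a:.3f} + {b:.3f} * z, weighted error = {loss:.5f}")
```

The slope b plays the role of a local feature attribution, and the weighted error indicates how far the surrogate can be trusted around x0.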
2.2.1. Model-Specific Interpretable Models
2.2.2. Local Interpretable Model-Agnostic Explanations (LIME)
2.2.3. SHapley Additive exPlanations (SHAP)
2.2.4. Feature Interaction and PDP/ALE
2.2.5. Permutation Feature Importance
2.3. ML and DL in Phishing Detection
| Ref. | Dataset Details | ML/DL | Accuracy (%) | Selected Features | Feature Selection Method(s) |
|---|---|---|---|---|---|
| [37] | Email dataset | BERT, LSTM | 99.61 | Textual features extracted using NLP techniques | None |
| [38] | URL dataset | LSTM, XGBoost | 96.04 | Character-level TF-IDF features | None |
| [27] | URL dataset | Random Forest | 96.83 | Hybrid features: URL-based and hyperlink-based features | Filter-based ranking and incremental removal of less important features |
| [28] | Mendeley dataset | Random Forest | 97.78 | 23 features selected from an original set of 48 features | Explainable feature selection framework |
| [40] | PhiUSIIL dataset | FCNN | 99.3 | 56 heuristic-based and statistical features | None |
| [41] | A new dataset | None | None | 111 combined features of all types | None |
| [42] | DARTH Email | ANN, XGBoost | 99.98 | Combined features from Email content and NLP | None |
| [33] | Email dataset | Reinforcement (RAIDER) | 94.00 | Reduced feature set through reinforcement learning-based feature evaluation | Reinforcement learning-based feature evaluation |
| [29] | Email dataset (English-Arabic) | Random Forest | 97.37 | Domain names, IP addresses, open ports | None |
| [35] | Website Phishing Detection | DNN, LSTM, CNN with Grid Search and Genetic Algorithm | 97.37 | Combined 48 features (Tan-dataset) | None |
| [39] | AntiPhishStack | LSTM, XGBoost | 96.05 | URL and TF-IDF with 30 features | None |
| [30] | Website phishing | Two models: XGBoost, RF, SVM, LR | 99.7 | 5 new features: Information (CN), Logo Domain (LD), Form Action Domain (FAD), Most Common Link in Domain (MCLD), and Cookie Domain | Ranking of the 5 features |
| [32] | Benchmark dataset with 14,000 website samples | Random Forest | 95.00 | Universal feature set selected using Fuzzy Rough Set theory | Fuzzy Rough Set feature selection |
| [43] | Turkish email dataset | Keras-based deep learning model | 93.97 | Textual and structural features | None |
| [31] | UCI ML phishing URL dataset | Random Forest | 99.99 | URL-based features | None |
| [36] | URL phishing detector | 1D CNN | 99.7 | Combined features from multi sources | None |
| [34] | Dataset with 11,449 samples | TCN | 99.8 | Title, copyright information, NER, login form detection, keyword-based retrieval | None |
| [13] | Website phishing | CNN | 95 | URL features | None |
| [44] | Website phishing | Multi-stacked model with multiple ML algorithms | 97 | 80 combined features | None |
| [45] | Website phishing | Two layers of multiple ML algorithms | 96.5 | 80 combined features | None |
3. The Hybrid XAI-Random Forest (HXRF) Model
4. Experiment
4.1. Phishing URL Collection
4.2. Feature Extraction for Collected URLs
- Google Indexing check (to evaluate whether the page is indexed by Google Search)
- Domain age and registration details via WHOIS lookup libraries
- URL and hostname lexical features (length, number of dots, entropy, etc.)
- SSL certificate attributes collected through TLS inspection
- JavaScript features, form analysis, and link statistics parsed directly from webpage HTML
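
Of the feature groups listed above, the lexical ones can be computed from the URL string alone. The sketch below illustrates a few of them (length, number of dots, entropy) using only the Python standard library. The function names, the exact feature set, and the example URL are assumptions for illustration, not the extraction code used in the study.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text):
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_features(url):
    """Compute a small, illustrative set of URL and hostname lexical features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "url_entropy": shannon_entropy(url),
    }


# Hypothetical URL used only to exercise the function.
features = lexical_features("http://login.example.com/verify?id=123")
```

Features such as domain age, SSL attributes, or indexing status need network lookups and are outside the scope of this sketch.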
4.3. Model Training Using Random Forest
Hardware Setup
- Intel Core i7 (12th Generation)
- NVIDIA GeForce RTX 5070 GPU
- 16 GB RAM
- Resistance to overfitting: Through the use of multiple decision trees and bootstrap sampling, RF generalizes well even with large feature sets.
- Interpretability and compatibility with XAI: RF integrates smoothly with SHAP, LIME, PDP, and PDI for explainability, making it ideal for XAI-guided feature selection.
- Feature importance analysis: RF naturally provides importance scores that facilitate downstream ranking and correlation with XAI outputs.
- Scalability: RF can effectively handle datasets with hundreds of thousands of samples without requiring GPU acceleration.
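
The bullets above mention permutation-based importance as one of the XAI signals paired with RF. As a minimal, dependency-free sketch of that technique (not the authors' pipeline), the code below measures how much accuracy drops when one feature column is shuffled. The toy model, the synthetic data, and the parameter names are hypothetical; in practice `predict` would wrap a trained classifier.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that uses feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)


# Synthetic data: the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
```

Shuffling the feature the model relies on lowers accuracy, while shuffling the ignored feature leaves it unchanged, which is the signal used to rank features.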
5. Results
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Masoud, M.; Jaradat, Y.; Manasrah, A.; Jannoud, I. Sensors of smart devices in the internet of everything (IoE) era: Big opportunities and massive doubts. J. Sens. 2019, 2019, 6514520.
- Torres-Hernandez, C.M.; Garduño-Aparicio, M.; Rodriguez-Resendiz, J. Smart homes: A meta-study on sense of security and home automation. Technologies 2025, 13, 320.
- Szpilko, D.; Fernando, X.; Nica, E.; Budna, K.; Rzepka, A.; Lăzăroiu, G. Energy in smart cities: Technological trends and prospects. Energies 2024, 17, 6439.
- Ayeni, R.K.; Adebiyi, A.A.; Okesola, J.O.; Igbekele, E. Phishing attacks and detection techniques: A systematic review. In Proceedings of the 2024 International Conference on Science, Engineering and Business for Driving Sustainable Development Goals (SEB4SDG), Omu-Aran, Nigeria, 2–4 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–17.
- Phishing Statistics. Available online: https://keepnetlabs.com/blog/top-phishing-statistics-and-trends-you-must-know (accessed on 15 November 2025).
- Annual Internet Crime Report. Available online: https://www.fbi.gov/news/press-releases/fbi-releases-annual-internet-crime-report (accessed on 15 November 2025).
- Rao, R.S.; Pais, A.R. An enhanced blacklist method to detect phishing websites. In Proceedings of the International Conference on Information Systems Security, Seoul, Republic of Korea, 10–13 December 2017; Springer International Publishing: Cham, Switzerland, 2017; pp. 323–333.
- Jabir, R.; Le, J.; Nguyen, C. Phishing attacks in the age of generative artificial intelligence: A systematic review of human factors. AI 2025, 6, 174.
- Butnaru, A.; Mylonas, A.; Pitropakis, N. Towards lightweight URL-based phishing detection. Future Internet 2021, 13, 154.
- Aguirre, A.; Salazar, L. A Systematic Review of Artificial Intelligence Techniques for Phishing Detection. Adv. Artif. Intell. Mach. Learn. 2025, 5, 4115–4153.
- Mughaid, A.; AlZu’bi, S.; Hnaif, A.; Taamneh, S.; Alnajjar, A.; Abu Elsoud, E. An intelligent cyber security phishing detection system using deep learning techniques. Clust. Comput. 2022, 25, 3819–3828.
- Do, N.Q.; Selamat, A.; Krejcar, O.; Herrera-Viedma, E.; Fujita, H. Deep learning for phishing detection: Taxonomy, current challenges and future directions. IEEE Access 2022, 10, 36429–36463.
- Aljofey, A.; Jiang, Q.; Qu, Q.; Huang, M.; Niyigena, J.-P. An effective phishing detection model based on character level convolutional neural network from URL. Electronics 2020, 9, 1514.
- Yoon, J.-H.; Buu, S.-J.; Kim, H.-J. Phishing webpage detection via multi-modal integration of HTML DOM graphs and URL features based on graph convolutional and transformer networks. Electronics 2024, 13, 3344.
- PhishTank, an Online Database for Suspected Online Links. Available online: https://www.phishtank.org (accessed on 20 September 2025).
- Ozker, U.; Sahingoz, O.K. Content based phishing detection with machine learning. In Proceedings of the 2020 International Conference on Electrical Engineering (ICEE), Istanbul, Turkey, 25–27 September 2020; IEEE: New York, NY, USA, 2020; pp. 1–6.
- Salloum, S.; Gaber, T.; Vadera, S.; Shaalan, K. A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access 2022, 10, 65703–65727.
- Zhu, E.; Chen, Y.; Ye, C.; Li, X.; Liu, F. OFS-NN: An effective phishing websites detection model based on optimal feature selection and neural network. IEEE Access 2019, 7, 73271–73284.
- Oest, A.; Safei, Y.; Doupe, A.; Ahn, G.J.; Wardman, B.; Warner, G. Inside a phisher's mind: Understanding the anti-phishing ecosystem through phishing kit analysis. In Proceedings of the 2018 APWG Symposium on Electronic Crime Research (eCrime), San Diego, CA, USA, 14–16 May 2018.
- Marchal, S.; Francois, J.; State, R.; Engel, T. PhishStorm: Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 2014, 11, 458–471.
- Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
- Molnar, C.; Casalicchio, G.; Bischl, B. Interpretable machine learning – A brief history, state-of-the-art and challenges. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Ghent, Belgium, 14–18 September 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 417–431.
- Bacevicius, M.; Paulauskaite-Taraseviciene, A.; Zokaityte, G.; Kersys, L.; Moleikaityte, A. Comparative analysis of perturbation techniques in LIME for intrusion detection enhancement. Mach. Learn. Knowl. Extr. 2025, 7, 21.
- Hussein, E.E.; Zerouali, B.; Bailek, N.; Derdour, A.; Ghoneim, S.S.M.; Santos, C.A.G.; Hashim, M.A. Harnessing explainable AI for sustainable agriculture: SHAP-based feature selection in multi-model evaluation of irrigation water quality indices. Water 2024, 17, 59.
- Georgiades, M.; Hussain, F. An Explainable AI Approach for Interpretable Cross-Layer Intrusion Detection in Internet of Medical Things. Electronics 2025, 14, 3218.
- Khan, A.; Ali, A.; Khan, J.; Ullah, F.; Faheem, M. Using Permutation-Based Feature Importance for Improved Machine Learning Model Performance at Reduced Costs. IEEE Access 2025, 13, 36421–36435.
- Guptta, S.D.; Soni, M.S.; Soni, S.S. Modeling Hybrid Feature-Based Phishing Websites Detection Using Machine Learning Techniques. Ann. Data Sci. 2022. Online ahead of print. Available online: https://pmc.ncbi.nlm.nih.gov/articles/PMC8935623/ (accessed on 15 November 2025).
- Shafin, S.S. An explainable feature selection framework for web phishing detection with machine learning. Data Sci. Manag. 2025, 8, 127–136.
- An, P.; Shafi, R.; Mughogho, T.; Onyango, O.A. Multilingual Email Phishing Attacks Detection using OSINT and Machine Learning. arXiv 2025, arXiv:2501.08723.
- Mishra, R.; Varshney, G. A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025. arXiv 2025, arXiv:2503.06487.
- Rehman, A.U.; Imtiaz, I.; Javaid, S.; Muslih, M. Real-Time Phishing URL Detection Using Machine Learning. Eng. Proc. 2025, 107, 108.
- Zabihimayvan, M.; Doran, D. Fuzzy rough set feature selection to enhance phishing attack detection. In Proceedings of the 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), New Orleans, LA, USA, 23–26 June 2019; IEEE: New York, NY, USA, 2019; pp. 1–6.
- Evans, K.; Abuadbba, A.; Wu, T.; Moore, K.; Ahmed, M.; Pogrebna, G.; Nepal, S.; Johnstone, M. RAIDER: Reinforcement-Aided Spear Phishing Detector. In Proceedings of the International Conference on Information Systems Security and Privacy (ICISSP), Virtual, 9–11 February 2022.
- Aljofey, A.; Bello, S.A.; Lu, J.; Xu, C. Comprehensive phishing detection: A multi-channel approach with variants TCN fusion leveraging URL and HTML features. J. Netw. Comput. Appl. 2025, 238, 104170.
- Almousa, M.; Zhang, T.; Sarrafzadeh, A.; Anwar, M. Phishing website detection: How effective are deep learning-based models and hyperparameter optimization? Secur. Priv. 2022, 5, e256.
- Haq, Q.E.U.; Faheem, M.H.; Ahmad, I. Detecting phishing URLs based on a deep learning approach to prevent cyber-attacks. Appl. Sci. 2024, 14, 10086.
- Atawneh, S.; Aljehani, H. Phishing Email Detection Model Using Deep Learning. Electronics 2023, 12, 4261.
- Murhej, M.; Nallasivan, G. Multimodal framework for phishing attack detection and mitigation through behavior analysis using EM-BERT and SPCA-BASED EAI-SC-LSTM. Front. Commun. Netw. 2025, 6, 1587654.
- Aslam, S.; Aslam, H.; Manzoor, A.; Chen, H.; Rasool, A. AntiPhishStack: LSTM-based stacked generalization model for optimized phishing URL detection. Symmetry 2024, 16, 248.
- Rawla, A.; Singh, S.; Daniyal, M.; Dubey, P. Detection of Phishing Attacks in PhiUSIIL Dataset Using Deep Learning. Procedia Comput. Sci. 2025, 259, 543–552.
- Vrbančič, G.; Fister, I.; Podgorelec, V. Datasets for Phishing Websites Detection. Data Brief 2020, 33, 106438.
- Mittal, A.; Engels, D.; Kommanapalli, H.; Sivaraman, R.; Chowdhury, T. Phishing Detection Using Natural Language Processing and Machine Learning. SMU Data Sci. Rev. 2022, 6, 14.
- Eryılmaz, E.E.; Şahin, D.Ö.; Kılıç, E. Filtering Turkish spam using LSTM from deep learning techniques. In Proceedings of the 2020 8th International Symposium on Digital Forensics and Security (ISDFS), Beirut, Lebanon, 1–2 June 2020; IEEE: New York, NY, USA, 2020; pp. 1–6.
- Masoud, M.; Jaradat, Y.; Alsakarnah, R. A Non-Content Multilayers Hybrid Machine Learning Web Phishing Detection Model. Int. Rev. Model. Simul. (IREMOS) 2022, 15, 108–115.
- Alheyasat, O. Web Phishing Detection and Awareness Utilizing Hybrid Machine Learning Algorithms. Int. J. Adv. Soft Comput. Its Appl. 2025, 17, 283–297.
- URLHaus. Available online: https://urlhaus.abuse.ch/api/ (accessed on 15 November 2025).
- Kaggle Website Phishing Detection. Available online: https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset (accessed on 15 November 2025).
- Hannousse, A.; Yahiouche, S. Web Page Phishing Detection, Mendeley Data, V3. 2021. Available online: https://data.mendeley.com/datasets/c2gw7fy2j4/3 (accessed on 15 November 2025).

| Model | Accuracy | F1 | Recall | Precision | ROC_AUC |
|---|---|---|---|---|---|
| LR | 0.945669 | 0.945536 | 0.943129 | 0.948018 | 0.985428 |
| RF | 0.967104 | 0.967097 | 0.966579 | 0.967670 | 0.994000 |
| SVM | 0.958705 | 0.958570 | 0.955381 | 0.961821 | 0.990171 |
| KNN | 0.940507 | 0.939810 | 0.929140 | 0.950904 | 0.975920 |
| XGBoost | 0.969291 | 0.969303 | 0.969726 | 0.968933 | 0.994733 |
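
For reference, the threshold-based metrics reported in the table above follow directly from confusion-matrix counts. The sketch below shows the standard definitions; the counts are hypothetical and unrelated to the reported results. ROC_AUC is omitted because it requires predicted scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 90 phishing pages caught, 15 missed,
# 10 legitimate pages flagged, 85 legitimate pages passed.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

In phishing detection, recall reflects how many phishing pages are caught, while precision reflects how many flagged pages are actually phishing.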

| Model A | Model B | p-Value |
|---|---|---|
| LR | RF | 2.199845 |
| LR | SVM | 2.590974 |
| LR | KNN | 6.381361 |
| LR | XGBoost | 8.996150 |
| RF | SVM | 4.165687 |
| RF | KNN | 5.770413 |
| RF | XGBoost | 1.709066 × 10⁻¹ |
| SVM | KNN | 3.488691 |
| SVM | XGBoost | 4.657034 |
| KNN | XGBoost | 1.817466 |

| Model | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|
| HXRF | 0.98204 | 0.984139 | 0.986754 | 0.981538 |
| RF_All_Features | 0.967104 | 0.967087 | 0.966579 | 0.967595 |
| Union + RF | 0.965442 | 0.965475 | 0.966404 | 0.964548 |
| Intersect + RF | 0.771216 | 0.756676 | 0.711461 | 0.808029 |
| SHAP + RF | 0.930359 | 0.930698 | 0.935258 | 0.926183 |
| LIME + RF | 0.923622 | 0.923535 | 0.922485 | 0.924588 |
| PDP + RF | 0.872966 | 0.878798 | 0.921085 | 0.840223 |
| PDI + RF | 0.920822 | 0.920691 | 0.919160 | 0.922226 |

| Model | Standard Deviation |
|---|---|
| RF_All_Features | 0.003818 |
| HXRF | 0.004155 |
| Union + RF | 0.005060 |
| Intersect + RF | 0.0132 |
| SHAP + RF | 0.00589 |
| LIME + RF | 0.00563 |
| PDP + RF | 0.01407 |
| PDI + RF | 0.00519 |

| XAI Method | Time (s) |
|---|---|
| SHAP | 0.001342 |
| LIME | 0.272947 |
| PDP | 0.73937 |
| PDI | 0.339744 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Alsakarnah, R.; Masoud, M.Z.; Ghababsheh, A. Hybridizing Explainable AI (XAI) for Intelligent Feature Extraction in Phishing Website Detection. Electronics 2026, 15, 350. https://doi.org/10.3390/electronics15020350

