Evaluation of AI Models for Phishing Detection Using Open Datasets

Aniyansyah, Nur; Rina, Rina; Puspitasari, Sarah; Erfina, Adhitia

doi:10.3390/engproc2025107037

Information System Department, Faculty of Engineering, Computing, and Design, Nusa Putra University, Sukabumi 43124, Indonesia

* Author to whom correspondence should be addressed.

† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.

Eng. Proc. 2025, 107(1), 37; https://doi.org/10.3390/engproc2025107037

Published: 28 August 2025

(This article belongs to the Proceedings of The 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society)

Abstract

Phishing is a form of cyber-attack that aims to steal sensitive information by impersonating a trusted entity. To counter this threat, various artificial intelligence (AI) methods have been developed to improve the effectiveness of phishing detection. This study evaluates three machine learning models, namely Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM), using an open dataset containing phishing and non-phishing URLs. The research process includes data preprocessing stages such as cleaning, normalization, categorical feature encoding, and feature selection, followed by division of the dataset into training and test data. The trained models are then evaluated using accuracy, precision, recall, and F1-score, and their scores are compared to determine the best model for phishing classification. The evaluation results show that Random Forest performs best, with an accuracy of 98.64%, compared with 98.37% for Decision Tree and 92.76% for SVM. Decision Tree has advantages in speed and interpretability but is susceptible to overfitting. SVM performs well on high-dimensional datasets but is less efficient in computing time. Based on these results, Random Forest is recommended as the best-performing model for machine learning-based phishing detection.

Keywords: phishing; machine learning; decision tree; random forest; support vector machine

1. Introduction

In this increasingly digital era, cyber-attacks are becoming a serious threat, especially phishing attacks targeting individuals and organizations. Phishing is a fraudulent technique used to steal sensitive information, such as login credentials, financial data, and personal information by posing as a trusted entity. The increasing number of phishing attacks has driven the need to develop an effective and accurate detection system to identify these threats early.

Artificial intelligence (AI) has become a widely used solution in phishing detection, especially through machine learning methods [1]. By utilizing open datasets, various AI models can be trained to recognize phishing patterns and characteristics based on predetermined features. Some common algorithms used in phishing detection are Decision Tree, Random Forest, and Support Vector Machine (SVM).

Decision Tree is a tree-structured model that recursively divides the data into smaller subsets based on the features that are most significant for the classification. Its advantages lie in high interpretability and fast prediction. However, this model is prone to overfitting if it is not pruned.

Random Forest extends Decision Tree by combining many decision trees that work together to increase accuracy and reduce the risk of overfitting. This model is more stable than a single Decision Tree and is able to handle complex datasets.

SVM (Support Vector Machine) is a model that works by finding the optimal hyperplane that separates the phishing and non-phishing classes in the feature space. This model is effective in handling high-dimensional data and generalizes well, but it can require more computation time, especially on large datasets.

An evaluation of the Decision Tree, Random Forest, and SVM models needs to be performed to determine the most effective algorithm in detecting phishing based on open datasets. By comparing the performance of these three models using evaluation metrics such as accuracy, precision, recall, and F1-score [2,3], this study aims to identify the best model that can be used in a reliable and efficient phishing detection system.

2. Literature Review

Previous Research

Previous studies that have examined phishing detection with machine learning include the following:

Amani Alswailem, Bashayr Alabdullah, Norah Alrumayh, and Aram Alsedrani [4], in Detecting Phishing Websites Using Filter Techniques on Machine Learning Models, reported an accuracy of 60.4% for the Naïve Bayes method, 94.4% for the Decision Tree method, and 96.3% for the Random Forest method. They concluded that Random Forest, with its accuracy of 96.3%, was the most effective of these methods for detecting phishing websites.
Sowmya Jagadeesan, Sameer, Devender Singh, Ritika Ojha, Read Khalid Ibrahim, and Malik Bader Alazzam [5], in Implementation of Artificial Intelligence-Based Cyber Security System to Overcome Phishing Attacks, describe a system that uses machine learning algorithms such as Support Vector Machine (SVM), Random Forest, and Neural Networks to detect phishing emails and websites. The data include phishing emails and URLs collected from various sources. Preprocessing involves cleaning, feature extraction, and normalization before the AI model is trained to recognize phishing patterns. Performance was evaluated using metrics such as precision, recall, and F1-score. The results show that the AI-based system achieves 97% accuracy and an F1-score of around 96%, indicating a high capability for detecting phishing attacks. The system provides a proactive solution that reduces false positives and false negatives, thereby increasing data and information security.
Shraddha Parekh, Dhwanil Parikh, Srushti Kotak, and Smita Sankhe [6], in Detection of Phishing Websites From URL Analysis Using the Random Forest Algorithm, use the Random Forest algorithm to detect phishing websites and add detection features that are integrated into websites that discuss phishing. Random Forest was chosen because of its ability to process a large number of detection features. Using 30 detection features, the system achieved a prediction rate of 96%, recall of 92%, accuracy of 94%, and F1-score of 93%. These results indicate that the proposed method detects phishing attacks with a high level of accuracy, which makes it a useful tool for protecting users from cyber threats.

3. Research Methodology

A. Dataset Explanation

The dataset used in this study was obtained from open sources containing samples of phishing and non-phishing URLs. This dataset includes various features that reflect the characteristics of the URL, such as URL length, use of special symbols, presence of suspicious keywords, and SSL certificate information. In addition, the dataset also includes additional metadata such as domain creation time and hosting information that can contribute to phishing detection. This data will be used to train and test machine learning models.
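As a rough illustration of how URL-based features of this kind might be derived, the sketch below computes a few lexical features from a raw URL string. It is not the feature extraction used for the dataset in this study: the keyword list, the symbol set, and the feature names are hypothetical, and the HTTPS flag is only a crude stand-in for SSL certificate information. Metadata such as domain creation time would require external lookups and is left out.

```python
from urllib.parse import urlparse

# Hypothetical keyword list and symbol set, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")
SPECIAL_SYMBOLS = "@-_?=&%"


def extract_url_features(url: str) -> dict:
    """Derive a few simple lexical features from a URL string."""
    parsed = urlparse(url)
    lowered = url.lower()
    return {
        "url_length": len(url),
        "special_symbol_count": sum(url.count(s) for s in SPECIAL_SYMBOLS),
        "has_suspicious_keyword": int(any(k in lowered for k in SUSPICIOUS_KEYWORDS)),
        # Crude proxy for SSL information: only checks the URL scheme.
        "uses_https": int(parsed.scheme == "https"),
        # Number of dots in the host beyond the one before the top-level domain.
        "subdomain_count": max(parsed.netloc.count(".") - 1, 0),
    }


example_url = "http://secure-login.example.com/verify?id=1"
features = extract_url_features(example_url)
```

Each URL then becomes one row of numeric values that a classifier can consume.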

B. Data Preprocessing

Before being used in model training, the data will go through several preprocessing stages [7,8]:

Data Cleansing
Remove duplicate records, handle missing data, and ensure that all relevant features are available.
Feature Normalization
Convert the data to a uniform scale to improve model performance, especially for algorithms such as SVM that are sensitive to feature scale.
Categorical Encoding
Convert categorical features into numeric format, using methods such as one-hot encoding or label encoding, so that they can be processed by machine learning models.
Dataset Splitting
Divide the dataset into training data and test data with a fixed ratio (e.g., 80:20) so that the models are evaluated on data they were not trained on.
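The preprocessing stages above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the helper names, the toy records, and the choice of min-max scaling and one-hot encoding are all assumptions, and missing-value handling is omitted for brevity.

```python
import random


def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def min_max_scale(rows, column):
    """Rescale one numeric column to the [0, 1] range."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - low) / span} for row in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records; the column names and values are made up for illustration.
raw = [
    {"url_length": 20, "tld": "com", "label": 1},
    {"url_length": 20, "tld": "com", "label": 1},  # exact duplicate
    {"url_length": 80, "tld": "xyz", "label": -1},
    {"url_length": 50, "tld": "org", "label": 1},
    {"url_length": 65, "tld": "xyz", "label": -1},
    {"url_length": 35, "tld": "com", "label": 1},
]

clean = drop_duplicates(raw)
scaled = min_max_scale(clean, "url_length")
encoded = one_hot(scaled, "tld")
train, test = split_train_test(encoded, test_ratio=0.2)
```

In a real pipeline the scaling parameters would usually be estimated on the training part only and then applied to the test part, so that no information from the test data influences training.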

C. AI Models Used

Three machine learning models will be used in this study:

Decision Tree
A tree-structured model that segments the data based on the most significant features. Its advantages are high interpretability and fast execution.
Random Forest
An ensemble model that combines multiple decision trees to improve stability and accuracy and that is more resistant to overfitting.
Support Vector Machine (SVM)
A model that searches for the optimal hyperplane to separate the phishing and non-phishing classes and that performs well especially on high-dimensional data.
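To make the difference between the three model families concrete, the sketch below shows only their prediction rules: a one-level decision tree (a "stump"), a majority vote over several stumps, and the sign of a linear hyperplane score. All thresholds and weights are hand-picked, hypothetical values; real models learn them from training data, and a real Random Forest additionally trains each tree on a bootstrap sample with random feature subsets, which this sketch does not attempt.

```python
def stump_predict(x, feature, threshold):
    """A one-level decision tree: +1 if the feature exceeds the threshold, else -1."""
    return 1 if x[feature] > threshold else -1


def forest_predict(x, stumps):
    """Majority vote over several stumps, mimicking how an ensemble combines trees."""
    votes = sum(stump_predict(x, f, t) for f, t in stumps)
    return 1 if votes > 0 else -1


def linear_svm_predict(x, weights, bias):
    """Report the side of the hyperplane w.x + b = 0 on which the sample lies."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical, hand-picked parameters; a trained model would learn them.
sample = [0.9, 0.2, 0.7]
stumps = [(0, 0.5), (1, 0.5), (2, 0.5)]  # (feature index, threshold); odd count avoids ties
weights, bias = [1.0, -1.0, 0.5], -0.4

single_tree_label = stump_predict(sample, feature=1, threshold=0.5)
forest_label = forest_predict(sample, stumps)
svm_label = linear_svm_predict(sample, weights, bias)
```

Here the single stump on feature 1 and the three-stump vote disagree, which illustrates how an ensemble can outvote one tree's error.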

D. Model Evaluation

Once the model is trained and tested, evaluation will be performed using several key metrics [9,10,11]:

Accuracy
Measures the percentage of correct predictions out of all predictions.
Precision
Measures the proportion of samples flagged as phishing that are actually phishing, that is, how well the model avoids false positives.
Recall
Measures the proportion of actual phishing cases that the model detects, that is, how well it avoids false negatives.
F1-Score
Combines precision and recall into a single metric (their harmonic mean) to provide a balanced picture of model performance.
Confusion Matrix
Used to further analyze classification errors by counting each combination of actual and predicted class.
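For reference, the standard definitions of these metrics in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are given below. The symbols are the conventional ones and are assumed here to match the FP and FN counts reported in the results.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```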

4. Model Evaluation Results and Performance Analysis

A. Model Evaluation Results

(1) Decision Tree

The Decision Tree model achieves an accuracy of 98.37%. Table 1 lists its evaluation metrics.

Confusion matrix (Decision Tree):

[1924   35]
[  37 2426]

False positives (FP): 35
False negatives (FN): 37

The confusion matrix shows that this model produces 35 false positives (FP) and 37 false negatives (FN), which indicates that it performs quite well. The remaining algorithms are evaluated to see whether they can improve accuracy and further reduce the FP and FN counts.

(2) Random Forest

The Random Forest model performs better than Decision Tree, with an accuracy of 98.64%. Table 2 lists its evaluation metrics.

Confusion matrix (Random Forest):

[1920   39]
[  21 2442]

False positives (FP): 39
False negatives (FN): 21

The confusion matrix shows that this model produces 39 false positives (FP) and 21 false negatives (FN), a total of 60 misclassifications compared with 72 for Decision Tree. Random Forest is superior because it is an ensemble method that combines several decision trees to improve generalization and reduce overfitting. These results indicate that the Random Forest model is more stable and accurate than Decision Tree.

(3) Support Vector Machine (SVM)

The SVM model has a lower accuracy than Decision Tree and Random Forest, at 92.76%. Table 3 lists its evaluation metrics.

Confusion matrix (SVM):

[1765  194]
[ 126 2337]

False positives (FP): 194
False negatives (FN): 126

The confusion matrix shows that this model produces 194 false positives (FP) and 126 false negatives (FN), a higher number of misclassifications than either of the other models. The lower performance may be because the model has difficulty handling complex data and overlap between the classes. The model still provides relatively good results, but it is inferior to the decision tree-based models, in particular Random Forest.
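As a consistency check, the per-class metrics can be recomputed directly from the SVM confusion matrix. The sketch below assumes that rows hold the actual class and columns the predicted class, in the order (−1, 1); the function name is hypothetical.

```python
def per_class_report(matrix, labels):
    """Compute precision, recall, and F1 per class from a square confusion matrix.

    Rows are assumed to hold the actual class and columns the predicted class.
    """
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total (support)
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = {
            "precision": round(precision, 2),
            "recall": round(recall, 2),
            "f1": round(f1, 2),
            "support": actual,
        }
    return report


svm_matrix = [[1765, 194],
              [126, 2337]]
svm_report = per_class_report(svm_matrix, labels=[-1, 1])
svm_accuracy = round(100 * (1765 + 2337) / sum(map(sum, svm_matrix)), 2)
```

Under that assumption, the computed precision, recall, and F1 values agree with those reported in Table 3, and the accuracy comes out at 92.76%.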

B. Model Performance Analysis

Based on the evaluation results, the comparison of the accuracy of the three models is as shown by Table 4:

As Table 4 shows, Random Forest is the best model because it has the highest accuracy and the fewest misclassifications. It is better at handling variability in the data and at reducing the risk of overfitting, which is the main weakness of the Decision Tree model.
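The ranking can be reproduced from the error counts alone. The sketch below takes the FP and FN values from Table 4 and the test-set size of 4422 from Table 1, then recomputes the number of misclassifications and the accuracy for each model; the variable and function names are illustrative.

```python
TEST_SET_SIZE = 4422  # total support reported in Table 1

# False positives and false negatives per model, as listed in Table 4.
errors = {
    "Decision Tree": (35, 37),
    "Random Forest": (39, 21),
    "SVM": (194, 126),
}


def summarize(errors, total):
    """Return (model, misclassified, accuracy %) tuples, best accuracy first."""
    rows = []
    for model, (fp, fn) in errors.items():
        wrong = fp + fn
        accuracy = round(100 * (total - wrong) / total, 2)
        rows.append((model, wrong, accuracy))
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranking = summarize(errors, TEST_SET_SIZE)
for model, wrong, accuracy in ranking:
    print(f"{model}: {wrong} misclassified, accuracy {accuracy}%")
```

The recomputed accuracies (98.64%, 98.37%, and 92.76%) match the reported figures, with 60, 72, and 320 misclassified samples respectively.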

Decision Tree still performs well, with an accuracy close to that of Random Forest, but because it relies on a single tree it is more susceptible to overfitting than ensemble methods such as Random Forest.

Meanwhile, SVM performs worse than the other two models, with a much higher number of false positives and false negatives. This indicates that the model has difficulty distinguishing the classes, so it is not recommended for this classification scenario.

Based on the evaluation results, Random Forest is the best model in this study because it has the highest accuracy and the fewest classification errors. It is therefore recommended for classifying the data tested in this study. The Decision Tree model can serve as an alternative when a simpler model with relatively good performance is needed. The SVM model is less recommended because it has a higher classification error rate than the other models.

5. Conclusions

This study found that Random Forest performs best among the evaluated models, with an accuracy of 98.64%. It also has a lower classification error rate than the other models, which makes it more reliable in making predictions. The Decision Tree model performed quite well, with an accuracy of 98.37%. Although its accuracy is slightly lower than that of Random Forest, this model is simpler and easier to interpret, so it remains a viable alternative in some scenarios. The Support Vector Machine (SVM) model performed worse than the two decision tree-based models, with an accuracy of 92.76%. Its higher misclassification rate makes it less recommended in the context of this study. Overall, this study confirms that decision tree-based methods, especially Random Forest, are the best choice for classifying the data in this study.

Author Contributions

Conceptualization, A.E.; methodology, A.E.; software, S.P.; validation, A.E.; formal analysis, S.P. and N.A.; investigation, R.R.; resources, R.R. and N.A.; data curation, S.P. and N.A.; writing—original draft preparation, N.A.; writing—review and editing, R.R.; visualization, S.P.; supervision, A.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Manguma, T.T.F.; Fatra, E. Performance Analysis of Classification Algorithms for Spam Detection in Email. Innov. J. Soc. Sci. Res. 2024, 4, 16461–16465.
2. Windarni, V.A.; Nugraha, A.F.; Ramadhani, S.T.A.; Istiqomah, D.A.; Puri, F.M.; Setiawan, A. Phishing website detection using filter technique on machine learning model. Inf. Syst. J. (INFOS) 2023, 6, 39–43.
3. Fatiha, M.R.; Setiawan, I.; Ikhsan, A.N.; Yunita, I.R. Optimization of web-based phishing detection system using decision tree algorithm. IT CIDA Sci. J. Inf. Technol. Dissem. 2024, 10, 97–108. Available online: https://www.kaggle.com (accessed on 13 June 2025).
4. Alswailem, A.; Alabdullah, B.; Alrumayh, N.; Alsedrani, A. Detecting Phishing Websites Using Machine Learning. In Proceedings of the 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 1–3 May 2019; pp. 1–6.
5. Jagadeesan, S.; Sameer; Singh, D.; Ojha, R.; Ibrahim, R.K.; Alazzam, M.B. Implementation of an Artificial Intelligence with Cyber Security in E-Learning-Based Education Management System. In Proceedings of the 2023 4th International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 12–13 December 2023; pp. 1–5.
6. Parekh, S.; Parikh, D.; Kotak, S.; Sankhe, S. A New Method for Detection of Phishing Websites: URL Detection. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; pp. 949–952.
7. Nugraha, A.F.; Faticha, R.; Aziza, A.; Pristyanto, Y. Application of Stacking and Random Forest Methods to Improve Classification Performance in the Phishing Web Detection Process. Infomedia 2022, 7, 1.
8. Harahap, A.D.; Juardi, D.; Irawan, A.S.Y. Design of phishing link detection system using web-based random forest algorithm. J. Inform. Appl. Electr. Eng. 2024, 12, 2677–2686.
9. Mutmainnah, S.; Lorosae, T.A.; Ramadhan, S. Text Embedding and TF-IDF+Ngram Models to Improve the Performance of Binary Classifier Algorithms in Fake SMS Classification. J. Sist. Inf. (JSI) 2025, 4, 55–64. Available online: https://ojs.trigunadharma.ac.id/index.php/jsi (accessed on 19 June 2025).
10. Raihan, A.; Fadhli, M. Implementation of deep learning for detecting phishing attacks on websites with combination of cnn and lstm. J. Inf. Eng. (JUTIF) 2024, 5, 1451–1459.
11. Vebriani, M.; Yustanti, W. Classification of DANA Kaget Phishing Link Detection Using Website-Based Support Vector Machine Method. J. Inform. Comput. Sci. 2024, 6, 408–416.

Table 1. Decision Tree model accuracy.

Class	Precision	Recall	F1-Score	Support
−1	0.98	0.98	0.98	1959
1	0.99	0.98	0.98	2463
Accuracy	98.37%			4422

Table 2. Random Forest model accuracy.

Class	Precision	Recall	F1-Score	Support
−1	0.98	0.98	0.98	1959
1	0.99	0.98	0.98	2463
Accuracy	98.64%			4422

Table 3. Accuracy of Support Vector Machine (SVM) Model.

Class	Precision	Recall	F1-Score	Support
−1	0.93	0.90	0.92	1959
1	0.92	0.95	0.94	2463
Accuracy	92.76%			4422

Table 4. Comparison of the three models.

Model	Accuracy	FP	FN
Decision Tree	98.37%	35	37
Random Forest	98.64%	39	21
SVM	92.76%	194	126

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
