Proceeding Paper

Evaluation of AI Models for Phishing Detection Using Open Datasets †

Information System Department, Faculty of Engineering, Computing, and Design, Nusa Putra University, Sukabumi 43124, Indonesia
*
Author to whom correspondence should be addressed.
Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Eng. Proc. 2025, 107(1), 37; https://doi.org/10.3390/engproc2025107037 (registering DOI)
Published: 28 August 2025

Abstract

Phishing is a form of cyber-attack that aims to steal sensitive information by impersonating a trusted entity. To counter this threat, various artificial intelligence (AI) methods have been developed to improve the effectiveness of phishing detection. This study evaluates three machine learning models, namely Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM), using an open dataset containing phishing and non-phishing URLs. The research process includes data preprocessing stages such as cleaning, normalization, categorical feature encoding, feature selection, and splitting the dataset into training and test data. The trained models are then evaluated using accuracy, precision, recall, F1-score, and the confusion matrix to determine the best model for phishing classification. The evaluation results show that Random Forest performs best, with an accuracy of 98.64%, compared with 98.37% for Decision Tree and 92.76% for SVM. Decision Tree has advantages in speed and interpretability but is susceptible to overfitting. SVM performs well on high-dimensional datasets but is less efficient in computing time. Based on these results, Random Forest is recommended as the most suitable model for machine learning-based phishing detection.

1. Introduction

In this increasingly digital era, cyber-attacks are becoming a serious threat, especially phishing attacks targeting individuals and organizations. Phishing is a fraudulent technique used to steal sensitive information, such as login credentials, financial data, and personal information by posing as a trusted entity. The increasing number of phishing attacks has driven the need to develop an effective and accurate detection system to identify these threats early.
Artificial intelligence (AI) has become a widely used solution in phishing detection, especially through machine learning methods [1]. By utilizing open datasets, various AI models can be trained to recognize phishing patterns and characteristics based on predetermined features. Some common algorithms used in phishing detection are Decision Tree, Random Forest, and Support Vector Machine (SVM).
Decision Tree is a tree-structured model that recursively divides data into smaller subsets based on the most significant features for the classification. Its advantages lie in high interpretability and speed in making predictions. However, this model is prone to overfitting if not pruned.
Random Forest extends Decision Tree by combining many decision trees that work together to increase accuracy and reduce the risk of overfitting. This model is more stable than a single Decision Tree and is able to handle complex datasets.
Support Vector Machine (SVM) is a model that works by finding the optimal hyperplane that separates the phishing and non-phishing classes in the feature space. This model is effective in handling high-dimensional data and generalizes well, but it can require more computation time, especially on large datasets.
An evaluation of the Decision Tree, Random Forest, and SVM models needs to be performed to determine the most effective algorithm in detecting phishing based on open datasets. By comparing the performance of these three models using evaluation metrics such as accuracy, precision, recall, and F1-score [2,3], this study aims to identify the best model that can be used in a reliable and efficient phishing detection system.

2. Literature Review

Previous Research

Previous studies that have examined phishing detection with machine learning include the following:
  • Amani Alswailem, Bashayr Alabdullah, Norah Alrumayh, and Aram Alsedrani [4], in a study titled Detecting Phishing Websites Using Filter Techniques on Machine Learning Models, found that the Naïve Bayes method has an accuracy of 60.4%, the Decision Tree method 94.4%, and the Random Forest method 96.3%. They concluded that Random Forest, with the highest accuracy of the three, is the most effective method for detecting phishing websites.
  • Sowmya Jagadeesan, Sameer, Devender Singh, Ritika Ojha, Read Khalid Ibrahim, and Malik Bader Alazzam [5], in a study titled Implementation of Artificial Intelligence-Based Cyber Security System to Overcome Phishing Attacks, describe a system that uses machine learning algorithms such as Support Vector Machine (SVM), Random Forest, and Neural Networks to detect phishing emails and websites. The data used includes phishing emails and URLs collected from various sources. Data preprocessing involves cleaning, feature extraction, and normalization before training the AI model to recognize phishing patterns. Performance was evaluated using metrics such as precision, recall, and F1-score. The results show that the AI-based system achieves 97% accuracy and an F1-score of around 96%, indicating high capability in detecting phishing attacks. The system provides a proactive solution that reduces false positives and false negatives, thereby increasing data and information security.
  • Shraddha Parekh, Dhwanil Parikh, Srushti Kotak, and Smita Sankhe [6], in a study titled Detection of Phishing Websites From URL Analysis Using the Random Forest Algorithm, use the Random Forest algorithm to detect phishing websites and add detection features that are integrated into websites that discuss phishing. Random Forest was chosen for its ability to process a large number of detection features. Using 30 detection features, the system achieved a precision of 96%, recall of 92%, accuracy of 94%, and F1-score of 93%. These results indicate that the proposed method detects phishing attacks with a high level of accuracy, making it a useful tool for protecting users from cyber threats.

3. Research Methodology

A. Dataset Explanation
The dataset used in this study was obtained from open sources containing samples of phishing and non-phishing URLs. This dataset includes various features that reflect the characteristics of the URL, such as URL length, use of special symbols, presence of suspicious keywords, and SSL certificate information. In addition, the dataset also includes additional metadata such as domain creation time and hosting information that can contribute to phishing detection. This data will be used to train and test machine learning models.
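As a rough illustration of how URL characteristics of this kind can be turned into numeric features, the following minimal sketch derives a few lexical features from a URL string. The feature names, the keyword list, and the symbol set are hypothetical choices made for this example and are not taken from the dataset used in the study; the HTTPS check is only a crude stand-in for SSL certificate information.

```python
from urllib.parse import urlparse

# Hypothetical keyword list and symbol set, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure", "account")
SPECIAL_SYMBOLS = "@-_?=&%"


def extract_url_features(url: str) -> dict:
    """Return a few lexical features of a URL as a flat dictionary."""
    parsed = urlparse(url)
    lowered = url.lower()
    return {
        "url_length": len(url),
        "num_special_symbols": sum(url.count(s) for s in SPECIAL_SYMBOLS),
        "num_dots_in_host": parsed.netloc.count("."),
        "has_suspicious_keyword": int(any(k in lowered for k in SUSPICIOUS_KEYWORDS)),
        # Crude proxy only: a real pipeline would inspect the certificate itself.
        "uses_https": int(parsed.scheme == "https"),
    }


features = extract_url_features("http://secure-login.example.com/verify?id=1")
print(features)
```

Metadata such as domain creation time or hosting information cannot be derived from the string alone and would need an external lookup, which this sketch leaves out.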
B. Data Preprocessing
Before being used in model training, the data goes through several preprocessing stages [7,8]:
  • Data Cleansing
    Remove duplicates, handle missing data, and ensure that all relevant features are available.
  • Feature Normalization
    Converting data to a uniform scale improves model performance, especially for algorithms such as SVM that are sensitive to data scale.
  • Categorical Encoding
    Categorical features are converted into numeric format, using methods such as one-hot encoding or label encoding, so that they can be processed by machine learning models.
  • Dataset Splitting
    The dataset is divided into training data and test data with a certain ratio (e.g., 80:20) so that the model is evaluated on data it has not seen during training.
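The preprocessing stages above can be sketched in a few lines of dependency-free Python. This is an illustrative outline under stated assumptions, not the pipeline used in the study: the toy rows, the column layout (URL length, top-level domain, label), the min-max scaling, the label encoding, and the fixed random seed are all choices made for this example.

```python
import random


def drop_duplicates_and_missing(rows):
    """Keep the first copy of each row and skip rows containing None."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row)
        if None in key or key in seen:
            continue
        seen.add(key)
        cleaned.append(list(row))
    return cleaned


def min_max_scale(column):
    """Rescale numeric values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def label_encode(column):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split them, e.g. 80:20."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows (url_length, tld, label); the values are invented for illustration.
raw = [
    (20, "com", 1), (75, "xyz", -1), (20, "com", 1),  # third row is a duplicate
    (40, "org", 1), (None, "com", -1), (95, "xyz", -1),
    (30, "com", 1),
]
rows = drop_duplicates_and_missing(raw)
lengths = min_max_scale([r[0] for r in rows])
tlds, tld_mapping = label_encode([r[1] for r in rows])
prepared = [[length, tld, r[2]] for length, tld, r in zip(lengths, tlds, rows)]
train, test = train_test_split(prepared, test_ratio=0.2)
print(len(train), len(test))
```

For brevity the sketch scales the whole column before splitting. In practice the scaling parameters are usually computed on the training split only and then applied to the test split, so that no information from the test data leaks into training.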
C. AI Models Used
Three machine learning models are used in this study:
  • Decision Tree
    A tree-structured model that segments data based on the most significant features. Its advantages are high interpretability and fast execution.
  • Random Forest
    An ensemble model consisting of multiple decision trees, which improves stability and accuracy and is more resistant to overfitting.
  • Support Vector Machine (SVM)
    A model that searches for the optimal hyperplane to separate phishing and non-phishing classes, with high performance especially on high-dimensional data.
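To make the relationship between a single tree and an ensemble concrete, the sketch below trains a one-level decision tree (a "stump") with a Gini-impurity split, then builds a small bagged ensemble that predicts by majority vote. It is a deliberately simplified stand-in: a full Random Forest grows deeper trees and also samples a random subset of features at each split, and SVM is not covered here. The toy data, function names, and parameters are assumptions made for this example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y):
    """Find the single (feature, threshold) split with the lowest weighted Gini."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    """Follow the single split and return the label of the matching side."""
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


def fit_forest(X, y, n_trees=15, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Toy data: two scaled features per URL; labels 1 and -1 as in the result tables.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.9], [0.9, 0.7]]
y = [1, 1, 1, -1, -1, -1]

stump = fit_stump(X, y)
forest = fit_forest(X, y)
print(predict_stump(stump, [0.85, 0.8]), predict_forest(forest, [0.05, 0.1]))
```

Because every stump sees a different bootstrap sample, individual stumps may choose slightly different thresholds; the vote averages out those differences, which is the intuition behind the ensemble's greater stability.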
D. Model Evaluation
Once the models are trained and tested, they are evaluated using several key metrics [9,10,11]:
  • Accuracy
    Measures the percentage of correct predictions out of all predictions.
  • Precision
    Measures the proportion of positive predictions that are correct, that is, the extent to which the model avoids false positives.
  • Recall
    Measures the proportion of actual phishing cases that the model detects.
  • F1-Score
    Combines precision and recall, as their harmonic mean, into one metric to provide a balanced picture of model performance.
  • Confusion Matrix
    Used to further analyze classification errors.
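For reference, the metrics listed above have standard definitions in terms of the confusion-matrix counts. Writing TP, TN, FP, and FN for the numbers of true positives, true negatives, false positives, and false negatives, they can be stated as follows (these are the conventional textbook definitions, given here as a reminder):

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
```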

4. Model Evaluation Results and Performance Analysis

A. Model Evaluation Results
(1) Decision Tree
The Decision Tree model achieves an accuracy of 98.37%, with the evaluation metrics shown in Table 1.
Confusion matrix, Decision Tree:
[1924     35]
[  37   2426]
  • False positives (FP): 35
  • False negatives (FN): 37
The confusion matrix shows that this model produces 35 false positives (FP) and 37 false negatives (FN), which indicates that it performs quite well. The other algorithms are evaluated next to see whether accuracy can be improved and the FP and FN counts reduced.
(2) Random Forest
The Random Forest model performs better than Decision Tree, with an accuracy of 98.64%. The evaluation metrics for this model are shown in Table 2.
Confusion matrix, Random Forest:
[1920     39]
[  21   2442]
  • False positives (FP): 39
  • False negatives (FN): 21
The confusion matrix shows that this model produces 39 false positives (FP) and 21 false negatives (FN), a total of 60 misclassifications compared with 72 for Decision Tree, although its FP count is slightly higher (39 versus 35). This advantage is consistent with Random Forest being an ensemble method that combines several decision trees to improve generalization and reduce overfitting. These results indicate that the Random Forest model is more stable and accurate than Decision Tree.
(3) Support Vector Machine (SVM)
The SVM model has lower accuracy than Decision Tree and Random Forest, at 92.76%. The evaluation metrics for this model are shown in Table 3.
Confusion matrix, SVM:
[1765    194]
[ 126   2337]
  • False positives (FP): 194
  • False negatives (FN): 126
The confusion matrix shows that this model produces 194 false positives (FP) and 126 false negatives (FN), a higher number of misclassifications than the other models. The lower performance may be because the model has difficulty handling complex data and overlap between classes. The model still provides relatively good results, but it is inferior to the decision tree-based models, Decision Tree and Random Forest.
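The reported accuracies can be checked directly from the confusion-matrix counts. The short sketch below does this for the three models, using the diagonal entries of each matrix together with the FP and FN counts listed for it. Two things are assumptions of this example rather than statements from the study: the matrices are read in the layout [[TN, FP], [FN, TP]], and label 1 is treated as the positive class.

```python
def evaluate(tn, fp, fn, tp):
    """Accuracy plus precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "errors": fp + fn,
    }


# (TN, FP, FN, TP): diagonal entries plus the FP and FN counts reported above.
# Treating label 1 as the positive class is an assumption of this sketch.
reported = {
    "Decision Tree": (1924, 35, 37, 2426),
    "Random Forest": (1920, 39, 21, 2442),
    "SVM": (1765, 194, 126, 2337),
}

results = {name: evaluate(*counts) for name, counts in reported.items()}
for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.2%}, errors={r['errors']}")
```

With these counts each model is evaluated on the same 4422 test samples, and the computed accuracies match the reported 98.37%, 98.64%, and 92.76%, with 72, 60, and 320 misclassifications respectively.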
B. Model Performance Analysis
Based on the evaluation results, the accuracy of the three models is compared in Table 4.
As Table 4 shows, Random Forest is the best model because it has the highest accuracy and the fewest misclassifications. This model is superior in handling data variability and reducing the risk of overfitting, which is the main weakness of the Decision Tree model.
Decision Tree still performs well, with accuracy close to that of Random Forest, but because it relies on a single decision tree, it is more susceptible to overfitting than ensemble methods such as Random Forest.
Meanwhile, SVM performs worse than the other two models, with a much higher number of false positives and false negatives. This indicates that the model has difficulty distinguishing the classes, so it is not recommended for this classification scenario.
In summary, Random Forest is recommended for classifying the data tested in this study. The Decision Tree model can serve as an alternative if a simpler model with relatively good performance is needed, while the SVM model is less recommended because it has a higher classification error rate than the other models.

5. Conclusions

Based on the model evaluation, this study found that Random Forest has the best performance of the three models, with an accuracy of 98.64%. This model also has a lower classification error rate than the other models, making it more reliable in making predictions. The Decision Tree model performed quite well, with an accuracy of 98.37%. Although slightly less accurate than Random Forest, this model is simpler and easier to interpret, so it can still be a viable alternative in some scenarios. The Support Vector Machine (SVM) model showed lower performance than the two decision tree-based models, with an accuracy of 92.76%. This model has a higher misclassification rate, making it less recommended in the context of this study. Overall, this study indicates that decision tree-based methods, especially Random Forest, are the most suitable choice for classifying the data in this study.

Author Contributions

Conceptualization, A.E.; methodology, A.E.; software, S.P.; validation, A.E.; formal analysis, S.P. and N.A.; investigation, R.R.; resources, R.R. and N.A.; data curation, S.P. and N.A.; writing—original draft preparation, N.A.; writing—review and editing, R.R.; visualization, S.P.; supervision, A.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Manguma, T.T.F.; Fatra, E. Performance Analysis of Classification Algorithms for Spam Detection in Email. Innov. J. Soc. Sci. Res. 2024, 4, 16461–16465.
  2. Windarni, V.A.; Nugraha, A.F.; Ramadhani, S.T.A.; Istiqomah, D.A.; Puri, F.M.; Setiawan, A. Phishing website detection using filter technique on machine learning model. Inf. Syst. J. (INFOS) 2023, 6, 39–43.
  3. Fatiha, M.R.; Setiawan, I.; Ikhsan, A.N.; Yunita, I.R. Optimization of web-based phishing detection system using decision tree algorithm. IT CIDA Sci. J. Inf. Technol. Dissem. 2024, 10, 97–108. Available online: https://www.kaggle.com (accessed on 13 June 2025).
  4. Alswailem, A.; Alabdullah, B.; Alrumayh, N.; Alsedrani, A. Detecting Phishing Websites Using Machine Learning. In Proceedings of the 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 1–3 May 2019; pp. 1–6.
  5. Jagadeesan, S.; Sameer; Singh, D.; Ojha, R.; Ibrahim, R.K.; Alazzam, M.B. Implementation of an Artificial Intelligence with Cyber Security in E-Learning-Based Education Management System. In Proceedings of the 2023 4th International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 12–13 December 2023; pp. 1–5.
  6. Parekh, S.; Parikh, D.; Kotak, S.; Sankhe, S. A New Method for Detection of Phishing Websites: URL Detection. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; pp. 949–952.
  7. Nugraha, A.F.; Faticha, R.; Aziza, A.; Pristyanto, Y. Application of Stacking and Random Forest Methods to Improve Classification Performance in the Phishing Web Detection Process. Infomedia 2022, 7, 1.
  8. Harahap, A.D.; Juardi, D.; Irawan, A.S.Y. Design of phishing link detection system using web-based random forest algorithm. J. Inform. Appl. Electr. Eng. 2024, 12, 2677–2686.
  9. Mutmainnah, S.; Lorosae, T.A.; Ramadhan, S. Text Embedding and TF-IDF+Ngram Models to Improve the Performance of Binary Classifier Algorithms in Fake SMS Classification. J. Sist. Inf. (JSI) 2025, 4, 55–64. Available online: https://ojs.trigunadharma.ac.id/index.php/jsi (accessed on 19 June 2025).
  10. Raihan, A.; Fadhli, M. Implementation of deep learning for detecting phishing attacks on websites with combination of cnn and lstm. J. Inf. Eng. (JUTIF) 2024, 5, 1451–1459.
  11. Vebriani, M.; Yustanti, W. Classification of DANA Kaget Phishing Link Detection Using Website-Based Support Vector Machine Method. J. Inform. Comput. Sci. 2024, 6, 408–416.
Table 1. Decision Tree model accuracy.
Class      Precision   Recall   F1-Score   Support
−1         0.98        0.98     0.98       1959
1          0.99        0.98     0.98       2463
Accuracy   98.37%                          4422

Table 2. Random Forest model accuracy.
Class      Precision   Recall   F1-Score   Support
−1         0.98        0.98     0.98       1920
1          0.99        0.98     0.98       2442
Accuracy   98.64%                          4362

Table 3. Accuracy of Support Vector Machine (SVM) Model.
Class      Precision   Recall   F1-Score   Support
−1         0.93        0.90     0.92       1765
1          0.92        0.95     0.94       2337
Accuracy   92.76%                          4102

Table 4. Comparison of the three models.
Model           Accuracy   FP    FN
Decision Tree   98.37%     35    37
Random Forest   98.64%     39    21
SVM             92.76%     194   126
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Aniyansyah, N.; Rina, R.; Puspitasari, S.; Erfina, A. Evaluation of AI Models for Phishing Detection Using Open Datasets. Eng. Proc. 2025, 107, 37. https://doi.org/10.3390/engproc2025107037
