Real-Time Phishing URL Detection Using Machine Learning

Rehman, Atta Ur; Imtiaz, Irsa; Javaid, Sabeen; Muslih, Muhamad

doi:10.3390/engproc2025107108

Open Access Proceeding Paper

Real-Time Phishing URL Detection Using Machine Learning^†

¹ Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan

² Department of Information System, Nusa Putra University, Sukabumi 43155, Indonesia

^* Author to whom correspondence should be addressed.

^† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.

Eng. Proc. 2025, 107(1), 108; https://doi.org/10.3390/engproc2025107108

Published: 25 September 2025

(This article belongs to the Proceedings of The 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society)

Abstract

This study investigates the use of machine learning for the real-time detection of phishing URLs, a critical cybersecurity concern. The dataset used in this work was collected from the University of California Irvine (UCI) Machine Learning Repository. It has 235,795 instances with fifty-four distinct attributes. The label is binomial, with two target classes. We used a range of algorithms, including k-nearest neighbor, naive Bayes, decision tree, random forest, and random tree, to assess the discriminative characteristics extracted from URLs. The random forest classifier outperformed the other classifiers, reaching the highest accuracy of 99.99%. The study demonstrates that these models achieve high accuracy in identifying phishing attempts, significantly outperforming traditional detection methodologies. The findings underscore the potential of machine learning to provide a scalable, efficient, and robust solution for real-time phishing detection. Integrating such models into existing security solutions can play a critical role in sustaining defenses against continuously evolving and persistent phishing schemes.

Keywords:

phishing detection; machine learning; real-time security; URL classification; random forest

1. Introduction

In today’s digital-first world, a website is a primary tool for reaching an audience. It is an important part of an organization’s strategic plan, helping to boost brand recognition and to put the business in front of prospective clients, partners, and investors. Phishing is a type of online scam in which criminals try to deceive a person into submitting personal details, account numbers, credit card numbers, or passwords by imitating authentic organizations, such as banks or even restaurants. Attackers often send fake email messages that look authentic and direct recipients to a link that contains malicious code or prompts them to reveal personal details. These strategies exploit victims’ trust for personal gain, resulting in monetary fraud, identity theft, or unauthorized control of personal accounts. Phishing attacks are a constant threat in cyberspace because they can target anyone, regardless of age or technical skill.

This work focuses on one specific area of phishing: attacks delivered through website URLs. Phishing URLs are an ever-growing problem because people can be manipulated and tricked into doing something they would not normally do. A cybercriminal mimics a legitimate website by making minor changes in spelling or adding slight differences that create an authentic-looking copy. Such URLs are crafted to resemble those of well-designed, trustworthy websites, so users can be tricked into providing personal details that are then used to defraud them. Because of the anonymity the internet affords, attackers often escape legal action and can perform phishing attacks freely. Users cannot fully eliminate the risk of becoming a victim; the main defenses are awareness of the signs of phishing and caution and skepticism when visiting unfamiliar web addresses. Phishing campaigns take various forms globally, such as malicious advertisements, fraudulent emails, messages, and posts. According to the 2024 Security Risk Report [1], phishing URLs have had a dramatic worldwide impact, with 94% of the surveyed firms falling victim to such attacks. The effects are serious: 96% of the impacted organizations reported financial losses, 57% saw revenue declines owing to client attrition, and 40% suffered reputational harm. A total of 51% of data breaches resulted in disciplinary action against employees, with 67% of the individuals implicated experiencing personal consequences. These figures demonstrate the broad impact of phishing URLs and the need for stronger information security defenses.

Phishing URLs continue to pose a substantial danger to internet security, and proactive measures are required to reduce their impact and protect users from future abuse. One solution is to check every link before using it to log in to a website or to enter personal, banking, or other sensitive information. However, manually checking each URL is ineffective and does not scale. A more frequently used strategy is database comparison, which compares a requested URL to a list of known phishing sites. If a match is identified, access to the site is restricted, and the user is notified. Despite its utility, this strategy fails if the phishing URL has not already been reported.
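The database-comparison strategy can be illustrated with a minimal sketch. The blocklist entries, the `normalize` helper, and the `is_blocklisted` function below are hypothetical and are not part of the system studied here; a real deployment would load a maintained feed of reported URLs.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries (assumption for illustration only).
BLOCKLIST = {
    "secure-login.example-bank.test/verify",
    "paypa1.example.test/signin",
}

def normalize(url: str) -> str:
    """Reduce a URL to lowercase host plus path, dropping scheme, query, and fragment."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    return host + path

def is_blocklisted(url: str) -> bool:
    """Return True only if the normalized URL has already been reported."""
    return normalize(url) in BLOCKLIST
```

A reported URL is caught regardless of scheme, letter case, or query string, but a URL that differs by even one path character is not, which is the weakness noted above for unreported phishing URLs.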

Keeping such a database updated with the latest phishing URLs requires continuous effort, because many of these sites are deactivated daily and the URLs are removed from the report after seven days. A flaw of this scheme is that attackers can reuse the same sites once they are delisted. Because of these disadvantages, researchers have turned to machine learning to help identify phishing URLs. The machine learning approach has attracted considerable interest because it can improve the detection of phishing URLs and overcome the drawbacks of the database approach. Classification algorithms and frameworks are important in identifying phishing websites because many attributes are concealed and attackers use varied patterns. These algorithms work by analyzing the context, content, and patterns of URLs to flag potential threats. The classifiers are trained to differentiate between genuine and phishing URLs using features such as the presence of certain keywords, misspellings, special characters, and domain blacklist status.
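As an illustration of URL-based features, the sketch below computes a few lexical attributes similar in spirit to those listed in Table 1. The function name, the feature names, and the exact definitions are assumptions made for this example and are not the dataset’s own extraction code; in particular, the subdomain count naively assumes a single-label top-level domain.

```python
import re
from urllib.parse import urlsplit

def lexical_features(url: str) -> dict:
    """Compute simple lexical features from a URL string (illustrative definitions)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    is_ip = bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))
    letters = sum(ch.isalpha() for ch in url)
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "is_domain_ip": is_ip,
        # Naive count: labels before "domain.tld"; an IP address has no subdomains.
        "num_subdomains": 0 if is_ip else max(host.count(".") - 1, 0),
        "letter_ratio": letters / len(url) if url else 0.0,
        "digit_ratio": digits / len(url) if url else 0.0,
        "num_equals": url.count("="),
        "num_qmarks": url.count("?"),
        "num_ampersands": url.count("&"),
    }
```

Each URL becomes a fixed-length numeric record, which is the form a classifier needs for training.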

In our experiments, machine learning classifiers such as random forests performed remarkably well in filtering phishing URLs. Because random forests handle large datasets with many attributes well, they can be used to assess a wide variety of URL features. They separate different kinds of URLs by combining the votes of many decision trees, each of which splits the data on feature thresholds. When trained on well-labeled datasets, these classifiers have significantly enhanced the accuracy and efficiency of phishing URL detection systems. In this research, we present an integrated machine learning system aimed at improving the detection and prevention of phishing site attacks. The study draws on a collection of phishing URLs obtained from a trusted source. We then assess several machine learning methods to identify the most accurate approach. To improve our model’s accuracy, we trained it using the phishing sites dataset from the University of California Irvine (UCI) Machine Learning Repository. This phishing URL dataset is among the largest available, including 100,945 phishing sites and 134,850 legitimate sites. Most of the URLs examined during the dataset’s construction are recent. The source data of the webpage and the URL are examined to extract various characteristics. We use RapidMiner to train our models for effective and accurate phishing detection with consistent results.

2. Literature Review

Phishing attacks are a major global danger to digital security, targeting both individuals and companies in order to obtain sensitive information, including passwords, credit card numbers, and personal details. These attacks usually use fraudulent emails, websites, or communications that imitate trustworthy sources, which makes early warning and prevention necessary. Machine learning techniques have demonstrated considerable promise in enhancing the detection and prevention of phishing URL attacks by analyzing various data attributes to identify underlying patterns. Machine learning algorithms, particularly those based on supervised learning, are trained on datasets containing features such as URL characters and metadata. By characterizing these features, the models identify fake URLs and differentiate them from real ones more accurately, which reduces the likelihood of successful phishing attacks. This literature review examines contemporary research on integrated machine learning frameworks and models for phishing attack detection and prevention, emphasizing their techniques, performance, and contribution to the field.

Machine learning, particularly supervised learning, is widely used for phishing attack detection. These algorithms are trained on datasets that include a variety of URL characteristics, which the ML model uses to identify the patterns of different phishing URLs. A large amount of work has already been carried out in this field, and some of it is presented here. Recent advances in machine learning (ML) have enabled the construction of new frameworks to detect these scams, and these frameworks apply different techniques to improve the sensitivity of detection systems. Yogendra Kumar and Basant Subba [2] proposed a security framework that involves several machine learning algorithms, namely random forest (RF), Neural Network (NN), Support Vector Machine (SVM), Logistic Regression (LR), and K-Nearest Neighbor (KNN). Their approach, carried out in a Google Colab Jupyter notebook and written in Python, showed 99.72% accuracy. Gupta, Krishna Yadav, and Imran Razzak [3] outlined a different method that performs lexical-based real-time identification of phishing URLs using machine learning. Their system reached 99.57% accuracy with the RF method, which was evaluated alongside KNN, LR, and SVM. The drawbacks of their solution, which was built using Python (3.10), ML, and DL, are a longer reaction time, higher dependence on third parties, and the inability to track newly launched websites. That study identifies areas for improvement in reaction time and flexibility while also showing that machine learning can identify phishing in real time. Lizhen Tang and Qusay H. Mahmoud [4] investigated a variety of antiphishing strategies, including list-based, heuristic, ML, and natural language processing approaches.

Their study demonstrated the need for improved accuracy and covered algorithms such as SVM, decision tree, RF, KNN, and Bagging, reporting 99.57% accuracy for a real-time system that required very little processing time. It highlights important areas where accuracy may be improved and gives a broad picture of the current state of ML-based phishing detection. Sadique et al. [5] proposed an automated real-time system to detect phishing URLs based on the NB, SVM, LR, ADB, Decision Tree (DT), Gradient Boosted Trees (GDT), Perceptrons (PE), KNN, and RF algorithms. In a real-time configuration, they achieved 99.72% accuracy and pointed out the need to include time-varying characteristics of URLs in future work to enhance detection. This research shows that incorporating dynamic characteristics is useful for enhancing real-time performance and that there is value in combining multiple approaches. Using LR, KNN, SVM, DT, NB, XG Boost, RF, and ANN, Mehmet Korkmaz, Ozgur Koray Sahingoz, and Banu Diri [6] presented a machine learning-based system for phishing detection. They achieved an accuracy of up to 94.59% for RF classifiers and planned to improve the system’s accuracy and response time in the future by incorporating deep learning models and hybrid algorithms.

As the work of Korkmaz et al. [6] suggests, deep learning and hybrid models may help to enhance the operation of phishing detection systems. Ammara Zamir, Hikmat Ullah Khan, and Tassawar Iqbal constructed a phishing website detection model with 97.3% accuracy using numerous ML techniques and feature selection algorithms [7]. They recommend evaluating the approach in real time and complementing it with other feature extraction algorithms. This work establishes the effectiveness of feature selection techniques and shows that several extraction models can be used together to improve phishing detection. Ali Aljofey, Qingshan Jiang, Qiang Qu, Mingqing Huang, and Jean-Pierre Niyigena [8] presented a phishing detection model based on a CNN applied to the URL, with a model accuracy of 98.58%. They noted some limitations, including a long training time and the likelihood that some legitimate websites are misclassified because they contain registration and login pages. Their work shows that deep learning models are viable and identifies further areas that could be optimized to enhance performance.

Subsequently, by combining the CASE feature architecture, Dong-Jie Liu, Guang-Gang Geng, Xiao-Bo Jin, and Wei Wang [9] established an efficient multistage phishing website detection model using CNN and LSTM for deep learning, along with ML algorithms such as NB, DT, and RF. They achieved a TPR of 94.36% and suggested future work on feature augmentation and model layer fusion. This study indicates that multistage models and more complex feature extraction methodologies can enhance the accuracy of phishing detection. Alswailem et al. [10] used DT and other machine learning models to develop a method of detecting phishing sites with 98.8% accuracy. They tried different combinations of the dataset features and obtained nearly the same accuracy, with only minor variations. This suggests that detection relies on a small subset of the features, which may cause inconsistent accuracy in phishing URL detection.

Shouq Alnemari and Majid Alshammari [11] addressed the identification of phishing domains by employing ANN, SVM, DT, and RF. They reported an average accuracy of 97.3% for RF, 96% for DT, 95% for ANN, and 94% for SVM. They suggested that future research examine a larger number of separate ML algorithms for the analysis of phishing domains. Their study indicates that traditional machine learning approaches remain effective and that new approaches must be developed regularly to counteract phishing attacks. Together, these investigations show that machine learning algorithms are helpful in detecting phishing attempts, and they draw attention to ongoing research on improving live applicability, precision, and response time [12,13]. Addressing these gaps in future research will help to overcome existing shortcomings. Several of the reviewed studies rely on decision trees. A decision tree is a tree structure in which each internal node is a test on an attribute, each branch is an outcome of that test, and the terminal nodes, called leaves, contain a class label or a numerical value. The technique is useful for analyzing a decision process because decision trees are simple and easy to present graphically.

3. Proposed Methodology

The proposed technique uses machine learning (ML) classifiers inside an integrated framework to detect phishing URLs. The study started with gathering a dataset: the PhiUSIIL Phishing URL (Website) dataset from the UC Irvine Machine Learning Repository. Data cleaning followed, and then normalization and feature extraction were applied to the cleaned dataset. A complete detection model was built by training ML classifiers such as KNN, NB, RF, DT, and GBM on the prepared dataset. Individual classifiers were used along with ensemble learning techniques, which combine the predictions of several classifiers to enhance the performance of the model. Standard ethical procedures were followed throughout. In brief, this research aims to develop a practical and explainable phishing URL identification system using the ML approach that empowers cybersecurity professionals to quickly pinpoint and eliminate phishing threats. RapidMiner was used to implement the machine learning techniques.

3.1. Framework

A machine learning framework, as illustrated in Figure 1, provides an interface that enables developers to build and apply machine learning models efficiently. First, we selected a single dataset from the UC Irvine Machine Learning Repository. After collecting the data, we began the pre-processing stage, during which we cleaned the data and replaced any missing values. Following data cleaning, we performed feature selection so that we could evaluate only the attributes necessary for the experiment. SMOTE was used to generate samples for the minority class. We divided the dataset into training, validation, and test sets. To avoid overfitting, we trained the models on the training set and tuned the hyperparameters on the validation set. We then applied several machine learning classifiers to compare accuracy. To improve performance, we combined the predictions of several models using ensemble learning methods such as boosting and stacking, along with k-fold cross validation. Figure 1 shows the flow diagram of phishing URL detection.
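The k-fold cross validation step can be sketched as follows. This is a generic illustration rather than the RapidMiner operator used in the study; the function name `k_fold_indices` and the choice of contiguous folds are assumptions, and the data are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

A model is trained k times, each time on the training indices of one pair, and the k test scores are averaged to give a more stable performance estimate than a single split.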

3.2. Dataset

We used the repository’s phishing URL dataset. It has 235,795 instances with 54 attributes of different types (integer, categorical, and real). The target class label has two values: 1 for “phishing” and 0 for “non-phishing”. The dataset has 100,945 phishing URLs and 134,850 legitimate URLs. Most of the URLs are recent, providing current information for efficient classification. The dataset supports model performance in phishing URL detection by offering a wide variety of training cases, which strengthens the model and reduces its propensity for overfitting. Table 1 lists attributes of the PhiUSIIL Phishing URL dataset.

3.3. Replacing Missing Values

In machine learning, replacing missing values refers to the process of filling in missing data points inside a dataset. Values can be missing for several reasons, such as intentional omission, equipment malfunctions, and mistakes in data collection. Since many machine learning algorithms cannot handle missing data, imputed values are important. Ignoring missing values might lead to biased or incorrect conclusions. We applied this step to avoid errors and poor performance. It is important to investigate the implications of imputing missing data and to examine the influence of alternative imputation techniques on machine learning model performance.
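One common imputation strategy is sketched below: the column mean for numeric attributes and the most frequent value for categorical ones. The text does not state which strategy was used in the study, so this choice and the function name `impute_column` are assumptions for illustration.

```python
from statistics import mean, mode

def impute_column(values):
    """Replace None entries with the column mean (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn from; leave the column unchanged
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]
```

Other strategies, such as median or model-based imputation, fit the same interface and can be compared by their effect on validation accuracy.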

3.4. Feature Selection

Feature selection in machine learning is the process of selecting a subset of important attributes from a larger collection in order to build a model that avoids overfitting while improving interpretability and performance. Selecting the right attributes is essential, since redundant or irrelevant ones can degrade results. Feature selection also improves generalization and model interpretation.

3.5. Split Data

Dividing the data into subsets is a crucial step in machine learning to ensure that the model generalizes to new input. The data are commonly divided into training, validation, and test sets. Most of the data, known as the training set, is used to train the model; this set normally contains 70–80% of the data. The validation set, which takes the remaining share, is used to tune hyperparameters. The goal of the test set is to assess the model’s final performance following training and validation. It should resemble the data the model would encounter in the real world and typically contains 10 to 15% of the data.
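A minimal sketch of such a split is given below. The 70/15/15 proportions, the stratification by class label, and the function name `stratified_split` are assumptions chosen for illustration; the exact proportions used in the study are not stated beyond the ranges above.

```python
import random
from collections import defaultdict

def stratified_split(labels, train=0.70, val=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx), preserving class proportions in each part."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_val = round(len(indices) * val)
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to the test set
    return train_idx, val_idx, test_idx
```

Splitting per class keeps the phishing-to-legitimate ratio similar in all three sets, so that validation and test scores are measured on the same class mix as training.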

3.6. SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) is used to correct class imbalance in datasets by creating synthetic samples for the minority class. It creates new instances by interpolating between existing minority samples, which helps to balance the class distribution. This strategy enhances model performance and prevents bias toward the majority class, especially in settings with underrepresented instances, such as fraud detection, medical diagnosis, and phishing detection. To address class imbalance, we used SMOTE on our dataset, which contains 100,945 phishing URLs and 134,850 legitimate URLs.
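The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the operator used in the study; the function name `smote`, the default k = 5, and the assumption that exact balance is the target are all made for this example.

```python
import math
import random

PHISHING, LEGITIMATE = 100_945, 134_850
# Synthetic phishing samples needed if exact class balance were the goal (assumption).
SAMPLES_FOR_BALANCE = LEGITIMATE - PHISHING  # 33,905

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stay inside the region already occupied by the minority class.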

3.7. Filter Examples

The Filter Examples operator enabled us to delete unnecessary or anomalous rows from the dataset according to specific criteria, ensuring the dataset’s relevance and quality. Filtering out these cases improved the quality and efficiency of our machine learning models, resulting in more precise and dependable phishing detection.

3.8. Machine Learning Models

In machine learning model training, an important step is to select the algorithms that perform best on the dataset at hand. Because our experiments used a large dataset, we applied a range of machine learning classifiers: KNN, DT, NB, NB kernel, RT, and RF. We designed a framework and then tested each classifier on the dataset. All the classifiers performed well, with some achieving higher accuracy than others. Their results are reported in Section 4.

3.8.1. Random Forest

Random forest is an ensemble learning method that uses many decision trees to increase prediction accuracy while reducing overfitting. Combining the output of many trees results in a more reliable and robust model. Random forest is known for its high accuracy, resistance to overfitting, and ability to manage large, complex datasets. When applied to our dataset, this method delivered an accuracy of 99.99%. The algorithm predicts the target label by building a forest, that is, a collection of decision trees each trained on a random sample of the data and a random subset of the features, and aggregating their votes. The random forest classifier is relatively robust to uneven class distributions and handles big datasets with ease.
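The two ingredients of a random forest, bootstrap sampling and majority voting over randomized trees, can be sketched as follows. This is a deliberately simplified illustration and not the implementation used in the study: each "tree" is a depth-one decision stump, the feature subset is drawn once per tree rather than at every split, and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    subset_size = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]          # bootstrap sample
        features = rng.sample(range(n_features), subset_size)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Return the majority vote of all stumps for one row."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]
```

Because each stump sees a different sample and feature subset, individual errors tend to be uncorrelated, and the majority vote is more reliable than any single stump.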

3.8.2. Naïve Bayes and Naïve Bayes Kernel

The NB classifier is a probabilistic classifier that applies Bayes’ theorem. It assumes that features are conditionally independent given the class, which is the “naive” assumption, and computes the likelihood of each class to make predictions. Naive Bayes is a rapid and efficient approach that is commonly used for classification, particularly with high-dimensional data such as text. Under the independence assumption, the method estimates the probability that a data point x belongs to each class c from the likelihood of its feature values within that class, and it predicts the class with the highest posterior probability. The formula for Bayes’ theorem is as follows:

P (c | x) = \frac{P (x | c) \times P (c)}{P (x)}

The naive Bayes kernel classifier is a variation that uses kernel density estimation to estimate the probability density function of each feature. This can improve accuracy when a feature does not follow a simple parametric distribution such as the Gaussian; the independence assumption itself is retained. This approach can deal with more complicated data distributions.
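A minimal sketch of the naive Bayes decision rule is shown below: the predicted class is the one that maximizes the prior P(c) multiplied by the per-feature likelihoods, computed in log space. Gaussian likelihoods are assumed here for simplicity, and the function names are hypothetical; this is not the RapidMiner operator used in the study.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class priors and per-feature mean and variance."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mu = sum(column) / len(column)
            # Small constant avoids division by zero for constant features.
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mu, var))
        model[label] = (prior, stats)
    return model

def predict_nb(model, row):
    """Return the class with the highest log P(c) + sum_i log P(x_i | c)."""
    def log_posterior(prior, stats):
        total = math.log(prior)
        for value, (mu, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))
```

The denominator P(x) is the same for every class, so it can be dropped when only the most probable class is needed.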

3.8.3. KNN

The K-Nearest Neighbor (KNN) classifier is a non-parametric, instance-based learning technique that categorizes data points according to the classes of their nearest neighbors. It makes no assumptions regarding feature independence and is suitable for both classification and regression problems. KNN is popular because of its simplicity and ease of implementation. It estimates the class or value of a new data item by taking the most frequently occurring class among its k closest neighbors (classification) or by averaging their values (regression). Closeness is commonly measured with the Euclidean distance between two points x and y with n features:

D (x, y) = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}
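The KNN rule with Euclidean distance can be sketched as follows. The function name `knn_predict` and the default k = 3 are assumptions for illustration; the value of k used in the study is not stated.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```

No model is fitted in advance: all computation happens at prediction time, which is why KNN can be slow on datasets of this size unless an index structure is used.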

3.8.4. Decision Tree

A DT is a tree-structured classifier that separates data into smaller groups based on the values of input attributes, resulting in a sequence of decisions that lead to a final classification. Decision trees have the advantage of being readable, easy to grasp, and capable of handling both numerical and categorical data. A decision tree is a supervised machine learning approach and can handle both regression and classification. To build a decision tree, the entropy is determined first; it measures the uncertainty of the data at a node by examining the class labels.

The entropy of node i can be calculated using the following formula, where p (i, k) is the proportion of samples at node i that belong to class k and n is the number of classes.

H_{i} = - \sum_{k = 1}^{n} p (i, k) \log_{2} p (i, k)

After determining the entropy, the Information Gain of splitting a set S on attribute A can be computed using the formula given below, where S_{v} is the subset of S for which A takes the value v and H denotes the entropy:

Information Gain (S, A) = H (S) - \sum_{v} \frac{|S_{v}|}{|S|} H (S_{v})
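A short worked sketch may help here. It uses the standard definitions, entropy as the negative sum of p log2 p over the classes and information gain as the parent entropy minus the size-weighted entropy of the subsets produced by a split. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Parent entropy minus the size-weighted entropy of each attribute-value subset."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted
```

For four samples with labels 0, 0, 1, 1, the entropy is 1 bit. An attribute that separates the two classes perfectly has an information gain of 1 bit, while an attribute whose subsets each contain one sample of each class has a gain of 0, so the tree would split on the former.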

3.8.5. Random Tree

Random tree is similar to random forest, but instead of combining many trees, it uses a single tree that considers a randomly chosen subset of attributes at each split. Random tree is simpler and faster than random forest.

4. Results

This section presents the experimental results for the proposed system. The proposed model is trained and evaluated on a single dataset, which we examine using a variety of machine learning classifiers and approaches. Using a large dataset for phishing detection decreases the possibility of error while increasing applicability across various internet user communities. It also enables rigorous feature selection, resulting in more accurate and reliable detection. The accuracy scores for the different classifiers are shown in Table 2.

Table 2 compares the machine learning classifiers that we applied to our dataset. In our experiment, we used one dataset for phishing URL detection, and every classifier performed very well on it: KNN, naïve Bayes, naïve Bayes kernel, random forest, random tree, and decision tree. Figure 2 illustrates the accuracy obtained with the different classifiers.
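For reference, accuracy and the related metrics that are commonly reported alongside it can be computed from predicted and true labels as sketched below. The function name and the toy labels in the usage note are illustrative assumptions and do not reproduce the study’s results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With true labels [1, 1, 1, 0, 0, 0, 0, 0] and predictions [1, 1, 0, 0, 0, 0, 0, 1], six of eight predictions are correct, giving an accuracy of 0.75, while precision and recall are both 2/3.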

Table 3 presents the performance of the various classifiers used for phishing URL detection across multiple studies. Random forest (RF) is the most frequently employed classifier, consistently achieving high accuracy scores ranging from 94.36% to 99.99%. Other models, such as Convolutional Neural Networks (CNNs), Logistic Regression (LR), and Computer Vision-based approaches (CV), also show competitive performance, though with slightly lower accuracy in some cases. The table highlights the effectiveness of ensemble and deep learning methods in identifying phishing URLs.

5. Conclusions and Future Work

This study on real-time phishing detection shows that several machine learning algorithms, such as KNN, random forest, and decision tree, can accurately detect phishing URLs using characteristics extracted from the URLs. Random forest achieved the highest accuracy, 99.99%. The study demonstrated substantial effectiveness in real-time detection. Future research might concentrate on increasing scalability and efficiency for real-world applications, incorporating advanced deep learning techniques, and expanding feature sets to include real-time user behavior monitoring. Creating adaptive models that continually learn from new phishing strategies and connecting these systems with wider security infrastructures will be critical for staying ahead of evolving attacks.

Author Contributions

A.U.R. conceptualized the study and supervised the project; I.I. and S.J. performed data collection, preprocessing, and analysis; M.M. contributed to methodology design and model validation. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no funding for this research work.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Egress. Phishing Statistics for 2024. Available online: https://www.egress.com/blog/phishing/phishing-statistics-round-up (accessed on 31 July 2024).
2. Kumar, Y.; Subba, B. A lightweight machine learning based security framework for detecting phishing attacks. In Proceedings of the 2021 International Conference on Communication Systems and Networks (COMSNETS), Bangalore, India, 5–9 January 2021.
3. Gupta, B.B.; Yadav, K.; Razzak, I.; Psannis, K.; Castiglione, A.; Chang, X. A novel approach for phishing URLs detection using lexical based machine learning in a real-time environment. Comput. Commun. 2021, 175, 1–22.
4. Tang, L.; Mahmoud, Q.H. A survey of machine learning-based solutions for phishing website detection. Machines 2021, 3, 34.
5. Sadique, F.; Kaul, R.; Badsha, S.; Sengupta, S. An automated framework for real-time phishing URL detection. In Proceedings of the 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2020.
6. Korkmaz, M.; Sahingoz, O.K.; Diri, B. Detection of phishing websites by using machine learning-based URL analysis. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020.
7. Zamir, A.; Khan, H.U.; Iqbal, T.; Yousaf, N.; Aslam, F.; Anjum, A.; Hamdani, M. Phishing website detection using diverse machine learning algorithms. Electron. Libr. 2020, 38, 1.
8. Aljofey, A.; Jiang, Q.; Qu, Q.; Huang, M.; Niyigena, J.P. An effective phishing detection model based on character-level convolutional neural network from URL. Electronics 2020, 9, 1514.
9. Liu, D.J.; Geng, G.G.; Jin, X.B.; Wang, W. An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment. Comput. Secur. 2021, 110, 102421.
10. Alswailem, A.; Alabdullah, B.; Alrumayh, N.; Alsedrani, A. Detecting Phishing Websites Using Machine Learning. In Proceedings of the 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 1–3 May 2019; pp. 1–6.
11. Alnemari, S.; Alshammari, M. Detecting phishing domains using machine learning. Appl. Sci. 2023, 13, 84649.
12. Ashfaq, F.; Jhanjhi, N.; Khan, N.; Muzafar, S.; Das, S. CrimeScene2Graph: Generating Scene Graphs from Crime Scene Descriptions Using BERT NER. In Proceedings of the International Conference On Computational Intelligence In Pattern Recognition, Sonepat, India, 19–20 April 2024; pp. 183–201.
13. Aldughayfiq, B.; Ashfaq, F.; Jhanjhi, N.; Humayun, M. Capturing semantic relationships in electronic health records using knowledge graphs: An implementation using mimic iii dataset and graphdb. Healthcare 2023, 11, 1762.

Figure 1. Schemes follow the same formatting.

Figure 2. Performance metrics of classifiers on phishing URLs.

Table 1. All relevant attributes of a phishing website URL.

File Name	URL	URL Length	Domain	Domain Length
Is Domain IP	TLD	URL Similarity Index	Char Continuation Rate	TLD Legitimate Prob
URL Char Prob	TLD Length	No Of Sub Domain	Has Obfuscation	No Of Obfuscated Char
Obfuscation Ratio	No Of Letters In URL	Letter Ratio In URL	No Of Digits In URL	Digit Ratio In URL
No Of Equal Sign In URL	No Of Q Mark In URL	No Of Ampersand In URL	No Of Other Special Chars In URL	Special Char Ratio In URL
Is HTTPS	Line Of Code	Largest Line Length	Has Title	Title
Domain Title Match Score	URL Title Match Score	Has Favicon	Robots	Is Responsive
No Of URL Redirect	No Of Self Redirect	Has Description	No Of Popup	No Of Frame
Has External Form Submit	Has Social Net	Has Submit Button	Has Hidden Fields	Has Password Field
Bank, Pay	Crypto	Has Copyright Info	No Of Image	No Of CSS
No Of JS	No Of Self Ref	No Of Empty Ref	No Of External Ref	label
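
Several of the attributes in Table 1 depend only on the URL string, so they can be computed without fetching the page. The following Python function is a minimal sketch of that idea, not the dataset's own extraction code: the function name `lexical_features`, the dictionary keys, and the exact definitions (for example, ratios taken over the full URL length) are assumptions made for illustration.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few URL-only attributes named in Table 1.

    All definitions here are illustrative assumptions; the dataset's
    official formulas may differ.
    """
    parsed = urlparse(url)
    domain = parsed.hostname or ""  # empty string if no host is present
    length = len(url)
    n_letters = sum(ch.isalpha() for ch in url)
    n_digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": length,
        "domain": domain,
        "domain_length": len(domain),
        "is_https": int(parsed.scheme == "https"),
        "no_of_letters_in_url": n_letters,
        "letter_ratio_in_url": n_letters / length if length else 0.0,
        "no_of_digits_in_url": n_digits,
        "digit_ratio_in_url": n_digits / length if length else 0.0,
        "no_of_equal_sign_in_url": url.count("="),
        "no_of_q_mark_in_url": url.count("?"),
        "no_of_ampersand_in_url": url.count("&"),
    }
```

Calling `lexical_features("https://example.com/a?x=1&y=22")` returns a dictionary of plain numbers and strings that could be appended as one row of a feature table. Page-content attributes such as Has Favicon or No Of JS would additionally require downloading and parsing the HTML, which this sketch does not attempt.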

Table 2. Performance of classifiers on phishing website URLs with all relevant attributes.

Classifier’s Accuracy
Classifier	Accuracy	Classification Error	Recall	Precision	Kappa
KNN	99.77%	0.23%	99.56%	99.89%	0.995
Naïve Bayes	99.96%	0.04%	99.99%	99.92%	0.999
Naïve Bayes kernel	99.97%	0.03%	100%	99.93%	0.999
Random forest	99.99%	0.00%	99%	99%	0.999
Random tree	95.38%	4.62%	90.45%	98.59%	0.905
Decision tree	99.99%	0.01%	99.97%	100%	1.00
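
The columns of Table 2 are all functions of a binary confusion matrix. As a minimal sketch of how they relate, the Python function below derives accuracy, classification error, recall, precision, and Cohen's kappa from four counts. The function name `binary_metrics` and the counts in the usage note are hypothetical and are not taken from the study's experiments; the sketch also assumes each class occurs at least once, so no denominator is zero.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the Table 2 metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Assumes non-degenerate counts (no zero denominators).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    # Cohen's kappa corrects observed agreement (accuracy) for the
    # agreement expected by chance from the row and column totals.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "classification_error": 1 - accuracy,
        "recall": recall,
        "precision": precision,
        "kappa": kappa,
    }
```

For the hypothetical counts `binary_metrics(tp=90, fp=5, fn=10, tn=95)`, accuracy is 0.925, classification error is 0.075, recall is 0.9, and kappa is 0.85. Accuracy and classification error always sum to 1, and kappa is a unitless coefficient rather than a percentage.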

Table 3. Accuracy of different classifiers on phishing website URLs reported in prior studies.

Author	Year	Classifier	Accuracy
Kumar Y [2]	2021	RF	99.72%
Gupta B [3]	2021	RF	99.57%
Tang L [4]	2021	RF	99.57%
Sadique F [5]	2020	CV	86.6%
Korkmaz M [6]	2020	RF	94.59%
Zamir A [7]	2020	RF	97.3%
Aljofey N [8]	2020	CNN	98.58%
Liu D [9]	2021	RF	94.36%
Alswailem A [10]	2019	DT	98.8%
Alnemari S [11]	2023	RF	97.3%
Atta Ur Rehman (this study)	2024	RF	99.99%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
