1. Introduction
In today’s digital-first world, a website is a distinctive Internet-based tool for reaching an audience. Your website is an important part of your strategic plan: it helps boost brand recognition and puts your business in front of prospective clients, partners, and investors. Phishing is a modern type of web scam in which criminals try to deceive a person into submitting personal details, account numbers, credit card numbers, or passwords by imitating the image of an authentic organization, such as a bank or even a restaurant. Attackers often send fake email messages that look authentic, creating an opportunity to direct recipients to a link containing malicious code or to make them reveal their personal details. These strategies exploit victims’ trust for personal gain, resulting in monetary fraud, identity theft, or unauthorized control of personal accounts. Phishing attacks are a constant threat in cyberspace because they can be aimed at anyone, regardless of age or tech-savviness.
One specific area of phishing is the phishing attack on a website. Phishing URLs represent an ever-growing problem because people can be easily manipulated and tricked into doing something they would not normally do. A cybercriminal mimics a legitimate website by making minor changes in spelling or adding slight differences that create an authentic-looking copy. Such URLs, which hackers craft to resemble those of well-designed legitimate websites, trick users into providing personal details that are then used to defraud them. Because of the anonymity the internet provides, hackers often escape legal action and can perform phishing attacks freely. Nothing can fully eliminate the possibility of becoming a victim, so the best protection is being aware of the signs of phishing and exercising caution and skepticism when typing in unfamiliar web addresses. Phishing campaigns worldwide take various forms, such as harmful advertisements and fraudulent emails, messages, and posts. According to the 2024 Security Risk Report [1], phishing URLs have had a dramatic worldwide impact, with 94% of the surveyed firms falling victim to such attacks. These incidents have serious effects, with 96% of the impacted organizations reporting financial losses, 57% seeing revenue declines owing to client attrition, and 40% suffering reputational harm. A total of 51% of data breaches resulted in disciplinary action against employees, with 67% of the individuals implicated experiencing personal consequences, underlining the critical necessity for strong information security defenses. This demonstrates the broad impact of phishing URLs and emphasizes the crucial need for enhanced digital security measures. Phishing URLs continue to pose a substantial danger to internet security, and proactive measures are required to reduce their impact and protect customers from future abuse.
One solution is to check every link you use to log in to a website or to enter personal information, bank account details, or other sensitive information. However, manually checking each website URL is an ineffective and unsophisticated method of phishing detection. A more frequently used strategy is database comparison, which compares a requested URL to a list of known phishing sites. If a match is identified, access to the site is restricted, and the user is notified. Despite its utility, this strategy fails if the phishing URL has not already been reported.
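The database comparison strategy can be illustrated with a minimal sketch. The host names below are hypothetical placeholders, and the function names `normalize_host` and `is_blacklisted` are introduced here only for illustration; a deployed system would load its list from a reporting feed rather than a hard-coded set.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of reported phishing hosts (illustrative values only).
KNOWN_PHISHING_HOSTS = {"paypa1-login.example", "secure-bank-update.example"}


def normalize_host(url):
    """Return the lower-cased host name of a URL, or '' if none can be parsed."""
    host = urlsplit(url).hostname
    return host.lower() if host else ""


def is_blacklisted(url, blacklist=KNOWN_PHISHING_HOSTS):
    """True if the URL's host appears in the list of known phishing hosts."""
    return normalize_host(url) in blacklist


# A reported host is blocked; an unreported look-alike slips through.
print(is_blacklisted("http://PayPa1-Login.example/signin"))
print(is_blacklisted("http://paypa1-login2.example/signin"))
```

The second lookup shows the limitation described above: a phishing URL that has not yet been reported is not in the list and therefore is not blocked.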
Keeping such a database updated with the latest phishing URLs takes considerable effort because many of these sites are deactivated daily, and the URLs are removed from the report after seven days. A flaw of this scheme is that attackers can reuse the same sites once they are delisted. Because of these disadvantages, researchers have turned to machine learning to help identify phishing URLs. The machine learning methodology has attracted considerable interest because it can improve the detection of phishing URLs and overcome the drawbacks of the database approach. Classification algorithms and frameworks are important in identifying phishing websites because several attributes are often concealed and attackers use varied patterns. These algorithms work by analyzing the context, content, and patterns of URLs to flag potential threats. The classifiers are trained to differentiate between genuine and phishing URLs based on parameters such as the presence of certain keywords, misspellings, the characters used, and domain blacklist status.
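As a hedged illustration of the kind of URL parameters a classifier might be trained on, the sketch below derives a few lexical features from a URL string. The keyword list and the feature names are assumptions made for this example and are not the attributes of the dataset used in this study.

```python
import re
from urllib.parse import urlsplit

# Illustrative keyword list (an assumption, not taken from the study).
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def lexical_features(url):
    """Compute simple lexical features of a URL for use as classifier inputs."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special_chars": len(re.findall(r"[@\-_%=&?]", url)),
        # A raw IPv4 address in place of a domain name is a common warning sign.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parts.scheme == "https",
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = lexical_features("http://192.168.0.1/secure-login?id=7")
print(features)
```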
In our experiments, machine learning classifiers such as random forests perform remarkably well in filtering phishing URLs. Because random forests handle large datasets with many attributes well, they can be used to assess a wide range of URL features. They separate different kinds of URLs by combining the feature-based splits learned by many decision trees. When trained on a range of well-labeled datasets, these classifiers have significantly enhanced the accuracy and efficiency of phishing URL detection systems. In this research, we offer an original integrated machine learning framework with the goal of improving the detection and prevention of phishing site attacks. This study uses a collection of phishing URLs obtained from trusted sources. We then assess several machine learning methods to identify the most accurate approach. To improve our model’s accuracy, we trained it using the phishing URL dataset from the University of California Irvine (UCI) Machine Learning Repository. This dataset is among the largest available, including 100,945 phishing sites and 134,850 legitimate sites. Most of the URLs examined during the dataset’s construction are recent. The source code of each webpage and its URL are examined to extract various characteristics. We use RapidMiner to train our model for effective and accurate phishing detection, which ensures consistent results.
2. Literature Review
Phishing attacks are a major global danger to digital security, targeting both individuals and companies in order to obtain sensitive information, including passwords, credit card numbers, and personal details. These attacks usually use fraudulent emails, websites, or communications that imitate trustworthy sources, which makes early detection and prevention necessary. Machine learning techniques have demonstrated considerable promise in enhancing the detection and prevention of phishing URL attacks by analyzing various data attributes to identify underlying patterns. Machine learning algorithms, particularly those based on supervised learning, are trained on datasets containing features such as URL characters and metadata. By accurately characterizing these features, a model can better identify fake URLs and distinguish them from real ones, which decreases the likelihood of successful phishing attacks. This literature review examines contemporary research on integrated machine learning frameworks and models for phishing attack detection and prevention, emphasizing their techniques, performance, and contribution to the field.
Machine learning, particularly supervised learning, is widely used for phishing attack detection. These algorithms are trained on datasets that include a variety of URL characteristics, which the ML model uses to identify patterns across different phishing URLs. A large body of work already exists in this field, and some of it is presented here. Advances in machine learning (ML) have enabled the construction of new frameworks to detect these scams. These frameworks implement different techniques to improve the accuracy and sensitivity of detection systems. Yogendra Kumar and Basant Subba [2] contributed a security framework that involves several machine learning algorithms, namely random forest (RF), Neural Network (NN), Support Vector Machine (SVM), Logistic Regression (LR), and K-Nearest Neighbor (KNN). Their approach, implemented in Python in a Google Colab Jupyter notebook, achieved 99.72% accuracy. Gupta, Krishna Yadav, and Imran Razzak [3] outlined a different method that performs lexical-based real-time identification of phishing URLs using machine learning. Their system achieved 99.57% accuracy with the RF method and also employed KNN, LR, and SVM. The drawbacks of their solution, which was built using Python (3.10), ML, and DL, are a longer response time, higher dependence on third parties, and the inability to track newly launched websites. This study identifies areas for improvement in response time and flexibility while also highlighting that machine learning can identify phishing in real time. Lizhen Tang and Qusay H. Mahmoud [4] investigated a variety of antiphishing strategies, including list-based, heuristic, ML, and natural language processing approaches.
Their study demonstrated the need for increased accuracy and included algorithms such as SVM, decision tree, RF, KNN, and Bagging, which reported 99.57% accuracy in a real-time system that required very little processing time. This study highlights important areas where accuracy may be improved and gives a broad picture of the current state of ML-based phishing detection. Yogendra Kumar and Basant Subba [5] proposed an automatic real-time system to detect phishing URLs based on the NB, SVM, LR, ADB, Decision Tree (DT), Gradient Boosted Trees (GDT), Perceptron (PE), KNN, and RF algorithms. In a real-time configuration, they achieved 99.72% accuracy, and they pointed out the need to include time-varying characteristics of URLs in the future in order to enhance detection. This research makes clear that incorporating dynamic characteristics is useful for enhancing real-time performance and that there is value in combining multiple approaches. Using LR, KNN, SVM, DT, NB, XGBoost, RF, and ANN, Mehmet Korkmaz, Ozgur Koray Sahingoz, and Banu Diri [6] presented a machine learning-based system for phishing detection. They achieved an accuracy of up to 94.59% with RF classifiers and planned to improve the system’s accuracy and response time in the future by incorporating deep learning models and hybrid algorithms.
As the work of Korkmaz, Sahingoz, and Diri [6] shows, deep learning and hybrid models may help to enhance the operation of phishing detection systems. Ammara Zamir, Hikmat Ullah Khan, and Tassawar Iqbal constructed a phishing website detection model with 97.3% accuracy using numerous ML techniques and feature selection algorithms [7]. They recommend evaluating the approach in real time and complementing it with other feature extraction algorithms. This work establishes the effectiveness of feature selection techniques and shows that several extraction models can be used together to improve phishing detection. Ali Aljofey, Qingshan Jiang, Qiang Qu, Mingqing Huang, and Jean-Pierre Niyigena [8] presented a smart model for phishing detection based on a CNN and the URL, with a model accuracy of 98.58%. They noted some limitations, however: the model took a long time to train, and some websites are likely to be misclassified because they have registration and login pages. This work shows that deep learning models are viable and identifies further areas that could be optimized to enhance performance. Subsequently, by combining the CASE feature architecture, Dong-Jie Liu, Guang-Gang Geng, Xiao-Bo Jin, and Wei Wang [9] established an efficient multistage phishing website detection model using CNN and LSTM for deep learning, along with ML algorithms such as NB, DT, and RF. They achieved a TPR of 94.36% and suggested future work on feature augmentation and model layer fusion. This study suggests that additional multistage models and complex feature extraction methods should be used to enhance the accuracy of phishing detection. Amani Alswailem [10] also applied DT and other machine learning models to develop a 98.8% accurate method of detecting phishing sites. Different combinations of the dataset’s features were tried, but the accuracy stayed the same with only minor variations. This suggests that the dataset focuses on a subset of the features, which may cause inconsistent accuracy in phishing URL detection.
Shouq Alnemari and Majid Alshammari [11] addressed the identification of phishing domains by employing ANN, SVM, DT, and RF. They reported average accuracies of 97.3% for RF, 96% for DT, 95% for ANN, and 94% for SVM. They suggested that future research examine a larger number of separate ML approaches for the analysis of phishing domains. This study shows that traditional machine learning approaches are effective and that new approaches must be developed regularly in order to counteract phishing attacks. These investigations show that machine learning algorithms are helpful in detecting phishing attempts, and they also draw attention to ongoing research on enhancing real-time applicability, precision, and response time [12,13]. Addressing these gaps in future research will help overcome existing shortcomings and future problems. A decision tree has a tree structure in which each internal node is a test on an attribute, each branch is an outcome of that test, and the terminal nodes, called leaves, contain a class label or a numerical value. The technique is useful for analyzing a decision process because decision trees are uncomplicated and easy to present graphically.
3. Proposed Methodology
The proposed technique for this research study uses machine learning (ML) classifiers inside an integrated framework to detect phishing URLs. The study started with gathering a dataset, the PhiUSIIL Phishing URL (Website) dataset from the UC Irvine Machine Learning Repository. Data cleaning, normalization, and feature extraction followed. A complete detection model was built by training ML classifiers such as KNN, NB, RF, DT, and GBM on the prepared dataset. Individual classifiers were used along with ensemble learning techniques that combine the predictions of several classifiers to enhance the performance of the model. Standard ethical procedures are addressed appropriately throughout. In brief, this research aims to develop a practical and explainable phishing URL identification system using ML that empowers cybersecurity professionals to quickly pinpoint and eliminate phishing threats. RapidMiner was used to implement the machine learning techniques.
3.1. Framework
A machine learning framework, as illustrated in Figure 1, provides an interface that enables developers to build and apply machine learning models efficiently. First, we selected a single dataset from the UC Irvine Machine Learning Repository. After collecting the data, we began the pre-processing stage, during which we cleaned the data and replaced any missing values. Following data cleaning, we performed feature selection so that we could evaluate only the parameters necessary for the experiment. SMOTE was used to generate samples for the minority class. We divided the dataset into training, validation, and test sets. To avoid overfitting, we trained the models on the training set and tuned the hyperparameters on the validation set. We then applied different machine learning classifiers to compare accuracy. To improve performance, we combined the predictions of several models using ensemble learning methods such as boosting and stacking, along with k-fold cross validation. Figure 1 shows the flow diagram of phishing URL detection.
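The k-fold cross validation step mentioned above can be sketched in a few lines. This is a minimal illustration, not the RapidMiner operator used in the study; the function name `k_fold_indices` and the seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once and for training in the remaining folds, so the averaged score depends less on any single split.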
3.2. Dataset
We used the repository’s PhiUSIIL Phishing URL dataset. It has 235,795 instances with 54 attributes of different types (integer, categorical, and real). The target class label has two values: 1 for “phishing” and 0 for “non-phishing”. The dataset has 100,945 phishing URLs and 134,850 legitimate URLs. Most of the URLs are recent, providing current information for efficient classification. The dataset enhances model performance in machine learning for phishing URL detection by offering a variety of training cases, strengthening the model, and reducing its propensity for overfitting.
Table 1 shows all the attributes of the PhiUSIIL Phishing URL dataset.
3.3. Replacing Missing Values
In machine learning, replacing missing values refers to the process of filling in missing data points in a dataset. There are several possible causes of missing values, such as intentional omission, equipment malfunctions, and mistakes in data collection. Since many machine learning algorithms struggle to handle missing data, imputed values are important. Ignoring missing values might lead to biased or incorrect conclusions. We applied this step to avoid errors and poor performance. It is important to investigate the implications of imputing missing data carefully and to examine the influence of alternative imputation techniques on machine learning model performance.
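The text does not state which imputation rule was applied, so the sketch below assumes one common choice: the column mean for numeric attributes and the most frequent value for categorical ones. The function name `impute_column` is hypothetical.

```python
from statistics import mean, mode


def impute_column(values):
    """Replace None entries with the column mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a fill value from
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


print(impute_column([1.0, None, 3.0]))
print(impute_column(["com", None, "com", "net"]))
```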
3.4. Feature Selection
Feature selection in machine learning is the process of selecting a subset of important features from a larger collection to build a model that avoids overfitting while improving interpretability and performance. Selecting the right attributes is essential since using redundant or irrelevant ones might produce disappointing results. It improves generalization and model interpretation.
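One simple way to realize feature selection is a filter that ranks each attribute by the strength of its correlation with the class label. The study does not specify its selection criterion, so the Pearson-based ranking below is an assumption chosen for illustration, and the column names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_features(columns, labels, k):
    """Return the names of the k columns most correlated with the labels."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


columns = {"a": [0, 0, 1, 1], "b": [1, 0, 1, 0], "c": [5, 5, 5, 5]}
print(select_top_features(columns, labels=[0, 0, 1, 1], k=1))
```

Here column `a` tracks the label exactly, `b` is uncorrelated with it, and `c` is constant, so only `a` survives the filter.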
3.5. Split Data
Dividing data into subsets is a crucial step in machine learning to ensure that your model generalizes to new input. The data is commonly divided into training, validation, and test sets. Most of your data, known as the training set, is used to train the model; this set normally contains 70–80% of your data. The validation set is used to tune hyperparameters during training. The goal of the test set is to assess your model’s final performance following training and validation. It should resemble the data your model would encounter in the real world. Typically, it contains 10 to 15% of your data.
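A minimal sketch of such a split follows. The 70/15/15 proportions fall within the ranges quoted above, but the exact ratios, the seed, and the function name `train_val_test_split` are assumptions made for the example.

```python
import random


def train_val_test_split(rows, train=0.7, val=0.15, seed=42):
    """Shuffle rows and cut them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder becomes the test set
    )


train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting avoids accidental ordering effects, and a fixed seed keeps the split reproducible between runs.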
3.6. SMOTE
The Synthetic Minority Oversampling Technique (SMOTE) corrects class imbalance in datasets by creating synthetic samples for the minority class. It creates new instances by interpolating between existing minority samples, which helps to balance the class distribution. This strategy enhances model performance and prevents bias toward the majority class, especially in circumstances with underrepresented instances, such as fraud detection, medical diagnosis, and phishing detection. To address this class imbalance, we used SMOTE on our dataset, which contains 100,945 phishing URLs and 134,850 legitimate URLs.
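The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration that assumes numeric feature tuples and at least two minority samples; the function name `smote` and the defaults for `k` and `seed` are choices made for the example, not the settings used in the study.

```python
import random
from math import dist


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base), key=lambda p: dist(base, p)
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(smote(minority, n_new=5))
```

Every synthetic point lies on the line segment between a real minority sample and one of its neighbours, so the new samples stay inside the region already occupied by the minority class.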
3.7. Filter Examples
The Filter Examples operator enabled us to remove unnecessary or anomalous rows from the dataset using specific criteria, ensuring the dataset’s relevance and quality. Filtering out these cases increased the quality and efficiency of our machine learning models, resulting in more precise and dependable phishing detection results.
3.8. Machine Learning Models
In machine learning model training, the most important step is to select the algorithms that perform best on a given dataset. Because our experiments involved a large dataset, we applied various machine learning classifiers, including KNN, DT, NB, NB kernel, RT, and RF. We designed a framework and then tested each classifier individually on the dataset. All the classifiers performed well and achieved high accuracy. Their results are given below.
3.8.1. Random Forest
Random forest is an ensemble learning method that uses many decision trees to increase prediction accuracy while avoiding overfitting. Integrating the output of many trees results in a more reliable and robust model. Random forest is known for its high accuracy, resistance to overfitting, and ability to manage large, complex datasets. When applied to our dataset, this method delivered an accuracy of 99.99%. The random forest algorithm predicts the target label by building a forest, or collection, of decision trees, each trained on a randomly selected subset of the data, and aggregating their outputs. The random forest classifier tends to cope well with uneven class distributions and handles big datasets with ease.
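The bagging-and-voting principle behind random forest can be illustrated with a deliberately reduced sketch. It replaces full decision trees with one-split trees (decision stumps), which is a simplification assumed for brevity, and it is not the RapidMiner implementation used in the experiments. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if row[f] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    n_sub = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature_ids = rng.sample(range(n_features), n_sub)
        forest.append(
            train_stump(
                [rows[i] for i in sample], [labels[i] for i in sample], feature_ids
            )
        )
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


rows = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.9, 0.8), (0.8, 0.9), (0.85, 0.7)]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict_forest(forest, (0.05, 0.05)), predict_forest(forest, (0.95, 0.95)))
```

Because each stump sees a different bootstrap sample and feature subset, an unlucky individual stump is outvoted by the rest of the ensemble.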
3.8.2. Naïve Bayes and Naïve Bayes Kernel
The NB classifier is a probabilistic classifier that applies Bayes’ theorem. It assumes feature independence and computes the likelihood of each class to make predictions. Naive Bayes is a fast and efficient approach for text classification. Classification is one of its common uses, particularly with high-dimensional data. Assuming the features are conditionally independent given the class, the method estimates the probability that a data point belongs to each class based on the likelihood of its feature values within that class. The formula for Bayes’ theorem is as follows:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

where $P(C \mid X)$ is the posterior probability of class $C$ given the features $X$, $P(X \mid C)$ is the likelihood, $P(C)$ is the class prior, and $P(X)$ is the evidence.
The naive Bayes kernel classifier is a variation that uses kernel density estimation to estimate the probability density function of each feature. This can improve accuracy when the features do not follow a simple parametric distribution such as a Gaussian. This approach can deal with more complicated data distributions.
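To make the use of Bayes’ theorem concrete, the sketch below implements the plain Gaussian form of naive Bayes rather than the kernel variant, which is an assumption made to keep the example short. It compares classes by log posterior, omitting the evidence term because it is the same for every class. The function names and toy data are hypothetical.

```python
from collections import defaultdict
from math import log, pi


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for every class."""
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    model = {}
    for y, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mu = sum(column) / len(column)
            # A small constant keeps the variance strictly positive.
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mu, var))
        model[y] = (prior, stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for the given row."""
    def log_posterior(y):
        prior, stats = model[y]
        total = log(prior)
        for value, (mu, var) in zip(row, stats):
            total += -0.5 * log(2 * pi * var) - (value - mu) ** 2 / (2 * var)
        return total

    return max(model, key=log_posterior)


rows = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, (1.1,)), predict_gaussian_nb(model, (4.9,)))
```

Summing the per-feature log likelihoods is exactly where the independence assumption enters: the joint likelihood is treated as a product of one-dimensional densities.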
3.8.3. KNN
The K-Nearest Neighbor (KNN) classifier is a non-parametric, instance-based learning technique that categorizes data points according to the classes of their nearest neighbors. It makes no assumptions regarding feature independence and is suitable for both classification and regression problems. KNN is attractive for text classification due to its simplicity and ease of implementation. It predicts the class or value of a new data item by taking the most frequently occurring class among its closest neighbors or by averaging their values. Using Euclidean distance, the KNN rule assigns the new item to the group of training points it lies closest to.
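The KNN rule for classification can be sketched directly from this description. The function name `knn_predict`, the value of `k`, and the toy points are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_rows, train_labels), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train_rows = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["legit", "legit", "legit", "phish", "phish", "phish"]
print(knn_predict(train_rows, train_labels, (0.5, 0.5)))
print(knn_predict(train_rows, train_labels, (5.5, 5.5)))
```

`math.dist` computes the Euclidean distance, so no model is fitted in advance; all of the work happens at prediction time.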
3.8.4. Decision Tree
A DT is a tree-structured classifier that separates data into smaller groups based on the values of input features, resulting in a sequence of decisions that lead to a final classification. Decision trees have the advantage of being readable, easy to grasp, and capable of handling both numerical and categorical data. A decision tree is a supervised machine learning approach that can handle both regression and classification. Before a decision tree is built, entropy is first determined; it measures the uncertainty of the dataset by examining the class labels.
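Entropy can be calculated using the following formula:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $S$ is the dataset, $c$ is the number of classes, and $p_i$ is the proportion of examples in $S$ that belong to class $i$.

After determining the entropy, Information Gain can be computed using the formula given below:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where $A$ is an attribute and $S_v$ is the subset of $S$ for which $A$ takes the value $v$.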
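Both quantities can be computed in a few lines. The sketch below assumes categorical attribute values and uses the standard base-2 definitions; the function names are introduced here for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


print(entropy([0, 0, 1, 1]))
print(information_gain([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(information_gain([0, 1, 0, 1], ["a", "a", "b", "b"]))
```

A decision tree learner evaluates the gain of each candidate attribute and splits on the one with the highest value: an attribute that separates the classes perfectly recovers the full entropy, while an uninformative one yields a gain of zero.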
3.8.5. Random Tree
Random tree is like random forest, but instead of merging many trees, it uses a single tree in which a randomly chosen subset of features is considered at each split. Random tree is simpler and faster than random forest.