MDPI - Publisher of Open Access Journals

30 pages, 12283 KB

Open Access Article

A Robust Ensemble Learning Approach to URL-Based Phishing Webpage Detection

by Abdellah Rezoug and Mohamed Bader-el-den

Big Data Cogn. Comput. 2026, 10(5), 136; https://doi.org/10.3390/bdcc10050136 - 27 Apr 2026

Viewed by 803

The proliferation of online fraud has resulted in substantial financial damage to individuals and organizations alike, with web phishing emerging as one of the most pervasive and harmful attack vectors. In response, this paper proposes the Stacking Ensemble Models Generator (SEMG), a URL-based phishing detection approach that leverages a multi-objective Genetic Algorithm to jointly optimize Precision and Recall in the selection and configuration of stacking ensemble models. An initial pool of base learners is trained on labeled datasets and subsequently evolved through genetic operators toward a globally optimal ensemble. Experimental evaluation across five datasets sourced from Mendeley and UCI repositories demonstrates that SEMG consistently surpasses individual base learners and compares favorably against existing methods, attaining 99.2% performance across all metrics on D2 while matching or exceeding state-of-the-art results on the remaining benchmarks. These outcomes underscore the framework’s robustness and its potential for deployment in real-world phishing detection systems.

(This article belongs to the Section Data Mining and Machine Learning)
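
The abstract above describes jointly optimizing Precision and Recall with a multi-objective Genetic Algorithm. One common ingredient of such methods is Pareto dominance, which the following minimal sketch illustrates. It is not the SEMG implementation: the `Candidate` type, the ensemble names, and the scores are hypothetical values chosen only for illustration, and the genetic operators themselves are not modeled.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str         # hypothetical label for an ensemble configuration
    precision: float  # illustrative score, not a reported result
    recall: float     # illustrative score, not a reported result


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and better on one."""
    no_worse = a.precision >= b.precision and a.recall >= b.recall
    strictly_better = a.precision > b.precision or a.recall > b.recall
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up population of ensemble configurations.
population = [
    Candidate("ens_a", 0.95, 0.90),
    Candidate("ens_b", 0.92, 0.96),
    Candidate("ens_c", 0.91, 0.89),
    Candidate("ens_d", 0.95, 0.88),
]

front = pareto_front(population)
print([c.name for c in front])  # ['ens_a', 'ens_b']
```

A multi-objective search would typically keep candidates like `ens_a` and `ens_b`, since neither is better on both objectives, and discard the dominated ones.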

22 pages, 2547 KB

Open Access Article

Hybridizing Explainable AI (XAI) for Intelligent Feature Extraction in Phishing Website Detection

by Rashed Alsakarnah, Mohammad Z. Masoud and Ahmad Ghababsheh

Electronics 2026, 15(2), 350; https://doi.org/10.3390/electronics15020350 - 13 Jan 2026

Cited by 1 | Viewed by 2160

Abstract

This study proposes an explainability-driven feature selection framework for phishing website detection using a large-scale, heterogeneous dataset collected from four independent sources. The combined dataset contains approximately 500,000 samples, including 300,000 phishing pages and 200,000 legitimate pages, providing a comprehensive representation of real-world web traffic. To enhance model interpretability and reduce feature redundancy, four explainable artificial intelligence (XAI) techniques, namely SHAP, LIME, partial dependence plots (PDPs), and permutation importance (PDI), were applied to rank and analyze feature contributions. The union of all selected features was subsequently refined through a thresholding mechanism, forming the proposed Hybrid Explainability Random Forest Algorithm (HXRF). A Random Forest (RF) classifier was trained using the optimized feature subset and evaluated on an independently sampled set of 2000 webpages. Results demonstrate that HXRF significantly improves classification performance, achieving an accuracy of 98.2%, with balanced precision, recall, and F1 scores. The confusion matrix confirms strong generalization across both phishing and legitimate classes, with minimal false predictions. This work demonstrates that combining multi-method XAI with selective feature filtering produces a compact, interpretable, and highly discriminative feature set capable of robust phishing detection at scale.

(This article belongs to the Section Artificial Intelligence)
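
The abstract above mentions taking the union of features selected by several XAI methods and refining it with a threshold. The sketch below shows one way such a step could look. The abstract does not specify the thresholding rule, so the top-k selection, the max-normalized mean score, the feature names, and all numbers here are assumptions for illustration, not the HXRF procedure.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k features with the highest importance under one method."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def hybrid_select(
    rankings: Dict[str, Dict[str, float]], k: int, threshold: float
) -> List[str]:
    """Union the per-method top-k features, then keep those whose mean
    max-normalized importance reaches the threshold (assumed rule)."""
    union: Set[str] = set()
    for scores in rankings.values():
        union |= top_k(scores, k)

    kept = []
    for feature in union:
        # Normalize each method's score by that method's largest score.
        normalized = [
            scores.get(feature, 0.0) / max(scores.values())
            for scores in rankings.values()
        ]
        if sum(normalized) / len(normalized) >= threshold:
            kept.append(feature)
    return sorted(kept)


# Hypothetical non-negative importance scores from three methods.
rankings = {
    "shap": {"url_length": 0.4, "num_dots": 0.2, "has_ip": 0.1, "uses_https": 0.05},
    "lime": {"url_length": 0.3, "num_dots": 0.05, "has_ip": 0.3, "uses_https": 0.1},
    "permutation": {"url_length": 0.5, "num_dots": 0.1, "has_ip": 0.25, "uses_https": 0.02},
}

selected = hybrid_select(rankings, k=2, threshold=0.5)
print(selected)  # ['has_ip', 'url_length']
```

Here `num_dots` enters the union through one method but falls below the threshold once the other methods are taken into account, which is the kind of redundancy filtering the abstract describes.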

18 pages, 2503 KB

Open Access Feature Paper Article

Reinforced Disentangled HTML Representation Learning with Hard-Sample Mining for Phishing Webpage Detection

by Jun-Ho Yoon, Seok-Jun Buu and Hae-Jung Kim

Electronics 2025, 14(6), 1080; https://doi.org/10.3390/electronics14061080 - 9 Mar 2025

Cited by 1 | Viewed by 2151

Abstract

Phishing webpage detection is critical in combating cyber threats, yet distinguishing between benign and phishing webpages remains challenging due to significant feature overlap in the representation space. This study introduces a reinforced Triplet Network to optimize disentangled representation learning tailored for phishing detection. By employing reinforcement learning, the method enhances the sampling of anchor, positive, and negative examples, addressing a core limitation of traditional Triplet Networks. The disentangled representations generated through this approach provide a clear separation between benign and phishing webpages, substantially improving detection accuracy. To achieve comprehensive modeling, the method integrates multimodal features from both URLs and HTML DOM Graph structures. The evaluation leverages a real-world dataset comprising over one million webpages, meticulously collected for diverse and representative phishing scenarios. Experimental results demonstrate a notable improvement, with the proposed method achieving a 6.7% gain in the F1 score over state-of-the-art approaches, highlighting its superior capability and the dataset’s critical role in robust performance.

(This article belongs to the Special Issue Digital Security and Privacy Protection: Trends and Applications, 2nd Edition)
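
The Triplet Network in the abstract above is trained on anchor, positive, and negative examples. As a reminder of what that objective usually looks like, here is a minimal sketch of a standard margin-based triplet loss on plain Python vectors. It assumes Euclidean distance and a margin of 1.0, and it does not model the reinforcement-learning sampler or the URL and DOM-graph encoders described in the paper.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Penalize triplets where the negative is not at least `margin`
    farther from the anchor than the positive is."""
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


anchor = (0.0, 0.0)
positive = (0.0, 1.0)        # same class as the anchor
easy_negative = (3.0, 4.0)   # far from the anchor
hard_negative = (1.0, 0.0)   # as close to the anchor as the positive

print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # 1.0
```

The easy negative contributes no loss, while the hard negative does. This is why the choice of which triplets to sample matters, and it is the step the abstract says is improved through reinforcement learning.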

21 pages, 3115 KB

Open Access Article

Phishing Webpage Detection via Multi-Modal Integration of HTML DOM Graphs and URL Features Based on Graph Convolutional and Transformer Networks

by Jun-Ho Yoon, Seok-Jun Buu and Hae-Jung Kim

Electronics 2024, 13(16), 3344; https://doi.org/10.3390/electronics13163344 - 22 Aug 2024

Cited by 17 | Viewed by 7031

Abstract

Detecting phishing webpages is a critical task in the field of cybersecurity, with significant implications for online safety and data protection. Traditional methods have primarily relied on analyzing URL features, which can be limited in capturing the full context of phishing attacks. In this study, we propose an innovative approach that integrates HTML DOM graph modeling with URL feature analysis using advanced deep learning techniques. The proposed method leverages Graph Convolutional Networks (GCNs) to model the structure of HTML DOM graphs, combined with Convolutional Neural Networks (CNNs) and Transformer Networks to capture the character and word sequence features of URLs, respectively. These multi-modal features are then integrated using a Transformer network, which is adept at selectively capturing the interdependencies and complementary relationships between different feature sets. We evaluated our approach on a real-world dataset comprising URL and HTML DOM graph data collected from 2012 to 2024. This dataset includes over 80 million nodes and edges, providing a robust foundation for testing. Our method demonstrated a significant improvement in performance, achieving a 7.03 percentage point increase in classification accuracy compared to state-of-the-art techniques. Additionally, we conducted ablation tests to further validate the effectiveness of individual features in our model. The results validate the efficacy of integrating HTML DOM structure and URL features using deep learning. Our framework significantly enhances phishing detection capabilities, providing a more accurate and comprehensive solution to identifying malicious webpages.

(This article belongs to the Special Issue Network Security and Cryptography Applications)

15 pages, 771 KB

Open Access Article

PhishTransformer: A Novel Approach to Detect Phishing Attacks Using URL Collection and Transformer

by Sultan Asiri, Yang Xiao and Tieshan Li

Electronics 2024, 13(1), 30; https://doi.org/10.3390/electronics13010030 - 20 Dec 2023

Cited by 37 | Viewed by 7266

Abstract

Phishing attacks are a major threat to online security, resulting in millions of dollars in losses. These attacks constantly evolve, forcing the cyber security community to improve detection systems. One major problem with current detection systems is that they cannot detect new phishing attacks, such as Browser in the Browser (BiTB) and malvertising attacks. These attacks hide behind legitimate Uniform Resource Locators (URLs) and can evade detection systems that only analyze a web page URL without exploring the page content. To address this problem, we propose PhishTransformer, a deep-learning model that can detect phishing attacks by analyzing URLs and page content. We propose only using URLs embedded within a webpage, such as hyperlinks and JFrames, to train PhishTransformer. This helps reduce the number of features that need to be extracted from the page content, which makes training the model more efficient. PhishTransformer combines convolutional neural networks and transformer encoders to extract features from website URLs and page content. These features are then used to train a classifier that can distinguish between phishing attacks and legitimate websites. We tested PhishTransformer on a dataset of 10,000 URLs. Our results show that PhishTransformer can achieve an F1-score of 99%, precision of 99%, and recall of 99%. This result suggests that PhishTransformer is a promising new approach to phishing detection.

(This article belongs to the Special Issue Advancements in Cross-Disciplinary AI: Theory and Application—2nd Edition)
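
The abstract above proposes using only the URLs embedded in a page as model input. The following sketch shows how such URLs could be collected with Python's standard `html.parser` module. It is an illustration rather than the PhishTransformer pipeline: the class name `EmbeddedURLCollector` is hypothetical, and reading the abstract's "JFrames" as HTML `iframe` elements is an assumption.

```python
from html.parser import HTMLParser
from typing import List


class EmbeddedURLCollector(HTMLParser):
    """Collect URLs from <a href=...> and <iframe src=...> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.urls.append(attr_map["href"])
        elif tag == "iframe" and attr_map.get("src"):
            # Assumption: embedded frames correspond to iframe elements.
            self.urls.append(attr_map["src"])


def collect_urls(html: str) -> List[str]:
    """Return embedded URLs in document order."""
    parser = EmbeddedURLCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


page = (
    '<a href="https://example.com/login">Sign in</a>'
    '<iframe src="https://example.org/frame"></iframe>'
    "<a>no link here</a>"
)
print(collect_urls(page))
# ['https://example.com/login', 'https://example.org/frame']
```

Restricting the input to these URLs keeps feature extraction cheap compared with parsing the full page content, which matches the efficiency argument in the abstract.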

15 pages, 2625 KB

Open Access Article

A Heterogeneous Machine Learning Ensemble Framework for Malicious Webpage Detection

by Sam-Shin Shin, Seung-Goo Ji and Sung-Sam Hong

Appl. Sci. 2022, 12(23), 12070; https://doi.org/10.3390/app122312070 - 25 Nov 2022

Cited by 16 | Viewed by 3802

Abstract

The growing dependence on digital systems has heightened the risks posed by cybersecurity threats. This paper proposes a new method for detecting malicious webpages among several adversary activities. As shown in previous studies, malicious URL detection performance is significantly affected by the features of the training dataset. The overall performance of different machine learning models varies depending on the data features, and relying on a single model is not always desirable in a given environment. To address these limitations, we propose an ensemble approach using different machine learning models. Our proposed method outperforms the existing single model by 6%, allowing for the detection of an additional 141 malicious URLs. In this study, repetitive tasks are automated, improving the performance of different machine learning models. In addition, the proposed framework builds an advanced feature set based on URL and web content and includes the most optimized detection model structure. The proposed technology can contribute to research on automated detection of malicious websites, such as phishing websites and sites that distribute malicious code.

(This article belongs to the Special Issue AI for Cybersecurity)

18 pages, 5519 KB

Open Access Article

Homoglyph Attack Detection Model Using Machine Learning and Hash Function

by Abdullah M. Almuhaideb, Nida Aslam, Almaha Alabdullatif, Sarah Altamimi, Shooq Alothman, Amnah Alhussain, Waad Aldosari, Shikah J. Alsunaidi and Khalid A. Alissa

J. Sens. Actuator Netw. 2022, 11(3), 54; https://doi.org/10.3390/jsan11030054 - 16 Sep 2022

Cited by 19 | Viewed by 11932

Abstract

Phishing is still a major security threat in cyberspace. In phishing, attackers steal critical information from victims by presenting a spoofed site that appears to be a visual clone of a legitimate site. Several Unicode characters are visually identical to ASCII characters; such look-alike characters are generally known as homoglyphs. Malicious adversaries use homoglyphs in URLs and DNS domains to target organizations. To reduce the risks caused by phishing attacks, effective ways of detecting phishing websites are urgently required. This paper proposes a homoglyph attack detection model that combines a hash function and machine learning. The approach has two phases. In the development phase, the machine learning model was trained. In the deployment phase, the model was deployed with a Java interface and tested through actual user interaction. The results are more accurate when the URL is hashed, as even small changes to the URL can be recognized. The homoglyph detector can be developed as stand-alone software that serves as the initial step in requesting a webpage, enhancing browser security and protecting websites from phishing attempts. To verify its effectiveness, we compared the proposed model with existing phishing detection methods on several criteria. By using the hash function, the proposed security features increase the overall security of homoglyph attack detection in terms of accuracy, integrity, and availability. The experimental results showed that the model can detect phishing sites with an accuracy of 99.8% using Random Forest, and that the hash function improves the accuracy of homoglyph attack detection.

(This article belongs to the Special Issue Feature Papers in Network Security and Privacy)
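
The abstract above argues that hashing a URL makes even small, visually invisible changes detectable. The sketch below illustrates that idea with Python's standard `hashlib` module. The abstract does not name the hash function, so the choice of SHA-256, the example URLs, and the helper names are all assumptions made for illustration, and the machine learning stage is not modeled.

```python
import hashlib


def url_digest(url: str) -> str:
    """Hash the UTF-8 bytes of a URL (SHA-256 is an assumed choice)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def has_non_ascii(url: str) -> bool:
    """Flag URLs containing characters outside the ASCII range."""
    return any(ord(ch) > 127 for ch in url)


legit_url = "https://apple.com"
# Same appearance, but the first letter is Cyrillic U+0430, not Latin "a".
spoofed_url = "https://\u0430pple.com"

# Hypothetical allowlist that stores digests of trusted URLs.
trusted_digests = {url_digest(legit_url)}


def is_trusted(url: str) -> bool:
    """A URL is trusted only if its digest matches a stored one exactly."""
    return url_digest(url) in trusted_digests


print(is_trusted(legit_url))       # True
print(is_trusted(spoofed_url))     # False
print(has_non_ascii(spoofed_url))  # True
```

Because the two strings differ at the byte level, their digests differ even though the rendered URLs may look identical to a user.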

16 pages, 5147 KB

Open Access Article

Evolutionary Algorithm with Deep Auto Encoder Network Based Website Phishing Detection and Classification

by Hamed Alqahtani, Saud S. Alotaibi, Fatma S. Alrayes, Isra Al-Turaiki, Khalid A. Alissa, Amira Sayed A. Aziz, Mohammed Maray and Mesfer Al Duhayyim

Appl. Sci. 2022, 12(15), 7441; https://doi.org/10.3390/app12157441 - 25 Jul 2022

Cited by 11 | Viewed by 3692

Abstract

Website phishing is a cyberattack that targets online users to steal their sensitive data, including login credentials and banking details. Phishing websites appear very similar to their legitimate counterparts in order to attract a large number of Internet users. The attacker fools the user by presenting the masked webpage as legitimate or reliable in order to retrieve important information from the user. Presently, anti-phishing approaches require experts to extract phishing site features and rely on third-party services for phishing website detection. These techniques have drawbacks, as relying on experts to extract phishing features is time-consuming. Many solutions to phishing website attacks have been presented, such as blacklist or whitelist, heuristic, and machine learning (ML) based approaches, which face difficulty in achieving effective recognition performance due to the continual improvement of phishing technologies. Therefore, this study presents an optimal deep autoencoder network based website phishing detection and classification (ODAE-WPDC) model. The proposed ODAE-WPDC model applies input data pre-processing at the initial stage to remove missing values from the dataset. Then, feature extraction and artificial algae algorithm (AAA) based feature selection (FS) are applied. The DAE model performs classification on the selected features, and the parameters of the DAE are tuned using the invasive weed optimization (IWO) algorithm to achieve enhanced performance. The ODAE-WPDC technique was validated using the Phishing URL dataset from the Kaggle repository. The experimental findings confirm the better performance of the ODAE-WPDC model, with a maximum accuracy of 99.28%.

(This article belongs to the Special Issue Computational Methods for Medical and Cyber Security)

19 pages, 1261 KB

Open Access Article

HinPhish: An Effective Phishing Detection Approach Based on Heterogeneous Information Networks

by Bingyang Guo, Yunyi Zhang, Chengxi Xu, Fan Shi, Yuwei Li and Min Zhang

Appl. Sci. 2021, 11(20), 9733; https://doi.org/10.3390/app11209733 - 18 Oct 2021

Cited by 20 | Viewed by 4931

Abstract

Internet users have suffered from phishing attacks for a long time. Attackers deceive users through maliciously constructed phishing websites to steal sensitive information, such as bank account numbers, website usernames, and passwords. In recent years, many phishing detection solutions have been proposed, which mainly leverage whitelists or blacklists, website content, or side channel-based techniques. However, with the continuous improvement of phishing technology, current methods have difficulty in achieving effective detection. Hence, in this paper, we propose an effective phishing website detection approach, which we call HinPhish. HinPhish extracts various link relationships from webpages and uses domains and resource objects to construct a heterogeneous information network. HinPhish applies a modified algorithm to leverage the characteristics of different link types in order to calculate the phish-score of the target domain on the webpage. Moreover, HinPhish not only improves the accuracy of detection but can also increase the phishing cost for attackers. Extensive experimental results demonstrate that HinPhish can achieve an accuracy of 0.9856 and an F1-score of 0.9858.

(This article belongs to the Section Computing and Artificial Intelligence)

13 pages, 550 KB

Open Access Feature Paper Article

A Deep-Learning-Driven Light-Weight Phishing Detection Sensor

by Bo Wei, Rebeen Ali Hamad, Longzhi Yang, Xuan He, Hao Wang, Bin Gao and Wai Lok Woo

Sensors 2019, 19(19), 4258; https://doi.org/10.3390/s19194258 - 30 Sep 2019

Cited by 78 | Viewed by 9157

Abstract

This paper designs an accurate and low-cost phishing detection sensor by exploring deep learning techniques. Phishing is a very common social engineering technique. The attackers try to deceive online users by mimicking a uniform resource locator (URL) and a webpage. Traditionally, phishing detection is largely based on manual reports from users. Machine learning techniques have recently been introduced for phishing detection. With the recent rapid development of deep learning techniques, many deep-learning-based recognition methods have also been explored to improve classification performance. This paper proposes a light-weight deep learning algorithm to detect malicious URLs and enable a real-time and energy-saving phishing detection sensor. Experimental tests and comparisons have been conducted to verify the efficacy of the proposed method. According to the experiments, the true detection rate has been improved. This paper has also verified that the proposed method can run on an energy-saving embedded single-board computer in real time.

(This article belongs to the Special Issue Sensor Signal and Information Processing II)

Search Results (10)
