Synergistic Phishing Intrusion Detection: Integrating Behavioral and Structural Indicators with Hybrid Ensembles and XAI Validation

Nti, Isaac Kofi; Ozer, Murat; Li, Chengcheng

doi:10.3390/fi18010030

by Isaac Kofi Nti ^1,2,*, Murat Ozer ^1,2 and Chengcheng Li

¹ School of Information Technology, University of Cincinnati, Cincinnati, OH 45221, USA
² Information Technology and Analytics Center (ITAC), School of Information Technology, University of Cincinnati, Cincinnati, OH 45221, USA
* Author to whom correspondence should be addressed.

Future Internet 2026, 18(1), 30; https://doi.org/10.3390/fi18010030

Submission received: 23 November 2025 / Revised: 18 December 2025 / Accepted: 31 December 2025 / Published: 4 January 2026

(This article belongs to the Special Issue Anomaly and Intrusion Detection in Networks)

Abstract

Phishing websites continue to evolve in sophistication, making them increasingly difficult to distinguish from legitimate platforms and challenging the effectiveness of current detection systems. In this study, we investigate the role of subtle deceptive behavioral cues such as mouse-over effects, pop-up triggers, right-click restrictions, and hidden iframes in enhancing phishing detection beyond traditional structural and domain-based indicators. We propose a hierarchical hybrid detection framework that integrates dimensionality reduction through Principal Component Analysis (PCA), phishing campaign profiling using K Means clustering, and a stacked ensemble classifier for final prediction. Using a public phishing dataset, we evaluate multiple feature configurations to quantify the added value of behavioral indicators. The results demonstrate that behavioral indicators, while weak predictors in isolation, significantly improve performance when combined with conventional features, achieving a macro F1 score of 97 percent. Explainable AI analysis using SHAP confirms the contribution of specific behavioral characteristics to model decisions and reveals interpretable patterns in attacker manipulation strategies. This study shows that behavioral interactions leave measurable forensic signatures and provides evidence that combining structural, domain, and behavioral features offers a more comprehensive and reliable approach to phishing intrusion detection.

Keywords:

phishing detection; behavioral features; ensemble learning; explainable artificial intelligence (XAI); cybersecurity

1. Introduction

Phishing remains one of the most enduring and adaptable cyber threats facing organizations worldwide. The number of incidents keeps growing, with attackers sending out millions of phishing attempts every year [1,2]. Security measures have advanced, yet attackers keep developing new deception methods built on deceptive website elements [3,4,5,6,7,8]. Phishing sites are hard to identify because they closely replicate authentic websites, which makes them difficult to detect even for skilled users. Prior research has concentrated on standard phishing signs, including suspicious URLs, absent SSL certificates, and newly registered domains [9,10,11]. Behavioral indicators in the user interface, such as pop-ups, mouse-over effects, right-click blocking, and hidden iframes, have received less attention. It is important to clarify that behavioral interface cues such as mouse-over manipulation, right-click restriction, pop-up triggering, and hidden iframe usage are not expected to be consistently present across all phishing websites, nor are they inherently resistant to evasion. Current phishing attacks add, remove, or change specific behaviors depending on the attackers' objectives, the location of their targets, and the limitations of their tools. These indicators function as deception artifacts that signal departures from typical interface behavior, but they should not be seen as fixed phishing signatures. This research examines behavioral patterns as indicators that become more informative when studied alongside conventional structural and domain-based assessment methods.

According to Kavya and Sumathi [12] and Taha et al. [13], users' web-based interactions play a significant role in how they perceive websites and build trust in them. Current studies underrate user interface activities, even though, consistent with interpersonal deception theory [14], phishing sites exploit such behaviors, for example right-click blocking and mouse gesture hijacking, to simulate a legitimate experience. These deception methods manipulate user perceptions and make malicious content harder to detect, in contradiction of HCI principles that demand transparent and user-friendly interfaces. Theories of web trust [15,16] demonstrate that small interface problems can lead to trust breakdowns. Behavioral characteristics therefore matter for phishing detection, because attackers use them to strengthen the appearance of authenticity [17,18].

Recent research has begun to develop behavioral fingerprints that help identify phishing attacks. These features have been incorporated into machine learning classification tasks, but the resulting performance has been inconsistent [12,19,20]. Several studies [21,22,23,24] show that such features have limited predictive power when used independently, and research indicates that behavioral cues on their own achieve a success rate below 50%. Combining traditional indicators with behavioral features is expected to enhance phishing profiling and detection, but empirical evidence for this combined effect is still lacking. Clustering methods, together with large language models, enable researchers to group phishing websites by shared characteristics, which helps in understanding attack methods and infrastructure [25,26,27,28,29]. Most clustering models analyze domain and content features, and few studies investigate how behavioral elements interact with URL- and domain-based data. An additional behavioral data layer shows promise for producing better-defined clusters and improving phishing attack classification [29]. Despite this progress, a major challenge remains. Most current phishing detection systems depend on structural indicators such as URLs and domain names. These features are useful but limited, because phishing sites have learned to replicate authentic websites. The analysis of behavioral website features, including mouse-over effects, pop-ups, right-click blocking, and hidden iframes, remains underexplored, even though attackers use these features to manipulate trust and stay undetected.
Human–computer interaction and deception theories account for these behaviors, but researchers tend to treat them as minor nuisances rather than potential scam indicators. Early machine learning systems did not exploit behavioral indicators, either individually or in combination with established features. This lack of attention weakens detection performance as attackers continue to evolve their tactics. A complete solution requires testing how behavioral indicators improve phishing identification and campaign analysis within hybrid machine learning systems that combine multiple approaches with explanation capabilities. Explainable AI (XAI) through SHAP values becomes essential here, because complex hybrid models require evidence that behavioral features actually contribute to accurate classification.

To address this gap, we explore how behavioral features enhance phishing detection when combined with traditional factors, employing advanced hybrid ensemble and explainable machine learning techniques. This approach should be integrated with sophisticated machine learning methods, especially tiered hybrid models, to effectively analyze both low-level behavioral traces and high-level structural indicators. We focus on the following research questions (RQ):

RQ1: How do behavioral interface cues, such as mouse-over actions, pop-up events, hidden iframes, and right-click restrictions, interact with structural and domain-based indicators on phishing websites?
RQ2: To what extent do behavioral features improve the clustering and profiling of phishing campaigns compared to using only structural or technical attributes?
RQ3: How does the integration of behavioral features with traditional indicators influence the performance of phishing intrusion detection models, particularly in terms of classification accuracy and macro F1 score?
RQ4: How do the observed behavioral phishing indicators relate to established theories of deception, HCI, and trust formation?

This study advances phishing detection by emphasizing the often-overlooked importance of behavioral web features. We are developing hybrid detection systems that identify subtle signs of deception, such as mouse-over movements, unexpected pop-ups, and sites that block right-click menus. While most research has focused on website structure and domain features, we investigate whether incorporating behavioral indicators can enhance detection accuracy, especially when used alongside traditional signals. Using a publicly available phishing dataset from archive.ics.uci.edu, we apply both classification and clustering models. We aim to assess whether behavioral features improve model performance, how they relate to factors such as structure, domain, and traffic metrics, and their potential for real-time phishing detection. We also propose an ensemble-of-ensembles architecture that combines multiple base learners in voting and stacking configurations to enable layered decision-making, as discussed in the literature [21,29,30,31]. Furthermore, we utilize SHAP, an explainable AI (XAI) tool, to improve the transparency and interpretability of the models, as supported by [12,32,33]. This research adds to the growing body of work promoting more comprehensive phishing detection systems, moving away from reliance on single features toward multi-faceted analysis [6,12,17,20]. Our goal is to establish a data-driven foundation for understanding how behavioral fingerprints can complement existing detection heuristics and enhance the profiling of online phishing campaigns. The key contributions of this study are as follows:

We provide empirical evidence that subtle behavioral deception cues, including mouse-over effects, right-click blocking, hidden iframes, and pop-up triggers, significantly enhance phishing intrusion detection when combined with traditional structural indicators, addressing a gap in prior work.
We introduce a hierarchical hybrid architecture that integrates PCA-based dimensionality reduction, phishing campaign profiling through K Means clustering, and a stacked ensemble classifier specifically designed to capture both structural and behavioral intrusion signals.
We advance the use of SHAP-based Explainable AI from descriptive visualization to a formal validation mechanism, demonstrating the specific diagnostic contribution of behavioral features and improving transparency, interpretability, and forensic trustworthiness.
We establish a conceptual link between behavioral deception mechanisms, Human Computer Interaction principles, trust formation, and signal detection theory, showing how interface rule violations translate into measurable behavioral indicators that enhance intrusion detection and XAI-grounded security interpretation.

The remainder of this paper is organized as follows. Section 2 reviews the relevant literature, and Section 3 presents our study’s methodology. Section 4 presents the results and discussion, and Section 5 concludes the study.

2. Literature Review

2.1. What Is Phishing: A Brief Background

Cyber attackers use phishing to deceive people into disclosing their sensitive information [3,11]. The attackers use fake identities of trusted organizations to obtain passwords, banking information, and personal details from victims. Phishing techniques have undergone major development since their initial emergence as email-based fraud during the 1990s. The first phishing emails contained multiple grammatical mistakes, which made them easy to identify. Phishing attacks have developed into a significant global cybersecurity threat because they produce increasing numbers of sophisticated attacks. Security incidents result in an average financial loss of $150,000 for each attack, according to [34]. The majority of cloud platform attacks against Microsoft 365 occur through credential phishing methods [35]. The widespread adoption of the HTTPS protocol by phishing sites makes them more difficult to detect because they display legitimate-looking security indicators [35,36]. The effectiveness of phishing attacks depends on social engineering because it remains a primary method for cyber intrusions, which results in numerous data breaches [34,37]. The implementation of Artificial Intelligence (AI) technology has introduced major changes to the field because attackers now employ large language models (LLMs) to create authentic-looking phishing emails that mimic official company messages [13]. The attackers now use three different methods to conduct their attacks, which include website cloning, customized messages, and two-factor authentication evasion [38,39]. The security threat from phishing attacks has evolved because attackers now use websites, text messages, phone calls, and social media platforms to deceive victims, which requires advanced security systems that can adapt to new threats.

2.2. Factors and Techniques Used in Phishing Detection

Research on phishing detection has investigated multiple factors, including behavioral patterns, URL characteristics, domain information, and network traffic data. The literature [3,6,9,11,31,40,41] uses the classification shown in Figure 1. Behavioral features describe how a website interacts with users and include pop-up windows, disabled right-clicks, on-mouse-over effects, hidden iframes, and JavaScript manipulations [1,19,42]. These features may seem unimportant, but they help attackers hide their activities from users, and research shows that they help detect phishing attacks when used together with other warning signs [1,19,42]. URL-based detection analyzes URL structure through indicators such as URL length, IP address usage, suspicious keyword presence, and an excessive number of dots [10,30,43]. These indicators serve as a first line of defense because they can be evaluated quickly and measured precisely. Domain-based signals examine website metadata, namely registration date, activity duration, and SSL certificate expiration time. Because phishing domains are typically short-lived, systems flag new, unverified websites as potential threats [11,29]. Time-based traffic features capture how users and browsers interact with a website and include traffic statistics, session length, bounce rates, and page referral patterns. These metrics are generally not available immediately, which limits their use in real-time detection, but they support post-incident assessment and the development of machine learning models.

2.3. Related Works

Recent studies [44,45,46,47,48,49] use machine learning (ML) and deep learning (DL) methods to enhance phishing protection, which has led to better feature engineering, improved model designs, and a better understanding of system operation. Early approaches to phishing website detection relied on structural data, including URL patterns, domain metadata, and SSL certificate details. For example, the Phish-armour system developed by Dhanavanthini and Chakkravarthy [45] used recurrent neural networks to analyze URL text quickly and was shown to operate under limited resources, but it analyzed only lexical URL features. Similarly, Taha et al. [13] studied classical ML models and demonstrated that ensemble methods (random forests) outperform single classifiers on phishing detection tasks, reinforcing the effectiveness of ensemble-based approaches.

More recent studies combine deep learning with multimodal data representations to build better phishing detection systems. Adebowale et al. [44] used website images together with textual content and frame information, processed through convolutional neural networks, to achieve high detection accuracy at the cost of more complex computation. Other deep learning work has used CNN and LSTM architectures to study feature combinations at the lexical, character, and word levels [49]. Nonetheless, their reliance on computationally intensive models limits practical deployment in real-time applications that operate under limited resources. Elkouay et al. [50] developed URLGBM, which represents URLs as graph structures rather than simple text strings for phishing detection. However, refs. [49,50] overlook behavioral web-interface deception, even though human–computer interaction (HCI) principles [51] show that these features hinder site verification.

Beyond web-based phishing, researchers have extended detection techniques to related cybersecurity domains. Jo et al. [52] introduced a Vision Transformer (ViT)-based malware detection system that uses attention maps to identify malicious behavior in Android applications, illustrating the growing importance of attention mechanisms for security analytics. Likewise, Shaukat et al. [53] developed a multi-layered ML framework that used XGBoost classifiers to identify phishing websites that use advertisements as bait, and reported excellent accuracy. Moussavou Boussougou and Park [54] developed an attention-based 1D CNN-BiLSTM system for Korean voice phishing detection that united data-centric and model-centric approaches to handle restricted dataset sizes. These methods worked well for their intended purposes, yet they protect only against specific types of attacks and do not address the web phishing deception signals that appear on user interfaces.

In parallel, the lack of transparency in ML-based security systems has driven researchers to create explainable AI (XAI) methods which enhance user trust and system accountability. Research such as Calzarossa et al. [46] assessed different explanation techniques through their ability to handle complex systems and maintain stable performance. Similarly, Lim et al. [47] combined LIME with large language models to produce interpretable results which helped them detect phishing attacks. Research focused on trust development has proven that cybersecurity systems need explainable operations because users must maintain confidence to operate these systems effectively.

Despite these advances, existing research on phishing detection focuses mainly on structural and content-based indicators, with limited attention to the behavioral deception techniques embedded in web interfaces. Interpersonal deception theory [14] and human–computer interaction (HCI) principles [51] together indicate that phishing websites often use interface manipulations, such as disabling right-click functions, hijacking mouse-over actions, and hiding iframes, which make users less likely to verify a site’s authenticity. Web credibility theories [15,16] demonstrate how interface elements affect user trust, but researchers tend to ignore these behavioral indicators when developing automated detection systems.

Hence, this study distinguishes itself by systematically integrating behavioral and structural indicators within a hybrid phishing detection framework. Unlike prior work that focuses solely on detection accuracy or modality-specific solutions, this research combines unsupervised clustering for exploratory profiling with supervised ensemble classification for operational detection, while employing explainability methods to validate feature interactions and error behavior. Table 1 summarizes representative phishing detection studies and highlights the research gap addressed by this work.

3. Materials and Methods

The following section explains the research methods that this study implemented. The study begins by establishing the official definition of phishing website detection before presenting the complete hybrid system developed for phishing website identification.

3.1. Problem Definition

Let $W = \{w_{i} \mid w_{i} \in \text{websites},\ i \in \mathbb{Z}^{+}\}$ be the set of all labeled websites in the dataset, where each website $w_{i}$ is represented by a structured feature vector $x_{i} \in \mathbb{R}^{m}$ comprising behavioral indicators ($b$), URL-based ($u$), domain ($d$), webpage-based ($w$), and traffic and search engine ($t$) features. The full representation of a website is defined in Equation (1). Let the label space be $C = \{1, 0\}$, where $y_{i} = 1$ denotes a phishing (malicious) website and $y_{i} = 0$ indicates a legitimate (benign) website.

$$x_{i} = [\, b_{i} \,\|\, u_{i} \,\|\, d_{i} \,\|\, w_{i} \,\|\, t_{i} \,] \tag{1}$$

where $\|$ denotes vector concatenation and $m = m_{b} + m_{u} + m_{d} + m_{w} + m_{t}$ is the total number of features after integration. Given a labeled website $w_{i} \in W$, represented by $x_{i} \in \mathbb{R}^{m}$, the phishing detection task is to learn a decision function

$$f : \mathbb{R}^{m} \to C \quad \text{such that} \quad f(x_{i}) = \hat{y}_{i}$$

where $\hat{y}_{i}$ is the predicted class for $w_{i}$. The goal of learning is to minimize a loss function (see Equation (2)), which measures the prediction errors between phishing and legitimate websites and may also include regularization to improve generalization:

$$\min_{f \in \mathcal{F}} \mathcal{L}(f) = \frac{1}{n} \sum_{i = 1}^{n} l(f(x_{i}), y_{i}) \tag{2}$$

where $l$ is a binary classification loss function, $y_{i} \in C$ is the true label of website $w_{i}$, and $\hat{y}_{i} = f(x_{i})$ is the model’s predicted label.
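
As a worked illustration of Equation (2), the sketch below evaluates the empirical loss of a toy decision function under two common choices of the per-sample loss l: the 0-1 loss and the binary log loss. The paper does not state which loss is used, so both choices, the function names, the toy probabilities, and the 0.5 decision threshold are illustrative assumptions.

```python
import math

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

def log_loss(p_phish, y_true, eps=1e-12):
    """Binary cross-entropy for a predicted phishing probability."""
    p = min(max(p_phish, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def empirical_risk(losses):
    """Average per-sample loss, i.e. (1/n) * sum_i l(f(x_i), y_i)."""
    return sum(losses) / len(losses)

# Hypothetical model outputs (probability of phishing) and true labels in C = {1, 0}.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]

preds = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold
risk_01 = empirical_risk([zero_one_loss(p, y) for p, y in zip(preds, labels)])
risk_log = empirical_risk([log_loss(p, y) for p, y in zip(probs, labels)])
print(risk_01, round(risk_log, 4))
```

The 0-1 variant counts errors directly, while the log-loss variant also penalizes confident wrong probabilities, which is why gradient-based learners typically optimize a smooth surrogate of this kind.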

3.2. Study Framework

Figure 2 shows the data flow framework of the study. The framework proceeds in stages: dataset preprocessing, Exploratory Data Analysis (EDA), feature set construction, dimensionality reduction, unsupervised clustering, supervised classification, model evaluation, and model explainability. The following sections describe each stage in detail.

3.3. Data Preparation and Feature Categorization

The research uses the phishing website dataset available at archive.ics.uci.edu (https://archive.ics.uci.edu/dataset/327/phishing+websites) (accessed on 10 June 2025). The dataset includes 11,055 instances, each containing 30 features and a binary label indicating whether a website is phishing (malicious) or legitimate (benign): 1 and −1, respectively. The features are organized into five distinct categories (see Table 2). Data inspection revealed no missing values. Because the models required binary target values, we recoded the target variable to 1 for phishing and 0 for legitimate. We also corrected the misspelled feature name popUpWidnow to popUpWindow so that plotting and correlation analysis would work correctly. No min-max scaling was applied because the dataset attributes are already normalized; however, for PCA and K-Means we standardized the integrated feature matrix using z-score normalization (discussed in Section 3.4). We divided the prepared data into training and test sets at a 70:30 ratio. We then applied the Synthetic Minority Oversampling Technique (SMOTE) to the training set only, which reduced class imbalance to a distribution of 56% phishing and 44% legitimate instances; the test set was left untouched.
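
The preparation steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the label column name `Result`, the helper names, and the toy records are hypothetical, the split is a simple seeded shuffle (the text does not say whether stratification was used), and SMOTE is omitted because it would require an additional library and applies to the training portion only.

```python
import random

def clean_rows(rows, label_key="Result"):
    """Rename the misspelled column and recode labels from {1, -1} to {1, 0}."""
    cleaned = []
    for row in rows:
        row = dict(row)                          # avoid mutating the caller's data
        if "popUpWidnow" in row:                 # typo present in the raw column name
            row["popUpWindow"] = row.pop("popUpWidnow")
        # Encoding as stated in the text: 1 = phishing, -1 = legitimate.
        row[label_key] = 1 if row[label_key] == 1 else 0
        cleaned.append(row)
    return cleaned

def split_70_30(rows, seed=42):
    """Shuffle once with a fixed seed and cut at 70% for training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Ten toy records standing in for the 11,055 real instances.
raw = [{"popUpWidnow": (-1) ** i, "Result": 1 if i % 3 else -1} for i in range(10)]
train, test = split_70_30(clean_rows(raw))
print(len(train), len(test))
```

Any oversampling step would be applied to `train` after this split so that no synthetic samples leak into `test`.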

This paper uses behavioral features as supplementary indicators rather than as primary predictors, because these features behave inconsistently and overlap with normal web development techniques. We included them because deception theory and human–computer interaction principles indicate that phishing websites often use interface manipulations together with structural flaws, even though these manipulations do not provide enough discrimination on their own. The dataset was treated as a typical public benchmark and received no processing beyond the preprocessing steps described in Section 3.3.

3.4. Exploratory Data Analysis (EDA) Procedure

Exploratory Data Analysis (EDA) allowed us to identify patterns in the characteristics and distributions of the study data and to select the variables most relevant to the phishing detection model. Two visualization tools form the main components of our EDA. First, frequency bar charts were used to study the distribution of each feature. Second, a Pearson correlation heatmap was used to identify basic relationships between features, with the caveat that most variables are binary. All figures were created with the Matplotlib (3.9.2) and Seaborn (0.13.2) Python libraries.
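
To make the correlation step concrete, the sketch below computes the Pearson coefficients that a heatmap would display, using only the standard library. For two binary variables the Pearson coefficient coincides with the phi coefficient, which is why it remains usable on this kind of data. The toy columns and their values are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant column
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, the data behind a heatmap."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical binary indicators for six websites.
toy = {
    "popUpWindow": [1, 1, 0, 0, 1, 0],
    "RightClick": [1, 1, 0, 0, 0, 0],
    "label": [1, 1, 0, 0, 1, 0],
}
corr = correlation_matrix(toy)
for name, row in corr.items():
    print(name, [round(v, 2) for v in row.values()])
```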

Feature Integration and Clustering

The next phase of our dataflow pipeline integrates the feature sets and performs unsupervised clustering. The research goal at this point involved testing how the combination of behavioral signals with structural and security indicators would enhance phishing website identification in a basic feature space. We created a comprehensive feature representation by integrating selected attributes across the dataset. Let our study’s relabeled dataset $DS$ consist of $n$ website samples, each annotated with a label $y_{i} \in \{0, 1\}$, where 0 indicates a legitimate (benign) website and 1 indicates a phishing (malicious) website. The five different feature categories that describe websites are presented through Equations (3)–(7).

$$\text{Behavioral features: } b_{i} \in \mathbb{R}^{m_{b}} \tag{3}$$

$$\text{URL-based features: } u_{i} \in \mathbb{R}^{m_{u}} \tag{4}$$

$$\text{Domain features: } d_{i} \in \mathbb{R}^{m_{d}} \tag{5}$$

$$\text{Webpage-based features: } w_{i} \in \mathbb{R}^{m_{w}} \tag{6}$$

$$\text{Traffic and search engine features: } t_{i} \in \mathbb{R}^{m_{t}} \tag{7}$$

where $b_{i}$, $u_{i}$, $d_{i}$, $w_{i}$, and $t_{i}$ are vectors of features from a specific category for the $i$-th sample, and the dimensionality $m_{c}$, $c \in \{b, u, d, w, t\}$, corresponds to the number of features in that category. We define the full feature vector $x_{i}$ for the $i$-th sample as the concatenation of the five feature categories (see Equation (1)). The final feature matrix is defined in Equation (8). We normalized the final feature matrix with the z-score as defined in Equation (9) and produced a normalized feature matrix defined by Equation (10), which becomes the input to our machine learning models.

$$X = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} \in \mathbb{R}^{n \times m} \tag{8}$$

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_{j}}{\sigma_{j}} \tag{9}$$

where $x_{ij}$ is the value of feature $j$ for sample $i$, $\mu_{j}$ is the mean of feature $j$, and $\sigma_{j}$ is the standard deviation of feature $j$.

$$\hat{X} \in \mathbb{R}^{n \times m} \tag{10}$$
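
The integration and normalization in Equations (8)–(10) can be sketched with NumPy as follows. The block widths and values are toy assumptions rather than the dataset's actual columns, and the population standard deviation (ddof = 0) is also an assumption, chosen because it matches the default behavior of scikit-learn's StandardScaler; the paper does not state which convention was used.

```python
import numpy as np

# Toy category blocks for n = 4 websites (values and widths are illustrative).
b = np.array([[1, -1], [1, 1], [-1, -1], [-1, 1]])  # behavioral, m_b = 2
u = np.array([[1], [-1], [1], [0]])                  # URL-based, m_u = 1
d = np.array([[0], [1], [1], [-1]])                  # domain, m_d = 1
w = np.array([[1], [1], [-1], [-1]])                 # webpage-based, m_w = 1
t = np.array([[-1], [0], [1], [1]])                  # traffic and search, m_t = 1

# Row-wise concatenation of the five blocks gives the n x m feature matrix X.
X = np.hstack([b, u, d, w, t]).astype(float)

def zscore(X):
    """Column-wise z-score: subtract the feature mean, divide by its std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)                      # population std (assumed convention)
    sigma = np.where(sigma == 0, 1.0, sigma)   # constant columns map to 0, not NaN
    return (X - mu) / sigma

X_hat = zscore(X)
print(X_hat.shape)
```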

3.5. Detection Framework: Dimensionality Reduction, Clustering, and Classification Modelling

The study developed an efficient prediction system through experimental verification of different machine learning approaches, which included ensemble and hybrid methods. The following sections explain each method in detail.

3.5.1. Principal Component Analysis (PCA)

The method of principal component analysis transforms complicated data into simpler form by identifying essential axes which reveal major differences between data points. The method allows researchers to analyze complex data through basic evaluation methods, which preserve vital information. The study used PCA to decrease the number of variables in $\hat{X}$ while maintaining the maximum amount of data variation. The research achieved its goal through the following methodological approach.

1. Compute the covariance matrix (Equation (11)):

$$C = \frac{1}{n - 1} \hat{X}^{\top} \hat{X}, \quad C \in \mathbb{R}^{m \times m} \tag{11}$$

where $\hat{X}$ denotes the mean-centered feature matrix.

2. Compute eigenvectors and eigenvalues: find $\lambda_{j}$ and $v_{j}$ such that

$$C v_{j} = \lambda_{j} v_{j} \tag{12}$$

3. Project the data onto the top $k$ components: choose the top $k$ eigenvectors $\{v_{1}, \ldots, v_{k}\}$ that capture the most variance to form the projection matrix $V_{k} \in \mathbb{R}^{m \times k}$. Then project

$$Z = \hat{X} V_{k} \in \mathbb{R}^{n \times k} \tag{13}$$

where $Z$ is the PCA-transformed feature matrix and becomes the input for clustering.

Our implementation used n_components = 0.95 to retain 95% of the explained variance. This threshold was chosen as a trade-off between dimensionality reduction and information preservation. A sensitivity analysis with 90% and 99% variance thresholds was used to evaluate the stability of our results (Section 4).
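
The three PCA steps and the 95% variance rule can be sketched directly with NumPy. This is an illustrative re-implementation, not the code used in the study: the function name, the synthetic demo data, and the choice of the smallest k whose cumulative explained variance reaches the threshold are assumptions.

```python
import numpy as np

def pca_by_variance(X_hat, threshold=0.95):
    """Project onto the fewest components whose variance share reaches threshold."""
    X_hat = np.asarray(X_hat, dtype=float)
    Xc = X_hat - X_hat.mean(axis=0)              # ensure mean-centering
    n, m = Xc.shape
    C = Xc.T @ Xc / (n - 1)                      # covariance matrix, m x m
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, m)
    V_k = eigvecs[:, :k]                         # projection matrix, m x k
    Z = Xc @ V_k                                 # transformed data, n x k
    return Z, V_k, float(cumulative[k - 1])

# Synthetic demo: 10 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

Z, V_k, explained = pca_by_variance(X_demo, threshold=0.95)
print(Z.shape, round(explained, 3))
```

Re-running the function with `threshold=0.90` or `threshold=0.99` gives the kind of sensitivity check mentioned in the text.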

3.5.2. K-Means Clustering

K-means is an unsupervised algorithm that partitions data into K clusters by minimizing the distance between each data point and its assigned cluster centroid [11]. In this study, K-Means was used to partition the PCA output $Z$ into $K$ clusters $C_{1}, C_{2}, \ldots, C_{K}$ as defined in Equation (14), in order to explore latent structure in the data and identify potential groupings of phishing campaigns in an unsupervised manner.

$$\underset{C}{\arg\min} \sum_{k = 1}^{K} \sum_{z \in C_{k}} \| z - \mu_{k} \|^{2} \tag{14}$$

where $\mu_{k}$ is the centroid of cluster $C_{k}$ and $\| \cdot \|$ is the Euclidean norm in the reduced ℝ^k space. K was initially set to 4; however, the final number of clusters was set to 2 (n_clusters = 2) to align with the binary class structure (phishing vs. legitimate). A sensitivity analysis was conducted for K = 2, 3, and 4 to evaluate robustness (see Section 4.1).

K-Means was selected due to its computational efficiency, scalability, and suitability for numeric, structured feature spaces. It enables the discovery of natural groupings among phishing samples without prior labeling, helping profile attack patterns and complement the supervised classification stage. It is important to note that the unsupervised clustering analysis is conducted independently of the supervised classification pipeline. The model training process uses cluster assignments only for exploratory data analysis and post hoc interpretation of data structure but does not incorporate them as input features, labels, or constraints.
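
The objective in Equation (14) and the K = 2, 3, 4 sensitivity check can be illustrated with a small NumPy version of Lloyd's algorithm. This is a sketch, not the study's implementation: the random initialization, the iteration cap, the handling of empty clusters, and the two-blob demo data are all assumptions made for illustration.

```python
import numpy as np

def kmeans(Z, K=2, n_iter=100, seed=42):
    """Lloyd's algorithm; returns labels, centroids and the Eq. (14) objective."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=K, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        new = np.array([
            Z[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2[np.arange(len(Z)), labels].sum())  # within-cluster sum of squares
    return labels, centroids, inertia

# Synthetic stand-in for the PCA output Z: two loose groups in 2-D.
rng = np.random.default_rng(0)
Z_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

for K in (2, 3, 4):
    _, _, inertia = kmeans(Z_demo, K=K)
    print(K, round(inertia, 2))
```

As in the study's design, the cluster labels produced here would serve only for exploration and post hoc interpretation, not as inputs to the classifiers.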

3.5.3. Classification Approach

This research used multiple machine learning algorithms to build integrated classification systems that combine behavioral signals with structural data for phishing attack detection. We employed three ensemble algorithms, Random Forest, XGBoost, and Gradient Boosting, as individual classifiers. They were selected because they have demonstrated strong performance in previous phishing detection research and in other fields of study [31,46,55], and because of their proven effectiveness in handling high-dimensional, structured datasets and their ability to model non-linear interactions [56].

Random Forest (RF) is an ensemble method that builds multiple decision trees and aggregates their predictions to reduce overfitting and prediction variance. RF handles large datasets well and performs implicit feature selection [31,46]. Our Random Forest classifier used 100 decision trees (n_estimators = 100), with max_features = ‘sqrt’ and the Gini splitting criterion (criterion = ‘gini’).

XGBoost is a gradient boosting system that builds decision trees sequentially, with each new tree correcting the errors of the previous ones [55]. It combines weak prediction models into a strong predictor through gradient descent on a defined loss function, and it is fast and accurate on structured data. Our XGBoost classifier used 100 boosting rounds (n_estimators = 100), the ‘binary:logistic’ objective, and ‘logloss’ as the evaluation metric. The model ran without label encoding, and random_state was fixed at 42 for reproducibility.

Gradient Boosting likewise builds its predictive model from sequentially constructed decision trees. In this study, we used 100 estimators with a learning rate of 0.1 and a maximum depth of 3 to balance bias and variance.

A higher-level Ensemble-of-Ensembles (EoE) framework was developed to test whether combining different learning approaches would improve classification performance. The EoE strategy took two forms: a Voting-Based ensemble (VEoE) and a Stacked Ensemble (SEoE). The hard-voting VEoE combined the class predictions of four base learners, Random Forest, XGBoost, Gradient Boosting Classifier, and Logistic Regression, and used their vote to determine the final classification. Combining tree-based and linear methods was intended to reduce prediction bias.
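Hard voting over four base learners can be sketched in a few lines. The per-sample predictions below are made-up values for illustration, and the tie-breaking rule (the smallest class label wins a 2-2 split) is an assumption chosen to mirror the ascending-order tie-breaking that scikit-learn's `VotingClassifier` documents; the paper does not state how ties were resolved.

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over base-learner labels for one sample.

    Ties are resolved in favor of the smallest label (an assumption).
    """
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


# Hypothetical predictions of the four base learners on four samples
# (1 = phishing, 0 = legitimate).
base_predictions = {
    "RF":      [1, 0, 1, 1],
    "XGBoost": [1, 0, 0, 1],
    "GBC":     [1, 1, 0, 0],
    "LogReg":  [0, 0, 1, 0],
}

# zip(*...) regroups the per-model lists into one tuple of votes per sample.
final = [hard_vote(votes) for votes in zip(*base_predictions.values())]
print(final)
```

With an even number of voters, ties such as the last two samples are possible, so the tie-breaking rule affects the outcome and is worth fixing explicitly.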

For the stacked ensemble, we constructed two models that differ in their meta-classifier. Each operates in two stages: multiple classifiers run in the first stage, and a meta-classifier processes their output in the second stage.

Stacking Model A (Meta: Logistic Regression) used RF, XGBoost, Gradient Boosting, and Logistic Regression as level-0 models, with Logistic Regression (LogReg) as the level-1 meta-learner.
Stacking Model B (Meta: Support Vector Classifier) used RF, XGBoost, Gradient Boosting, and Logistic Regression as level-0 models, with SVC as the level-1 meta-learner. In both models, the output of the level-0 models served as input to the level-1 meta-classifier.

Base model performance was assessed first to inform how the models should be combined. The StackNet architectures are defined in Equations (15)–(20). Hyperparameters of the base and meta-learners were selected with a limited grid search using 5-fold cross-validation on the training dataset, together with targeted hyperparameter sensitivity tests. Three parameters were tuned: the number of estimators in the ensemble models, the Gradient Boosting learning rate, and the PCA explained-variance threshold. The 5-fold cross-validation served two purposes: reducing random sampling effects and producing dependable evaluation results. The performance metrics reported in Section 4 are averages across these folds.

D = \{(x_{i}, y_{i})\}_{i = 1}^{n}, \quad x_{i} \in \mathbb{R}^{d}, \quad y_{i} \in \{0, 1\}

(15)

where d is the number of input features. Let the first-level models be

M_{1}, M_{2}, M_{3}, M_{4} \in \{\mathrm{RF}, \mathrm{XGBoost}, \mathrm{GBC}, \mathrm{LogReg}\}

(16)

For each model $M_{k}$, make a prediction:

\hat{y}_{i}^{(k)} = M_{k}(x_{i}) \quad \text{or} \quad \hat{p}_{i}^{(k)} = M_{k}(x_{i}) \in [0, 1]

(17)

Build a meta-feature vector for each sample:

z_{i} = [\hat{p}_{i}^{(1)}, \hat{p}_{i}^{(2)}, \hat{p}_{i}^{(3)}, \hat{p}_{i}^{(4)}] \in \mathbb{R}^{4}

(18)

These meta-features form a new dataset:

Z = \{(z_{i}, y_{i})\}_{i = 1}^{n}

(19)

The second-level (meta) classifier $M_{\mathrm{meta}}$ learns

\hat{y}_{i} = M_{\mathrm{meta}}(z_{i})

(20)

In StackNet Model A, $M_{\mathrm{meta}}$ = Logistic Regression.
In StackNet Model B, $M_{\mathrm{meta}}$ = Support Vector Classifier.

Figure 3 and Figure 4 illustrate the schematic structure of our stacking Models A and B, respectively. Each base learner generates a prediction (P1–P4), which is used as input to a meta-classifier (logistic regression or support vector machine). The meta-classifier learns how to combine the base predictions into a final prediction (Pf). This two-level architecture enhances performance by leveraging the complementary strengths of different learners. The overall training cost of the proposed ensemble is dominated by the tree-based models, with complexity ranging from $O(n \cdot \log n)$ to $O(n \cdot d)$, while the SVC meta-learner introduces additional cost due to probability estimation ($O(n^{2} \cdot d)$). However, given the dataset’s moderate size and the use of optimized libraries, training and inference times remained efficient. A full breakdown is included in Appendix A.1 and Appendix A.2.

The supervised classification models operate directly on the engineered structural and behavioral feature set and do not rely on the outcomes of the unsupervised clustering stage. Keeping the two analyses separate preserves methodological rigor and avoids leaking cluster information into the classifiers.

3.6. Model Explainability with Shapley Additive exPlanations (SHAP)

Explainable AI (XAI) research has established model transparency as a crucial requirement for machine learning applications across domains. We used SHAP to make model decisions more transparent for phishing detection. SHAP takes a game-theoretic approach, assigning each feature an importance value that explains its contribution to a specific prediction [32]. The shap.Explainer class was used to produce both global and local explanations for our top models, XGBoost and RF. The TreeExplainer variant of SHAP is well suited to tree-based models because it is both precise and fast. The SHAP values showed which features had the greatest impact on predictions and whether each pushed a prediction toward or away from the phishing class. Explanations were generated on the test set, so they apply to previously unseen instances. The explanations indicate that the models combine high accuracy with behavior that aligns with expert knowledge, including in their use of behavioral features. The SHAP value for feature j, given model f and input x, is computed as shown in Equation (21).

φ_{j} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}(x) - f_{S}(x) \right]

(21)

where F is the set of all features, S is a subset of features excluding j, and $f_{S}(x)$ is the model’s prediction using only the features in S. The sum runs over all such subsets, so $φ_{j}$ is the weighted average marginal contribution of feature j.
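Equation (21) can be evaluated by brute force when the feature set is tiny. The sketch below does so for a hypothetical three-feature value function; the scores in `toy_value` are invented for illustration and do not come from the study's models. This exhaustive enumeration is exponential in the number of features, which is why practical tools such as TreeExplainer use tree-specific algorithms instead.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, features):
    """Exact Shapley values for `value(frozenset_of_features) -> float`."""
    n = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! from Equation (21).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi[j] = total
    return phi


def toy_value(subset):
    """Hypothetical model output when only `subset` of features is known."""
    score = 0.0
    if "SSLfinal_State" in subset:
        score += 0.5
    if "URL_of_Anchor" in subset:
        score += 0.3
    # Interaction: the behavioral cue only matters alongside the SSL signal.
    if {"SSLfinal_State", "popUpWindow"} <= subset:
        score += 0.2
    return score


features = ["SSLfinal_State", "URL_of_Anchor", "popUpWindow"]
phi = shapley_values(toy_value, features)
print({name: round(v, 3) for name, v in phi.items()})
```

The values sum to the difference between the full-feature output and the empty-set output (the efficiency property), and the interaction term is split evenly between the two features that create it.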

3.7. Model Evaluation

We evaluated the models with established metrics: precision, recall, and F1-score, reported together with per-class support. Precision (Equation (22)) is the fraction of sites flagged as phishing that are actually phishing; fewer false positives help maintain user trust and system usability. Recall (Equation (23)) is the fraction of genuine phishing sites that the model detects; failing to detect malicious sites leads to severe security risks. The F1-score (Equation (24)) is the harmonic mean of precision and recall, giving a single metric that suits imbalanced classes when false positives and false negatives carry equal cost. The scikit-learn classification report provided per-class and macro-averaged scores for all models. Table 3 lists the essential hyperparameters used during training. Beyond precision and F1-score, the evaluation also considered each model’s capacity to detect particular error types and its processing speed, because both affect how well phishing detection systems function in real-world deployment.

\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}

(22)

\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}

(23)

\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(24)
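As a small worked example of Equations (22)–(24), the sketch below computes per-class precision, recall, and F1-score from label lists and then averages the per-class F1-scores into the macro F1. The ten labels are made up for illustration (1 = phishing, 0 = legitimate), and the function names are hypothetical.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true))
    return sum(per_class_scores(y_true, y_pred, c)[2] for c in classes) / len(classes)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

phishing = per_class_scores(y_true, y_pred, positive=1)
legitimate = per_class_scores(y_true, y_pred, positive=0)
print(phishing, legitimate, macro_f1(y_true, y_pred))
```

Because the macro average weights both classes equally, a model that does well on one class and poorly on the other is penalized even when overall accuracy looks acceptable.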

3.8. Experimental Setup

The experiments ran on a laptop with Windows 11 Enterprise, an Intel Core i7 CPU (16 cores at 2.1 GHz), and 32 GB of RAM. The software environment comprised Python 3.12.7 with NumPy 1.26.4, Pandas 2.2.2, scikit-learn 1.5.1, XGBoost 3.0.0, LightGBM 4.6.0, SHAP 0.47.1, and imbalanced-learn 0.12.3. A fixed random seed of 42 was used throughout for reproducibility. Evaluation used stratified 5-fold cross-validation with fixed data splits that remained constant across models. Model optimization used a restricted search over pre-defined parameter values for each classifier (Appendix B.1 shows the complete hyperparameter search range), and the best model parameters were determined through a 5-fold grid search. The search configurations and cross-validation accuracy results for the stacking models are presented in Appendix B.2. The metrics in Section 4 are reported as mean values with standard-deviation error bars across folds, with a 30% test set used for validation. Table 3 presents the key model hyperparameters used in this study.

4. Results and Discussion

This section discusses the study’s findings and their implications, guided by the research questions. It summarizes the key results and explores important behavioral and technical details to better understand the broader impact of employing hybrid features and ensemble learning in phishing detection.

4.1. Exploratory Data Analysis (EDA) Findings

We first examined how the behavioral features (on_mouseover, RightClick, popUpWindow, and Iframe) co-occur and which of them are linked to phishing or legitimate websites. Figure 5 presents the corresponding count plots. These behaviors appeared at higher rates on phishing sites than on legitimate websites. In particular, phishing sites made extensive use of “on_mouseover” and “popup window” functions. Right-click blocking and “iframe” functions appeared infrequently throughout the dataset and were used irregularly by phishing samples. The observed patterns indicate that phishing sites use deceptive user interface actions to trick users into performing specific actions. The findings provide empirical support for the significance of deceptive interface behaviors in phishing detection, consistent with both interpersonal deception theory and usability-centered trust models.

The correlation heatmap in Figure 6 shows the relationships among behavioral data, URL information, and domain characteristics. The behavioral features correlate strongly and positively with one another, for example popUpWindow with on_mouseover (r = 0.73) and Iframe with popUpWindow (r = 0.63), which indicates that phishing websites implement multiple behavioral warning signs at the same time. In contrast, the behavioral features show no significant correlation with URL_Length, SSLfinal_State, or Domain_registration_length, and only minimal correlation with other structural indicators, such as popUpWindow with DNSRecord (r = 0.099); they therefore operate as independent signals. These results suggest that behavioral indicators help phishing detection systems when used together with other detection methods. By themselves they do not work as phishing detectors, but they often co-occur with structural modifications, which supports their role in deception detection.

4.2. Clustering Analysis Using PCA and K-Means

We applied Principal Component Analysis (PCA) for dimensionality reduction before K-Means clustering to discover hidden structural patterns in the data. The goal was to determine whether combining behavioral data with structural information would produce distinct groups through unsupervised methods. The analysis included a sensitivity test of how different PCA explained-variance thresholds (90%, 95%, 99%) and K values (2, 3, 4) affected clustering stability. As Table 4 shows, K = 3 at 90% variance produced the best clustering, with the highest Silhouette Score (0.3046) and the lowest Davies-Bouldin Index (1.5290). Performance decreased when K reached 4 because the groups became too similar or contained too much noise. The 2D PCA visualizations at different variance levels showed consistent separation patterns, which indicates that meaningful groups form when different features are combined.
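A sensitivity sweep of this kind can be sketched with scikit-learn as follows. The data are synthetic blobs (`make_blobs`) used purely as a stand-in, so the scores printed here have no relation to Table 4; the grid of variance thresholds and K values follows the text, while `n_init=10` and the standardization step are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the phishing dataset).
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

results = {}
for variance in (0.90, 0.95, 0.99):
    # A float n_components keeps the fewest components reaching that variance.
    Z = PCA(n_components=variance, svd_solver="full").fit_transform(X)
    for k in (2, 3, 4):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(Z)
        results[(variance, k)] = (
            Z.shape[1],                       # retained components
            silhouette_score(Z, labels),      # higher is better
            davies_bouldin_score(Z, labels),  # lower is better
        )

for (variance, k), (n_comp, sil, dbi) in sorted(results.items()):
    print(f"var={variance:.2f} K={k} comps={n_comp} sil={sil:.4f} dbi={dbi:.4f}")
```

Reading the two indices together, as the text does, guards against choosing a K that looks good on one criterion only.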

The results from unsupervised learning show promise, but behavioral signals need supervised refinement to achieve effective phishing detection. Although the best unsupervised clustering was obtained at 90% explained variance with K = 3 (see Appendix C), we selected 95% variance and K = 2 for the final configuration (see Figure 7) to align with the binary classification labels and to preserve additional data variability (23 components instead of 19). The selected configuration remained interpretable and stable, without significant deterioration of clustering performance, while classification accuracy remained high. It also keeps the unsupervised analysis within the established two-class scheme instead of creating new label categories.

PCA reduced the phishing website data while retaining 95% of its variance, and K-Means clustering then produced two clusters (k = 2). The clustering did not separate phishing websites from legitimate websites. The two clusters (Cluster 0 and Cluster 1) contained nearly equal proportions of phishing and legitimate websites, which suggests that they formed around fine-grained structural patterns rather than a clear phishing-legitimate distinction. Cluster 0 contained 83% of the samples (2752) and Cluster 1 contained 17% (565), so the websites fall into two groups with different patterns of operation. Our classifier performed differently on these two unsupervised clusters.

Cluster 0 (The Ambiguous Group). The majority of incorrect predictions originated from this large cluster. On it, the model achieved 96.8% accuracy but produced higher false-positive and false-negative rates (FPR: 4.24% and FNR: 2.42%). The cluster contains websites in which phishing and legitimate characteristics are strongly similar.

Cluster 1, the smaller cluster, proved simple to classify: the model achieved 98.8% accuracy with minimal false-positive and false-negative errors (1.23% and 1.25%). The websites in this group show distinct behavioral patterns. The data are therefore structured, but only Cluster 1 is easily distinguished. The classifier achieves high accuracy overall but needs supervised training to resolve the difficult-to-distinguish instances in Cluster 0. The clustering results serve as an explanatory tool for understanding why the supervised models made incorrect predictions; they do not affect the prediction process. Separating cluster analysis from classification allows qualitative analysis of ambiguous samples without affecting the system’s classification accuracy.

4.3. Performance of Base Models Using Only Website Behavioral Features

When trained only on the behavioral web features (on_mouseover, popUpWindow, Iframe, and RightClick), the base models, Random Forest, XGBoost, Gradient Boosting, and Logistic Regression, reached about 50% accuracy in phishing website detection, barely better than random chance. The macro F1-score, which represents balanced performance across all labels, reached 0. Recall for phishing sites was high (0.94 for Random Forest, XGBoost, and GBC, and 0.93 for Logistic Regression), but recall for legitimate sites was only 0.09, so the models labeled most sites as phishing. The results confirm previous research showing that behavioral features alone cannot effectively distinguish phishing websites from legitimate ones.

Behavioral features act as warning signals but do not carry enough information to separate phishing from legitimate websites accurately. This is consistent with deception and trust theories: users can identify suspicious behavior such as mouse-over redirects and disabled right-clicks, but these actions also occur on legitimate websites. Behavioral indicators therefore work best as supplementary warning signs combined with stronger structural, domain, and content-based indicators; used alone, they are insufficient for high-confidence classification. This finding motivates the integrated modeling approach of this study, in which ensemble classifiers combine behavioral features with traditional features. The hybrid approach achieves better results and addresses the limitations identified here.

4.4. Performance of Base Models with All Website Features

Figure 8 presents the performance of the baseline models trained exclusively on behavioral features. Figure 9 shows the performance of the baseline and voting ensemble models trained on the combined website features, which include behavior, structure, domain, and traffic information. The voting ensemble classifier was built to assess the advantages of model combination. As Figure 9 shows, all models perform better than with behavioral features alone (Figure 8). The Random Forest and XGBoost classifiers produced the best results, each with 97% prediction accuracy and high precision and recall (≥0.96) for both phishing and legitimate websites, which makes them suitable for real-world phishing site detection. The Gradient Boosting classifier achieved 94% accuracy, and Logistic Regression reached 91%. Logistic Regression performs worse than the other models because its linear decision boundary cannot capture the complex relationships between features. Figure 10 shows the confusion matrices for the base models trained with the full website feature set.

The voting ensemble achieved 96% overall accuracy, detecting 98% of legitimate sites and 98% of phishing samples, and thus predicts both classes equally well. Its macro F1-score of 0.96 reflects how model combination compensates for individual model weaknesses. Integrating different feature types, including behavioral, structural, and domain indicators, gives a fuller picture of phishing threats. The strong performance of the ensemble and tree-based models shows that modeling complex decision boundaries and feature relationships matters in critical tasks such as phishing detection. The experimental results suggest that ensemble-based detection systems can be deployed in operational environments because they provide accurate results, explainable output, and balanced class performance.

Model Performance Using Only Structural Features

To isolate the contribution of structural indicators, we trained the classifiers on URL-based and domain-related features only: ‘having_IP_Address’, ‘URL_Length’, ‘Shortening_Service’, ‘having_At_Symbol’, ‘double_slash_redirecting’, ‘Prefix_Suffix’, ‘having_Sub_Domain’, ‘SSLfinal_State’, ‘Domain_registeration_length’, ‘age_of_domain’, and ‘DNSRecord’. The Random Forest and XGBoost models achieved 91% accuracy with macro-averaged F1-scores of 0.91. The Gradient Boosting classifier reached 90% accuracy and an F1-score of 0.90, with a preference for identifying legitimate websites. Logistic Regression performed worst, with 88% accuracy and a macro F1-score of 0.88, because its linear model cannot capture complex feature relationships.

Table 5 summarizes the performance of the baseline classifiers using behavioral, structural, and combined website features for phishing detection. The models trained on structural and behavioral features together performed best, with 97% accuracy and macro F1-score. The results confirm that behavioral features are ineffective on their own but gain importance when integrated with structural features.

4.5. Performance of Stacked Ensemble Models

Figure 11 shows the performance of our stacking models. The two models share identical base learner settings and differ only in their meta-classifier: Model A uses logistic regression, and Model B uses a support vector classifier (SVC). Both achieved 97% accuracy with identical macro and weighted F1-scores (0.97). Precision and recall were balanced across classes, which kept both false-positive and false-negative errors low. Recall for legitimate websites (class 0) reached 0.98, while precision and recall for phishing websites (class 1) exceeded 0.97. That the two stacking models produced identical results indicates that both meta-classifiers were equally effective at turning the base learner predictions into a final decision. The results also show that different indicators, when combined, yield stable and reliable features for analysis.

The stacked ensemble was also evaluated with threshold-independent metrics, in addition to accuracy and F1-score. Figure 12 shows the receiver operating characteristic (ROC) curve and the precision–recall (PR) curve for Stacked Model A. The area under the ROC curve (AUC) is 0.9971, which indicates excellent discrimination across all decision thresholds. The PR curve yields an average precision (AP) of 0.9977, showing that the model sustains precision at high recall for phishing identification, even with unbalanced classes.

False-positive and false-negative behavior was also examined explicitly for the stacked ensemble at the selected operating threshold. Stacking Model A produced 55 false positives and 41 false negatives, corresponding to a false-positive rate of 3.74% and a false-negative rate of 2.22%. The ensemble thus detects phishing websites reliably (recall = 97.78%) while rarely flagging legitimate websites (specificity = 96.26%). Together with the curves in Figure 12, these results indicate that our stacked models sustain high performance at 97% accuracy, supporting the role of signal detection theory [57] in elevating weak behavioral indicators when combined with strong structural features.
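The reported rates can be checked with a few lines of arithmetic. The error counts (55 and 41) are from the text; the class sizes of 1471 legitimate and 1846 phishing test samples are not stated in this section and are back-calculated here, as an assumption, from the reported percentages and the 3317-sample test set.

```python
def error_rates(fp, fn, n_negative, n_positive):
    """Rates derived from error counts and the size of each true class."""
    tn = n_negative - fp  # legitimate sites correctly passed
    tp = n_positive - fn  # phishing sites correctly flagged
    return {
        "fpr": fp / n_negative,
        "fnr": fn / n_positive,
        "recall": tp / n_positive,
        "specificity": tn / n_negative,
        "accuracy": (tp + tn) / (n_negative + n_positive),
    }


# Class sizes are inferred, not reported: 1471 + 1846 = 3317 test samples.
rates = error_rates(fp=55, fn=41, n_negative=1471, n_positive=1846)
for name, value in rates.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under these assumed class sizes the four reported percentages are reproduced, and the implied accuracy is consistent with the rounded 97% figure.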

The results show that a properly designed stacked ensemble performs as well as or better than individual classifiers and is better able to handle both feature noise and model bias. Offering both a linear logistic regression and a kernel-based SVC meta-learner enhances the architecture’s ability to perform well in various operational environments. Table 6 presents the training and inference times for our stacking models. Model A needed 9.87 s to train and 0.11 s to test, while Model B needed 12.15 s for training and 0.31 s for testing. These times make the models suitable for near-real-time deployment in phishing detection systems, subject to hardware and optimization.

4.6. Feature Importance and Explainability Analysis

We studied feature effects with two methods, Gini-based feature importance from Random Forest and SHAP values, with the aim of explaining the models’ decision-making. Figure 13 shows the Random Forest feature importance plot. The three most important features were SSLfinal_State, URL_of_Anchor, and web_traffic, with SSLfinal_State alone accounting for more than 30% of the total importance. The model also relied on subdomains, Links_in_tags, and prefix-suffix patterns, which capture typical phishing threats based on domain and content information. The analysis confirmed the earlier results: behavioral indicators have limited value on their own but provide supporting evidence that helps confirm phishing signals.

The SHAP summary plot in Figure 14 complements the global feature importance rankings by showing the direction and magnitude of each feature’s influence across all samples. The beeswarm plot shows how different values of URL_of_Anchor, SSLfinal_State, and other key features shift predictions between the phishing and legitimate classes, while the SHAP force plot in Figure 15 shows how particular features affect the prediction for an individual instance. According to Figure 14, URL_of_Anchor, SSLfinal_State, and web_traffic have the highest and most consistent impact on model predictions. The model predicts phishing when URL_of_Anchor values are high, reflecting many external links, whereas valid SSLfinal_State values lead to a legitimate classification. IP address presence and prefix-suffix patterns also contribute to phishing detection, with an impact that depends on their values. The high F1-scores (Table 5) are thus achieved with meaningful structural and behavioral indicators that match domain expertise.

The SHAP force plot in Figure 15 complements the global interpretability shown in Figure 14 by showing how individual features influence a single XGBoost prediction. The plot shows both the strength and the direction of each feature’s push on the model output for this instance. The output is a strong non-phishing prediction, since the cumulative SHAP value shifts it well below the baseline. Prefix_Suffix, URL_of_Anchor, SSLfinal_State, Links_in_tags, and web_traffic pushed the prediction toward the legitimate class (blue bars), with Prefix_Suffix showing the highest individual impact. Links_pointing_to_page and having_Sub_Domain pushed the prediction toward phishing, but the legitimate indicators outweighed them. The model thus bases its decisions on multiple interacting features rather than on a single dominant variable, which supports the trustworthiness of its predictions.

The SHAP analysis showed that traditional structural and domain-based features rank highest globally; that is, they are effective discriminators for phishing detection. Behavioral features are not the leading predictors of phishing, but they remain useful. They act as supplementary evidence that supports local explanations and helps experts interpret ambiguous cases in which structural indicators fail to provide reliable information. Within the hybrid framework, behavioral indicators therefore improve decision accuracy by adding information without reducing the contribution of the core features.

4.7. Models Statistical Significance Testing Using McNemar’s Test

Figure 16 illustrates the results of McNemar’s test for the Random Forest (RF) and Voting Ensemble (VE) models, which compares the two models’ error patterns on the same test samples. Both models were correct on 3181 of the 3317 test samples, and both misclassified 70. They disagreed on 66 samples: RF was correct and VE wrong in 41 cases, and VE was correct and RF wrong in 25. The McNemar chi-square statistic was χ² = 3.41 with a p-value of 0.0648, which exceeds 0.05 and thus indicates no statistically significant difference between their error patterns. This is consistent with the similar accuracy and F1-scores reported earlier. Each model does better on some samples, but neither demonstrates superior performance on the difficult cases; their disagreements are few and statistically insignificant.
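The test statistic can be reproduced from the two discordant counts alone. The sketch below assumes the continuity-corrected form of McNemar's test, which the text does not name explicitly but which matches the reported values; the p-value uses the identity that the chi-square survival function with one degree of freedom equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt


def mcnemar(b, c, correction=True):
    """McNemar chi-square test on discordant counts b and c (1 dof).

    b: samples model 1 classified correctly and model 2 did not
    c: samples model 2 classified correctly and model 1 did not
    """
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)  # continuity correction (assumed here)
    chi2 = diff ** 2 / (b + c)
    p_value = erfc(sqrt(chi2 / 2))  # chi-square survival function, 1 dof
    return chi2, p_value


# 41 samples where only RF was correct, 25 where only VE was correct.
chi2, p_value = mcnemar(41, 25)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The concordant counts (3181 and 70) do not enter the statistic; only the 66 disagreements carry information about which model errs more.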

4.8. Comparative Analysis with Related Studies

Table 7 compares our study’s results with existing literature on phishing detection. Studies [13,58,59,60] used the same dataset source as our study, whereas [44,45,48,49] used different datasets; results from the latter are listed for reference only and are not directly compared. Our model, which combines behavioral indicators with structural website features, achieves a 97% F1-score and accuracy. This is in line with previous findings: Mohammad et al. [58] achieved 96.07% accuracy with functional tree detection, and Omari [60] reached a 96% F1-score with deep neural networks. Our method differs mainly in two respects: the integration of behavioral features and SHAP-based explainability. Combining structural and behavioral features makes model decisions easier to interpret while yielding phishing detection that remains effective, transparent, and adaptable.

Table 7 also shows that phishing detection studies differ in their analytical methods and operational capabilities. Earlier approaches achieve high accuracy with structural and content-based features, yet they operate as black-box classifiers that ignore behavioral deception indicators and offer no model interpretability. The proposed framework combines behavioral interface indicators with structural features and uses SHAP-based explainability to examine feature contributions, error behavior, and decision confidence. Our study therefore goes beyond performance evaluation by emphasizing transparency, interpretability, and readiness for real-world use, aspects that existing phishing detection systems have not explicitly addressed.

4.9. Theoretical Implications

This research combined behavioral web indicators with deception detection principles, human–computer interaction (HCI) concepts, trust models, and signal detection theory. We evaluated whether interface elements such as pop-ups, mouse-over links, and disabled right-click functions, together with structural signals and areas of user focus, improve phishing detection. Drawing on interpersonal deception theory [14], we argue that phishing sites use multiple behavioral layers to conceal their actual intent. The findings show that these indicators raise detection rates when combined with strong signals, yet they do not produce reliable results on their own (Section 4.3). This suggests that deception manifests through small, context-dependent behaviors. These manipulations violate established HCI heuristics on which users rely for interface navigation [51], and the results are consistent with evidence that users notice unusual design elements that affect their trust in an interface [15,16]. The feature importance rankings in Figure 13 and the effects on local predictions in Figure 15 suggest a relationship between machine-derived signals and user trust evaluations. The ensemble models achieved 97% accuracy, which is consistent with signal detection theory [57]: weak behavioral indicators become valuable diagnostic evidence when combined with strong structural signals. By converting interface rule violations into measurable features, the study establishes a basic connection between HCI principles and machine learning and contributes to current efforts to merge usability science with explainable AI [33].

The experimental results showed that behavioral indicators alone do not effectively distinguish phishing sites from authentic websites, performing only slightly better than random guessing. This outcome is theoretically consistent with the nature of interface-based deception cues, which are context-dependent, easily bypassed, and occasionally present in legitimate websites. Nevertheless, combining behavioral signals with structural and domain-based features leads to more stable classification when the data are uncertain. From a signal detection perspective, these indicators act as supporting evidence that sharpens decisions when the primary evidence is robust. Hence, their contribution lies not in standalone detection capability but in their ability to corroborate and contextualize other phishing indicators within a hybrid ensemble.

5. Limitations, Conclusions, and Future Work

5.1. Limitations and Conclusions

This research investigated which behavioral website elements contribute to phishing identification and how they relate to established detection signals, including URL structure, domain characteristics, and traffic metrics. Our findings show that phishing websites employ deceptive user interface behaviors, namely mouse-over actions, pop-up events, hidden iframes, and right-click restrictions, which complement traditional structural indicators (RQ1). The unsupervised clustering analysis demonstrates that combining behavioral data with structural elements yields more distinct and meaningful campaign groupings than technical indicators alone (RQ2). The supervised modeling results show that behavioral features improve classification accuracy when combined with structural indicators (97% for the hybrid stacked ensemble) but do not enhance results when used alone (RQ3). The theoretical assessment shows that interface manipulations generate specific warning indicators that align with deception theory, HCI heuristics, and signal detection principles, supporting the development of explainable intrusion detection systems (RQ4). Despite these positive findings, several limitations remain. The training dataset provides limited coverage of modern phishing methods and of websites in different languages.

To remain effective in real-time operation, the system needs periodic model updates, such as retraining or fine-tuning. The models were evaluated in laboratory settings rather than operational environments, and we did not analyze the memory footprint of the ensemble system. Future detection systems need to counter the evolving evasion techniques of adaptive attackers, for example through adversarial training to improve robustness. The system also raises privacy concerns because it monitors user interactions, including mouse-over events and right-click behavior. Deployments should therefore obtain explicit user consent for data collection and protect privacy through full data anonymization and restricted access to information. Organizations can pair strong security with ethical practice by building privacy protections that themselves meet high security standards. This paper does not explicitly evaluate adversarial robustness against adaptive phishing strategies, which would require longitudinal or adversarially generated datasets; this remains an important direction for future work.

5.2. Future Work

The results are promising, and future studies will use dynamic data streams, including mouse movements, hover times, and scrolling behavior, to identify user deception patterns in real time. Follow-up research will test model scalability through performance assessments on different datasets and real-time feeds, including newly discovered phishing attacks. The clustering models should improve when domain registration information, threat data, and hosting location details are incorporated to detect the reuse of phishing kits and connections between attacker networks. We will also build hybrid models that use user feedback data, including hesitation clicks and cursor movement patterns, to develop individualized detection systems that minimize false alerts in adaptive environments. Finally, we will measure the runtime performance and resource usage of the hybrid architecture before deploying it as a browser or edge-device application.

The SHAP analysis provides useful results, but additional XAI methods, such as LIME and counterfactual explanations, should be investigated to further enhance model explainability and user confidence. Validation of the clustering results would be more effective with datasets that include threat intelligence annotations and campaign metadata. The current research uses centralized learning with structured features, whereas federated learning (FL) is a growing approach for secure distributed model training. Integrating FL with explainable phishing detection is a promising direction because it supports secure use of user data in decentralized systems. Overall, this research demonstrates that behavioral patterns play an important role in developing effective phishing detection systems. Combining behavioral indicators with structural data, ensemble techniques, and XAI produces cybersecurity systems that adapt better to changing conditions and maintain user trust.

Author Contributions

I.K.N.: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Writing—Original Draft, and Writing—Review and Editing. M.O.: Supervision and Writing—Review and Editing. C.L.: Validation and Writing—Review and Editing. All authors contributed to the analysis and interpretation of the data, reviewed the manuscript critically for important intellectual content, approved the final version to be published, and agreed to be accountable for all aspects of the work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data used in this study are publicly available and can be accessed at: https://archive.ics.uci.edu/dataset/327/phishing+websites (accessed on 10 June 2025).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Appendix A

Appendix A.1. Computational Steps for Stacked Ensemble Models A and B

Algorithm A1: Training Procedure for Stacked Ensemble Models A and B
Input:
	Training dataset D = {X_train, y_train}
	Base models: Random Forest (RF), XGBoost (XGB), Gradient Boosting (GB), Logistic Regression (LR)
	Meta-learner A: Logistic Regression
	Meta-learner B: Support Vector Classifier (SVC)
Output:
	Trained StackingClassifier_A and StackingClassifier_B
1. Initialize base models:
	RF ← RandomForestClassifier(n_estimators = 100)
	XGB ← XGBClassifier(n_estimators = 100)
	GB ← GradientBoostingClassifier(n_estimators = 100)
	LR ← LogisticRegression(max_iter = 1000)
2. Create Stacking Ensemble A:
	StackingClassifier_A ← Stack [RF, XGB, GB, LR] → Meta: LogisticRegression
3. Fit StackingClassifier_A on X_train, y_train
4. Create Stacking Ensemble B:
	StackingClassifier_B ← Stack [RF, XGB, GB, LR] → Meta: SVC (probability = True)
5. Fit StackingClassifier_B on X_train, y_train
6. Return trained StackingClassifier_A and StackingClassifier_B

Appendix A.2. Stacked Ensemble Models A and B Computational Complexity

Component	Model(s)	Complexity
Base Learners	RF	O(T·d·n log n)
	XGB, GB	O(T·d·n)
	LR	O(T·d·I)
Meta Learner (Model A)	Logistic Regression	O(n·I)
Meta Learner (Model B)	SVC (probability)	O(n²)
Model A is more efficient for large datasets due to the linear meta-learner. Model B might become computationally expensive at the meta-level if the dataset is large, due to the quadratic complexity of SVC with probability estimation.

Appendix B

Appendix B.1. Hyperparameter Search Space for Grid Search Optimization

Model	Hyperparameter	Parameter Range (ε)
Random Forest (RF)	Number of Estimators (n_estimators)	{50, 100, 200}
	Maximum Depth (max_depth)	{6, 8, 10}
Gradient Boosting Classifier (GBC)	Number of Estimators (n_estimators)	{50, 100, 200}
	Learning Rate (learning_rate)	{0.05, 0.1, 0.2}
	Maximum Depth (max_depth)	{3, 5, 7}
LightGBM (LGBM)	Number of Leaves (num_leaves)	{31, 63, 127}
Support Vector Machine (SVM)	Regularization Parameter (C)	{0.1, 1, 10}
	Kernel Type (kernel)	{rbf, poly}
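The search space above can be written as scikit-learn parameter grids. The following sketch is illustrative only: the dictionary and function names are hypothetical, the 5-fold cross-validation and accuracy scoring are assumptions, and LightGBM is listed in the grid but not instantiated so that the example depends only on scikit-learn.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Parameter ranges transcribed from Appendix B.1
SEARCH_SPACE = {
    "RF": {"n_estimators": [50, 100, 200], "max_depth": [6, 8, 10]},
    "GBC": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.05, 0.1, 0.2],
        "max_depth": [3, 5, 7],
    },
    "LGBM": {"num_leaves": [31, 63, 127]},
    "SVM": {"C": [0.1, 1, 10], "kernel": ["rbf", "poly"]},
}

# Estimators available without extra dependencies (LightGBM left out on purpose)
ESTIMATORS = {
    "RF": RandomForestClassifier(random_state=42),
    "GBC": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(),
}


def make_search(name, cv=5, scoring="accuracy"):
    """Build an exhaustive grid search for one model (cv and scoring are assumed)."""
    return GridSearchCV(ESTIMATORS[name], SEARCH_SPACE[name], cv=cv, scoring=scoring)


# Number of candidate configurations each grid expands to
for name, grid in SEARCH_SPACE.items():
    print(name, len(ParameterGrid(grid)), "candidates")
```

Calling make_search("RF").fit(X_train, y_train) would then evaluate every Random Forest candidate and expose the winner through the best_params_ attribute.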

Appendix B.2. Cross-Validated Grid Search Results for Stacking Models A and B

Model	Accuracy	Std. Dev.	Meta-Learner	C Value	Learning_Rate (GB)	n_estimators (RF, XGB)
Model A	0.973	±0.00	Logistic Regression	0.1	0.1	100
Model A	0.973	±0.00	Logistic Regression	1.0	0.1	100
Model A	0.972	±0.00	Logistic Regression	10.0	0.1	100
Model B	0.973	±0.00	SVC (probability = True)	0.1	0.1	100
Model B	0.970	±0.00	SVC (probability = True)	1.0	0.1	100

Appendix C. K-Means Clustering Results (K = 3 and K = 4) on PCA-Reduced Feature Spaces with 90%, 95%, and 99% Explained Variance Thresholds

References

1. Gautam, A.; Prakash, A.; Kaushal, G. Artificial Intelligence in Cybersecurity. Int. J. Adv. Res. Sci. Commun. Technol. 2024, 4, 607–610.
2. Anti-Phishing Working Group (APWG). Phishing Activity Trends Report, Q1 2023. 2023. Available online: https://apwg.org (accessed on 12 June 2025).
3. Tang, L.; Mahmoud, Q.H. A Survey of Machine Learning-Based Solutions for Phishing Website Detection. Mach. Learn. Knowl. Extr. 2021, 3, 672–694.
4. Taha, A. Intelligent ensemble learning approach for phishing website detection based on weighted soft voting. Mathematics 2021, 9, 2799.
5. Abdelhamid, N.; Ayesh, A.; Thabtah, F. Phishing detection based Associative Classification data mining. Expert Syst. Appl. 2014, 41, 5948–5959.
6. Salem, A.H.; Azzam, S.M.; Emam, O.E.; Abohany, A.A. Advancing Cybersecurity: A Comprehensive Review of AI-Driven Detection Techniques; Springer International Publishing: Berlin/Heidelberg, Germany, 2024; Volume 11.
7. Yan, C.; Han, X.; Zhu, Y.; Du, D.; Lu, Z.; Liu, Y. Phishing behavior detection on different blockchains via adversarial domain adaptation. Cybersecurity 2024, 7, 45.
8. Alhuzali, A.; Alloqmani, A.; Aljabri, M.; Alharbi, F. In-Depth Analysis of Phishing Email Detection: Evaluating the Performance of Machine Learning and Deep Learning Models Across Multiple Datasets. Appl. Sci. 2025, 15, 3396.
9. Sahingoz, S.O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357.
10. Adane, K.; Beyene, B. Machine Learning and Deep Learning Based Phishing Websites Detection: The Current Gaps And Next Directions. Rev. Comput. Eng. Res. 2022, 9, 13–29.
11. Zieni, R.; Massari, L.; Calzarossa, M.C. Phishing or Not Phishing? A Survey on the Detection of Phishing Websites. IEEE Access 2023, 11, 18499–18519.
12. Kavya, S.; Sumathi, D. Staying ahead of phishers: A review of recent advances and emerging methodologies in phishing detection. Artif. Intell. Rev. 2024, 58, 50.
13. Taha, M.A.; Jabar, H.D.A.; Mohammed, W.K. A Machine Learning Algorithms for Detecting Phishing Websites: A Comparative Study. Iraqi J. Comput. Sci. Math. 2024, 5, 275–286.
14. Buller, D.B.; Burgoon, J.K. Interpersonal Deception Theory. Commun. Theory 1996, 6, 203–242.
15. Fogg, B.J. Prominence-Interpretation theory: Explaining how people assess credibility online. In Proceedings of the Human Factors in Computing Systems, Ft. Lauderdale, FL, USA, 5–10 April 2003; pp. 722–723.
16. Fogg, B.J.; Soohoo, C.; Danielson, D.R.; Marable, L.; Stanford, J.; Tauber, E.R. How do users evaluate the credibility of Web sites?: A study with over 2500 participants. In Proceedings of the 2003 Conference on Designing for User Experiences DUX’03, San Francisco, CA, USA, 6–7 June 2003; pp. 1–15.
17. Li, W.; Manickam, S.; Chong, Y.-W.; Leng, W.; Nanda, P. A State-of-the-Art Review on Phishing Website Detection Techniques. IEEE Access 2024, 12, 187976–188012.
18. Hossain, A.; Khin, L.; Wison, G. An Intelligent Phishing Detection and Protection Scheme Using a Fusion of Images, Frames and Text. 2020. Available online: https://www.researchgate.net/publication/353482275_An_Intelligent_Phishing_Detection_and_Protection_Scheme_using_a_fusion_of_Images_Frames_and_Text (accessed on 10 June 2025).
19. Jain, A.K.; Gupta, B.B. A machine learning based approach for phishing detection using hyperlinks information. J. Ambient. Intell. Humaniz. Comput. 2018, 10, 2015–2028.
20. Safi, A.; Singh, S. A systematic literature review on phishing website detection techniques. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 590–611.
21. Li, Y.; Yang, Z.; Chen, X.; Yuan, H.; Liu, W. A stacking model using URL and HTML features for phishing webpage detection. Futur. Gener. Comput. Syst. 2019, 94, 27–39.
22. Karim, A.; Shahroz, M.; Mustofa, K.; Belhaouari, S.B.; Joga, S.R.K. Phishing Detection System Through Hybrid Machine Learning Based on URL. IEEE Access 2023, 11, 36805–36822.
23. Hamadouche, S.; Boudraa, O.; Gasmi, M. Combining Lexical, Host, and Content-based features for Phishing Websites detection using Machine Learning Models. ICST Trans. Scalable Inf. Syst. 2024, 11, 1–15.
24. Joshi, A.; Lloyd, L.; Westin, P.; Seethapathy, S. Using Lexical Features for Malicious URL Detection—A Machine Learning Approach. arXiv 2019, arXiv:1910.06277. Available online: http://arxiv.org/abs/1910.06277 (accessed on 10 June 2025).
25. Hong, J.; Kim, T.; Liu, J.; Park, N.; Kim, S.-W. Phishing URL Detection with Lexical Features and Blacklisted Domains. In Adaptive Autonomous Secure Cyber Systems; Springer International Publishing: Cham, Switzerland, 2020; pp. 253–267.
26. Haq, Q.E.U.; Faheem, M.H.; Ahmad, I. Detecting Phishing URLs Based on a Deep Learning Approach to Prevent Cyber-Attacks. Appl. Sci. 2024, 14, 10086.
27. Prakash, P.; Kumar, M.; Kompella, R.R.; Gupta, M. PhishNet: Predictive Blacklisting to Detect Phishing Attacks. In Proceedings of the 2010 Proceedings IEEE INFOCOM, San Diego, CA, USA, 14–19 March 2010; pp. 1–5.
28. Rashid, F.; Doyle, B.; Han, S.C.; Seneviratne, S. Phishing URL detection generalisation using Unsupervised Domain Adaptation. Comput. Netw. 2024, 245, 110398.
29. Catal, C.; Giray, G.; Tekinerdogan, B.; Kumar, S.; Shukla, S. Applications of Deep Learning for Phishing Detection: A Systematic Literature Review; Springer: London, UK, 2022; Volume 64.
30. Al-Sarem, M.; Saeed, F.; Al-Mekhlafi, Z.G.; Mohammed, B.A.; Al-Hadhrami, T.; Alshammari, M.T.; Alreshidi, A.; Alshammari, T.S. An optimized stacking ensemble model for phishing websites detection. Electronics 2021, 10, 1285.
31. Mohamed, N.; Taherdoost, H.; Madanchian, M. Enhancing Spear Phishing Defense with AI: A Comprehensive Review and Future Directions. ICST Trans. Scalable Inf. Syst. 2024, 12, 1–10.
32. Calzarossa, M.C.; Giudici, P.; Zieni, R. An assessment framework for explainable AI with applications to cybersecurity. Artif. Intell. Rev. 2025, 58, 150.
33. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed.; Independently Published: Chicago, IL, USA, 2020; Available online: https://books.google.com/books/about/Interpretable_Machine_Learning.html?id=jBm3DwAAQBAJ (accessed on 10 June 2025).
34. Smith, G. Top Phishing Statistics for 2025: Latest Figures and Trends. 2025. Available online: https://www.stationx.net/phishing-statistics/ (accessed on 10 June 2025).
35. Hoxhunt. Phishing Trends Report (Updated for 2025). 2025. Available online: https://hoxhunt.com/guide/phishing-trends-report? (accessed on 10 June 2025).
36. Egress. Must-Know Phishing Statistics for 2025. 2024. Available online: https://www.egress.com/blog/security-and-email-security/must-know-phishing-statistics-for-2025? (accessed on 10 June 2025).
37. National University. 101 Cybersecurity Statistics and Trends for 2025. 2025. Available online: https://www.nu.edu/blog/cybersecurity-statistics/? (accessed on 10 June 2025).
38. Bhattacharya, T.; Veeramalla, S.; Tanniru, V. A Survey on Retrieving Confidential Data Using Phishing Attack. In Proceedings of the 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), Las Vegas, NV, USA, 24–27 July 2023; pp. 2528–2535.
39. Alhashmi, S.; Alneyadi, A.; Alshehhi, M.; Lamaazi, H. Mobile and Web Applications Clones: A Comprehensive Study. In Proceedings of the 2023 International Wireless Communications and Mobile Computing (IWCMC), Marrakesh, Morocco, 19–23 June 2023; pp. 464–469.
40. Akanbi, O.A.; Amiri, I.S.; Fazeldehkordi, E. A Machine-Learning Approach to Phishing Detection and Defense; Elsevier: Amsterdam, The Netherlands, 2014.
41. Basit, A.; Zafar, M.; Liu, X.; Javed, A.R.; Jalil, Z.; Kifayat, K. A comprehensive survey of AI-enabled phishing attacks detection techniques. Telecommun. Syst. 2021, 76, 139–154.
42. Wang, Y.; Ma, W.; Xu, H.; Liu, Y.; Yin, P. A Lightweight Multi-View Learning Approach for Phishing Attack Detection Using Transformer with Mixture of Experts. Appl. Sci. 2023, 13, 7429.
43. Zhou, J.; Cui, H.; Li, X.; Yang, W.; Wu, X. A Novel Phishing Website Detection Model Based on LightGBM and Domain Name Features. Symmetry 2023, 15, 180.
44. Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Intelligent phishing detection scheme using deep learning algorithms. J. Enterp. Inf. Manag. 2020, 36, 747–766.
45. Dhanavanthini, P.; Chakkravarthy, S.S. Phish-Armour: Phishing detection using deep recurrent neural networks. Soft Comput. 2023.
46. Calzarossa, M.C.; Giudici, P.; Zieni, R. How robust are ensemble machine learning explanations? Neurocomputing 2025, 630, 129686.
47. Lim, B.; Huerta, R.; Sotelo, A.; Quintela, A.; Kumar, P. EXPLICATE: Enhancing Phishing Detection through Explainable AI and LLM-Powered Interpretability. arXiv 2025, arXiv:2503.20796.
48. Kumar, P.P.; Jaya, T.; Rajendran, V. SI-BBA—A novel phishing website detection based on Swarm intelligence with deep learning. Mater. Today Proc. 2023, 80, 3129–3139.
49. Rasymas, T.; Dovydaitis, L. Detection of Phishing URLs by Using Deep Learning Approach and Multiple Features Combinations. Balt. J. Mod. Comput. 2020, 8, 471–483.
50. Elkouay, A.; Moussa, N.; Madani, A. Graph-Based phishing detection: URLGBM model driven by machine learning. Int. J. Comput. Appl. 2024, 46, 481–495.
51. Nielsen, J. Usability Engineering; Academic Press Limited: London, UK, 1993.
52. Jo, J.; Cho, J.; Moon, J. A malware detection and extraction method for the related information using the ViT attention mechanism on Android operating system. Appl. Sci. 2023, 13, 6839.
53. Shaukat, M.W.; Amin, R.; Muslam, M.M.A.; Alshehri, A.H.; Xie, J. A hybrid approach for alluring ads phishing attack detection using machine learning. Sensors 2023, 23, 8070.
54. Moussavou Boussougou, M.K.; Park, D.-J. Attention-based 1D CNN-BiLSTM hybrid model enhanced with FastText word embedding for Korean voice phishing detection. Mathematics 2023, 11, 3217.
55. Nti, I.K.; Somanathan, A.R. A Scalable RF-XGBoost Framework for Financial Fraud Mitigation. IEEE Trans. Comput. Soc. Syst. 2022, 11, 1556–1563.
56. Yadollahi, M.M.; Shoeleh, F.; Serkani, E.; Madani, A.; Gharaee, H. An Adaptive Machine Learning Based Approach for Phishing Detection Using Hybrid Features. In Proceedings of the 2019 5th International Conference on Web Research (ICWR), Tehran, Iran, 24–25 April 2019; pp. 281–286.
57. Green, D.M.; Swets, J.A. Signal Detection Theory and Psychophysics; Wiley & Sons, Inc.: New York, NY, USA, 1974; Volume 27.
58. Mohammad, R.M.; Thabtah, F.; McCluskey, L. Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 2014, 25, 443–458.
59. Omari, K. Comparative Study of Machine Learning Algorithms for Phishing Website Detection. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 417–425.
60. Lakshmi, L.; Reddy, M.P.; Santhaiah, C.; Reddy, U.J. Smart Phishing Detection in Web Pages Using Supervised Deep Learning Classification and Optimization Technique ADAM. Wirel. Pers. Commun. 2021, 118, 3549–3564.

Figure 1. Classification of website features used for malicious URL detection, categorized into behavioral, URL-based, domain-specific, webpage-based, and traffic and search engine-based groups.

Figure 2. Dataflow framework for phishing detection and profiling. The architecture consists of two parallel paths: an unsupervised branch using PCA and K-Means for campaign profiling, and a supervised branch using ensemble classifiers for phishing detection. Clustering output is not fed into classification; both branches operate independently on the same preprocessed features.

Figure 3. Architecture of stacked ensemble (Model A).

Figure 4. Architecture of stacked ensemble (Model B).

Figure 5. Frequency distribution of behavioral features (“on_mouseover,” “RightClick,” “popUpWindow,” and “Iframe”) comparing legitimate and phishing websites in the dataset.

Figure 6. Correlation heatmap illustrating pairwise relationships among selected dataset features, highlighting degrees of correlation that may influence the malicious URL detection process.

Figure 7. K-Means clustering results (K = 2) on PCA-reduced feature space with varying explained variance thresholds: 90% (left), 95% (middle), and 99% (right). Colors indicate cluster assignments produced by K-Means (blue and orange denote the two clusters).

Figure 8. Comparative performance analysis of base classifiers (Random Forest, XGBoost, Gradient Boosting Classifier, and Logistic Regression) trained exclusively on behavioral features.

Figure 9. A comparison of how different base classifiers (Random Forest, XGBoost, Gradient Boosting Classifier, and Logistic Regression) and a voting ensemble model perform using all website features.

Figure 10. Confusion matrices showing the classification performance of base models (Random Forest, XGBoost, Gradient Boosting Classifier, and Logistic Regression) when trained on the complete set of website features.

Figure 11. Performance evaluation of stacked ensemble models (Stacking Model A and Stacking Model B), showcasing precision, recall, and F1-score for the classification of legitimate and phishing websites.

Figure 12. Receiver operating characteristic (ROC) curve and the precision–recall (PR) curve for Stacked Model A.

Figure 13. Feature importance ranking derived from the Random Forest classifier, emphasizing the most influential website attributes in differentiating between phishing and legitimate websites. Bar color intensity reflects relative importance magnitude.

Figure 14. SHAP summary (beeswarm) plot displaying the influence of each feature on the model’s output. Features are ordered by importance, with color gradients showing feature values (red for high, blue for low), and SHAP values indicating the size and direction of their effect on phishing classification.

Figure 15. SHAP force plot showing a local explanation of a single instance classified by the XGBoost model.

Figure 16. McNemar’s test results compare the prediction outcomes between the Random Forest (RF) and Voting Ensemble (VE) models, showing the number of samples where both models were correct, incorrect, or differed in correctness, to evaluate the statistical significance of performance differences.

Table 1. Summary of related studies on malicious URL detection, highlighting methodologies, study objectives, input features, and identified research gaps.

Ref.	Methodology (Machine Learning or AI Module)	Study Objectives	Input Features	Research Gap (Limitation)
[44]	Deep Learning (Convolutional Neural Networks)	Design and develop a deep learning-based phishing detection solution leveraging URL and website content (images, text, frames).	These include URL features, website images, text content, and frames.	The computational complexity of deep learning might limit its application in resource-constrained environments.
[45]	Deep Learning (Recurrent Neural Networks)	Achieve high accuracy in detecting malicious URLs using RNNs, focusing solely on URL text.	URL text.	The system is limited to URL text analysis and may not detect phishing sites with benign-looking URLs but malicious content.
[47]	Explainable AI, Large Language Models	Develop an AI framework for real-time phishing detection with enhanced interpretability to build user trust.	Not specified.	Preliminary results require further validation in real-world scenarios.
[48]	Swarm Intelligence (Binary Bat Algorithm), Deep Learning	Propose a deep learning-based swarm intelligence-binary bat algorithm for detecting phishing URLs.	URL features.	Computational complexity may hinder real-time applications.
[49]	Deep Learning (CNN, LSTM)	Compare different feature combinations (lexical, character-level, and word-level embeddings) for phishing URL classification.	These include lexical features, character-level embeddings, and word-level embeddings.	The computational complexity of deep learning might limit its application in resource-constrained environments.
[13]	Machine Learning (Random Forest, Decision Tree, XGBoost)	Compare the performance of various machine learning algorithms in detecting phishing websites.	Website features.	The study is limited to specific algorithms and may not reflect the performance of other models.

Table 2. Categorization of dataset features utilized in the study, grouped into behavioral, URL-based, domain-specific, webpage-based, and traffic and search engine-based attributes.

Features Category	Features
Behavioral	‘on_mouseover,’ ‘RightClick,’ ‘popUpWindow,’ ‘Iframe’
URL-based	‘having_IP_Address,’ ‘URL_Length,’ ‘Shortening_Service,’ ‘having_At_Symbol,’ ‘double_slash_redirecting,’ ‘Prefix_Suffix’
Domain	‘having_Sub_Domain,’ ‘SSLfinal_State,’ ‘Domain_registeration_length,’ ‘age_of_domain,’ ‘DNSRecord’
Webpage-based	‘Favicon,’ ‘port,’ ‘HTTPS_token,’ ‘Request_URL,’ ‘URL_of_Anchor,’ ‘Links_in_tags,’ ‘SFH,’ ‘Submitting_to_email,’ ‘Abnormal_URL,’ ‘Redirect’
Traffic & Search Engine	‘web_traffic,’ ‘Page_Rank,’ ‘Google_Index,’ ‘Links_pointing_to_page,’ ‘Statistical_report’

Table 3. Key model hyperparameters.

Model	Hyperparameters
Random Forest (RF)	n_estimators = 100, max_features = 'sqrt', criterion = 'gini', random_state = 42
XGBoost (XGB)	n_estimators = 100, eval_metric = 'logloss', use_label_encoder = False, objective = 'binary:logistic', random_state = 42
Gradient Boosting (GB)	n_estimators = 100, learning_rate = 0.1, loss = 'log_loss', max_depth = 3, random_state = 42
Logistic Regression (LR)	penalty = 'l2', solver = 'lbfgs', max_iter = 1000, C = 0.1
SVC	C = 0.1
K-Means Clustering	n_clusters = 2 (final), with sensitivity analysis for K = 2, 3, 4; random_state = 42
Principal Component Analysis (PCA)	n_components = 0.95 (final), with sensitivity analysis for 90%, 95%, 99% explained variance

Table 4. Sensitivity analysis of K-Means clustering (K = 2, 3, 4) applied to PCA-transformed feature spaces with varying explained variance thresholds (90%, 95%, 99%).

PCA Variance (%)	Components	K = 2 (Silh/DB)	K = 3 (Silh/DB)	K = 4 (Silh/DB)
90%	19	0.2921/1.7888	0.3046/1.5290	0.1290/2.0189
95%	23	0.2796/1.8457	0.2907/1.5848	0.1201/2.1135
99%	28	0.2797/1.9096	0.2850/1.6354	0.1140/2.1650

Silh = Silhouette Score (higher is better); DB = Davies-Bouldin Index (lower is better).
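A sensitivity analysis of this kind can be sketched with scikit-learn as follows. This is an illustrative outline rather than the authors’ implementation: the function name is hypothetical, and the n_init value and the absence of any feature scaling step are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score


def clustering_sensitivity(X, variances=(0.90, 0.95, 0.99), ks=(2, 3, 4), seed=42):
    """Score K-Means for each (explained-variance threshold, K) combination."""
    rows = []
    for variance in variances:
        # A float n_components keeps the fewest components that reach this
        # share of explained variance
        Z = PCA(n_components=variance, svd_solver="full").fit_transform(X)
        for k in ks:
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
            rows.append({
                "variance": variance,
                "components": Z.shape[1],
                "k": k,
                "silhouette": silhouette_score(Z, labels),          # higher is better
                "davies_bouldin": davies_bouldin_score(Z, labels),  # lower is better
            })
    return rows
```

Each returned row corresponds to one cell of Table 4, so the nine rows can be pivoted by variance threshold and K to compare against the reported values.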

Table 5. Phishing detection baseline comparison table.

Model	Behavioral Only (Accuracy)	Behavioral Only (Macro F1)	Structural Only (Accuracy)	Structural Only (Macro F1)	Combined (Accuracy)	Combined (Macro F1)
Random Forest	0.50	0.40	0.91	0.91	0.97	0.97
XGBoost	0.50	0.40	0.91	0.91	0.97	0.97
Gradient Boosting	0.50	0.40	0.90	0.90	0.94	0.94
Logistic Regression	0.50	0.40	0.88	0.88	0.91	0.91
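The three feature configurations in Table 5 can be compared with a small helper like the one below. It is a sketch under stated assumptions: the function name, the 30% stratified hold-out split, and the use of a Random Forest as the only classifier are illustrative choices, while the behavioral column names are taken from Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Behavioral feature names as listed in Table 2
BEHAVIORAL = ("on_mouseover", "RightClick", "popUpWindow", "Iframe")


def compare_feature_sets(X, y, feature_names, seed=42):
    """Return {configuration: (accuracy, macro F1)} on a shared hold-out split."""
    X, y = np.asarray(X), np.asarray(y)
    behavioral = [i for i, n in enumerate(feature_names) if n in BEHAVIORAL]
    structural = [i for i, n in enumerate(feature_names) if n not in BEHAVIORAL]
    subsets = {
        "behavioral": behavioral,
        "structural": structural,
        "combined": list(range(X.shape[1])),
    }
    # One split reused for every configuration so the scores are comparable
    train_idx, test_idx = train_test_split(
        np.arange(len(y)), test_size=0.3, stratify=y, random_state=seed
    )
    results = {}
    for name, cols in subsets.items():
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X[np.ix_(train_idx, cols)], y[train_idx])
        pred = model.predict(X[np.ix_(test_idx, cols)])
        results[name] = (
            accuracy_score(y[test_idx], pred),
            f1_score(y[test_idx], pred, average="macro"),
        )
    return results
```

Reusing the same train and test indices for all three configurations means that any difference in scores comes from the features, not from a different partition of the data.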

Table 6. Training and testing duration of stacked ensemble models.

Model	Meta-Learner	Training Time (s)	Testing Time (s)
Stacked A	Logistic Regression	9.87	0.11
Stacked B	Support Vector Classifier (SVC)	12.15	0.31
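Timings such as these can be collected with a small wrapper around fit and predict. The helper below is a hypothetical sketch using only the standard library; it works with any object exposing scikit-learn-style fit and predict methods and measures wall-clock time, so results depend on hardware and system load.

```python
import time


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (training seconds, testing seconds, predictions) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds, predictions
```

Repeating the measurement several times and reporting the median would reduce the influence of one-off delays, although the table does not say whether this was done.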

Table 7. Comparative analysis of the proposed phishing detection model against selected recent studies. The table highlights differences in dataset origin, dataset type, detection techniques, and reported evaluation metrics.

Reference	Dataset	Dataset Description	Technique	F1-Score (%)	Accuracy (%)
[44]	PhishTank, Common Crawl, OpenPhish	Uses URL, text, images, frame content (per paper)	Hybrid deep learning (CNN & LSTM)	X	93.28
[45]	PhishTank, Common Crawl, OpenPhish	URL-based features only	Federated Learning + Deep NN	90.50	90.50
[13]	UCI Phishing Websites Dataset	Structured features from URLs	RF, DT, XGBoost	X	94.57–96.89
[48]	PhishTank, Alexa, UNB ISCX 2016	Character-level URL features only	Deep learning (1D CNN)	X	94.8
[49]	PhishTank + public benign URL sources	Character & word-level embeddings; lexical features	Deep neural network (CNN + GRU, char & word embeddings)	X	94.4
[58]	UCI Phishing Websites Dataset	Structured features from URLs	Functional Tree	96.1	96.07
[59]	UCI Phishing Websites Dataset	Structured features from URLs	LR, KNN, SVM, NB, DT, RF and Gradient Boosting	96.9	97.2
[60]	UCI Phishing Websites Dataset	Structured features from URLs	DNN + Adam	X	96
Our study	UCI Phishing Dataset	Features include URL + behavioral metadata	Stacking, XGBoost, Random Forest	97	97

Note: For References [44,45,48,49], direct performance comparisons should be interpreted with caution due to differences in dataset composition, feature engineering, and experimental setups.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nti, I.K.; Ozer, M.; Li, C. Synergistic Phishing Intrusion Detection: Integrating Behavioral and Structural Indicators with Hybrid Ensembles and XAI Validation. Future Internet 2026, 18, 30. https://doi.org/10.3390/fi18010030

