1. Introduction
Phishing remains one of the most pervasive and damaging cyber threats because it directly targets human trust while scaling efficiently through digital channels. The majority of phishing campaigns operate by mimicking trustworthy websites [1] through clickable links, where a single malicious URL can trigger credential theft, malware delivery, or financial fraud at scale. The Anti-Phishing Working Group reported [2] that the SaaS and webmail category was the single most targeted sector (17.6%), while online payment and banking together accounted for 30.9%. Retail, social media, telecommunications, and logistics also experienced substantial targeting. In addition, attackers are increasingly abusing QR codes delivered through email to steer victims to phishing pages or to facilitate malware installation. Across a six-month period, over 1.7 million distinct malicious QR codes were identified, alongside an average of 2.7 million QR-code emails sent per day. Business email compromise (BEC) activity involving wire transfers also rose, increasing by 33% compared with the previous quarter. While the mean amount demanded per BEC incident fell to $42,236, gift card fraud remained the dominant tactic (51% of cases), with payroll diversion and cryptocurrency-related scams also frequently observed.
Substantial research has addressed phishing detection using machine learning (ML) and deep learning (DL) approaches. Prior studies commonly learned from URL lexical patterns, host and domain registration features, and, in some cases, page or content attributes, often combined with feature selection and ensemble learning to improve robustness and deployment feasibility. More recent work reflected a shift toward representation learning, where authors learned URL embeddings directly or combined URL strings with contextual signals to improve generalization across evolving campaigns [3,4]. Existing surveys and systematic reviews indicated that ML and DL dominated the state of practice, with frequent use of tree-based ensembles and neural architectures, and with public repositories such as PhishTank widely used to construct phishing datasets [5]. These studies collectively demonstrated that automated detection was feasible and that high performance was achievable under controlled evaluation settings. However, despite these advances, classical approaches remain constrained by feature engineering limitations, scalability challenges, and difficulties in modeling complex non-linear relationships inherent in evolving phishing patterns, motivating exploration of alternative computational paradigms.
Quantum machine learning (QML) has recently emerged as a potential alternative to ML and DL approaches. Several prior studies investigated QML-based classifiers in cybersecurity contexts and reported both promising results and notable limitations. For example, researchers evaluated quantum-enhanced support vector machine variants and related pipelines, typically using simulators, curated datasets, or restricted feature sets [6,7]. Work focused on phishing URL detection also explored quantum-enhanced models to assess whether quantum feature spaces or hybrid quantum–classical workflows could remain competitive with strong classical baselines [8]. Beyond acting as classifiers, QML models introduce fundamentally different mechanisms for representing data, where classical features are embedded into high-dimensional Hilbert spaces through quantum feature maps. In this view, QML can be interpreted as a knowledge representation framework, rather than merely a predictive tool, enabling the exploration of alternative geometric structures, interference effects, and non-linear correlations that are difficult to realize in classical feature spaces.
Phishing-related features, such as URL length, character distributions, subdomain depth, lexical entropy, and protocol indicators, naturally form structured, high-dimensional vectors. When encoded into quantum states, these features are mapped into Hilbert spaces whose dimensionality grows exponentially with the number of qubits, potentially allowing subtle relationships between benign and malicious URLs to be represented through quantum superposition and entanglement. This mapping provides a distinct representational perspective on phishing data, in which separability is governed not only by feature values, but also by the geometry induced by the chosen quantum encoding and circuit architecture.
In principle, QML offers new forms of feature mapping, kernel construction, and optimization behavior that differ from classical learners and may benefit problems involving complex, high-dimensional decision boundaries [9,10]. However, QML is currently constrained by the noisy intermediate-scale quantum (NISQ) era, where limited qubit counts, hardware noise, and shallow circuit depth significantly affect performance and reproducibility, making careful encoding strategies, feature scaling, and error mitigation central design considerations [6,11]. These constraints create uncertainty regarding whether reported QML performance improvements reflect genuine algorithmic advantages or artifacts of simulator-based experimental conditions, highlighting the need for systematic evaluation.
Despite growing interest, the knowledge base for QML in phishing detection remains underexplored. Existing studies are scattered, heterogeneous, and often evaluated under incompatible assumptions, making it difficult for ML researchers to extract transferable lessons. In particular, there are currently no studies that systematically review which QML classifiers are most suitable for phishing detection, how phishing features are encoded into quantum representations, what performance and limitations have been reported, and whether QML offers meaningful advantages under realistic hardware constraints.
Therefore, the primary objective of this systematic literature review is to explore the current state of QML classifiers for phishing detection, with a specific focus on identifying dominant techniques, encoding strategies, reported performance, practical constraints, and research gaps. Rather than merely summarizing prior studies, this review aims to assess the methodological maturity, performance trends, and deployment readiness of QML approaches within the phishing detection domain.
The study applies a PRISMA-guided systematic review methodology to identify, screen, and analyze relevant studies published between 2021 and 2025. By synthesizing the available evidence through structured qualitative and descriptive analysis, this review provides a coherent understanding of how QML has been applied, what limitations remain, and what future directions are necessary to advance the field. The main contributions of this work are as follows:
We employed an SLR methodology using predefined search and selection criteria to ensure methodological rigor and transparency.
We reviewed and synthesized existing QML classifiers applied to phishing detection, including their feature encoding strategies, strengths, and limitations.
We identified key technical, methodological, and hardware-related challenges affecting QML performance and deployment readiness.
We provided a critical synthesis of current evidence to evaluate the practical feasibility and research maturity of QML classifiers for phishing detection.
The remainder of this paper is organized as follows.
Section 2 provides background on phishing attacks and QML.
Section 3 describes the SLR methodology, including the search strategy, screening, and selection process.
Section 4 presents and categorizes the results.
Section 5 discusses and interprets the findings.
Section 6 outlines the study’s limitations.
Section 7 concludes the paper and highlights broader directions for future research.
2. Background
The following section connects the structural characteristics of phishing detection with the computational principles of QML to establish the technical basis for this review, highlighting how the limitations of classical approaches motivate the exploration of QML.
Rather than treating phishing detection and QML as independent topics, this section develops a continuous technical argument, starting from the complexity of phishing feature spaces and progressing toward the representational capabilities offered by quantum learning models.
2.1. Phishing
Phishing is a cyberattack technique in which adversaries impersonate legitimate entities to deceive users into revealing sensitive information, such as login credentials, financial data, or personal identifiers [12]. It has evolved significantly since its emergence in the 1990s, progressing from simple email scams to highly sophisticated campaigns that employ automation, obfuscation, and social engineering techniques [13,14]. These attacks are typically delivered through malicious URLs, email links, fake websites, or embedded scripts designed to mimic trusted websites [15].
Phishing attacks typically follow a multi-stage lifecycle (Figure 1) consisting of planning, delivery, data collection, and exploitation phases [16,17]. This lifecycle highlights that phishing detection systems must operate under adversarial conditions, where attackers continuously modify features to evade classification models. Consequently, phishing detection models must achieve not only high accuracy but also robustness, generalization, and scalability.
URL-level feature engineering has emerged as an effective approach for automated phishing detection. Lexical features, such as URL length, subdomain depth, the use of IP-based hosts, and entropy derived from character n-grams, provide strong discriminatory signals for classification models. Structural features, including the ratio of special characters, the presence of brand-related keywords in the URL path, and redirect patterns, further enhance the feature space. In addition, host-based attributes, such as domain age obtained from WHOIS records, DNS behavior, and page rank, are commonly incorporated into ensemble models to improve generalization across diverse phishing campaigns [18].
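The lexical and structural features listed above can be illustrated with a minimal sketch. The function names, the chosen feature subset, and the example URL are assumptions introduced here for illustration and are not taken from the reviewed studies; the entropy is computed over single characters (unigrams) rather than general n-grams, and subdomain depth is approximated naively without a public-suffix list.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_ip_host(host: str) -> bool:
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Extract a small, illustrative set of URL-level features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    ip_host = is_ip_host(host)
    # Naive depth: labels left of the last two (ignores suffixes such as co.uk).
    labels = host.split(".") if host and not ip_host else []
    subdomain_depth = max(len(labels) - 2, 0)
    specials = sum(1 for ch in url if not ch.isalnum())
    return {
        "url_length": len(url),
        "subdomain_depth": subdomain_depth,
        "ip_host": ip_host,
        "char_entropy": shannon_entropy(url),
        "special_char_ratio": specials / len(url) if url else 0.0,
        "uses_https": parts.scheme == "https",
    }


# Hypothetical example URL (example.com is a reserved documentation domain).
example = lexical_features("http://login.secure.example.com/account?id=1")
```

In practice, such a dictionary would be converted into a numeric vector and scaled before being passed to any classical or quantum encoding step.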
Within a machine learning framework, phishing URL detection is formulated as a binary classification problem. Given a URL u, the objective is to assign it to one of two classes, phishing or legitimate, such that:
f(u) ∈ {phishing, legitimate}
This classification relies on a feature set, represented as a vector x ∈ ℝᵈ, extracted from the URL string, the corresponding webpage’s HTML content, and additional contextual attributes. Crucially, the resulting feature space is high-dimensional, heterogeneous, and exhibits complex non-linear interactions between features. A key challenge lies in identifying the most informative and discriminative features from a high-dimensional and potentially noisy feature space, while maintaining high detection accuracy, low false positive rates, and efficient real-time performance.
Classical ML approaches, including Support Vector Machines, Random Forests, and Neural Networks, have demonstrated strong performance in phishing detection tasks. However, several limitations persist. First, these models often depend on handcrafted feature representations, which may fail to capture subtle and evolving phishing patterns. Second, modeling complex non-linear decision boundaries in high-dimensional spaces can be computationally intensive. Third, classical models are susceptible to adversarial manipulation and feature obfuscation techniques.
These limitations indicate that classical feature representations and learning mechanisms may be insufficient to fully capture the complex, high-dimensional relationships inherent in phishing data, thereby motivating the investigation of alternative computational frameworks.
2.2. Quantum Machine Learning
In response to these challenges, QML has emerged as a computational paradigm that introduces fundamentally different mechanisms for data representation and learning, offering potential advantages for modeling high-dimensional and non-linear classification problems such as phishing detection.
QML is an emerging interdisciplinary field that integrates quantum computing principles with ML algorithms to improve data representation and pattern recognition [19,20]. Unlike classical ML, which operates on binary bits, QML operates on quantum bits (qubits), which can exist in superposition states and exhibit entanglement. A qubit can be expressed as a superposition of the basis states |0⟩ and |1⟩, given by |ψ⟩ = α|0⟩ + β|1⟩, where α and β are complex amplitudes satisfying the normalization condition |α|² + |β|² = 1. A system of n qubits can exist in a superposition of 2ⁿ basis states, forming the basis of quantum parallelism [21]. This property is particularly relevant to phishing detection, where subtle variations in URL structure and content features require expressive representations capable of capturing complex and non-linear decision boundaries.
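The normalization condition and the 2ⁿ scaling described above can be checked numerically. The following is a minimal sketch in plain Python; the helper names and the example amplitudes are assumptions chosen for illustration.

```python
import math
from typing import Tuple


def normalize(alpha: complex, beta: complex) -> Tuple[complex, complex]:
    """Rescale two amplitudes so that |alpha|^2 + |beta|^2 = 1."""
    norm = math.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
    if norm == 0:
        raise ValueError("at least one amplitude must be non-zero")
    return alpha / norm, beta / norm


def measurement_probabilities(alpha: complex, beta: complex) -> Tuple[float, float]:
    """Probabilities of observing |0> and |1> when the qubit is measured."""
    return abs(alpha) ** 2, abs(beta) ** 2


def state_space_dimension(n_qubits: int) -> int:
    """Number of basis states (and amplitudes) of an n-qubit register."""
    return 2 ** n_qubits


# Example: an equal superposition with a relative phase of i on |1>.
alpha, beta = normalize(1 + 0j, 1j)
p0, p1 = measurement_probabilities(alpha, beta)
```

Note that a measurement returns only one basis state per shot, so the 2ⁿ amplitudes are not directly readable; they influence outcomes only through measurement statistics.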
The quantum kernel method represents a well-founded approach for near-term QML, leveraging quantum circuits as feature maps to define kernels within exponentially large Hilbert spaces. For a classical input x ∈ ℝᵈ, a quantum feature map ϕ: ℝᵈ → ℋ, where ℋ denotes the Hilbert space of the qubit register, encodes the data into a quantum state |ϕ(x)⟩ via a parameterized encoding circuit. The resulting kernel function, κ(xᵢ, xⱼ) = |⟨ϕ(xᵢ)|ϕ(xⱼ)⟩|², measures the similarity between quantum-encoded states and can be estimated through quantum circuit evaluations [22]. Importantly, this transformation implicitly maps classical data into a high-dimensional feature space, aligning with the structural requirements of phishing detection problems, where class boundaries are non-linear and difficult to separate in the original space.
This kernel can be integrated into classical classifiers, such as Support Vector Machines, forming Quantum Support Vector Machines (QSVM). Additional architectures include Variational Quantum Classifiers (VQC), Quantum Neural Networks (QNN), and Quantum Convolutional Neural Networks (QCNN) [23,24,25,26,27], typically implemented within hybrid quantum–classical frameworks.
QML offers several theoretical advantages from a computational perspective. First, quantum feature maps can project classical data into higher-dimensional feature spaces, potentially improving class separability. Second, quantum circuits can model complex non-linear relationships through entanglement and interference. Third, quantum algorithms may offer computational speed advantages for certain optimization tasks.
Therefore, QML provides a theoretically grounded framework for addressing the core challenges of phishing detection, particularly high-dimensionality, non-linearity, and adversarial feature variability, by leveraging quantum-enhanced feature representations. Despite these theoretical advantages, the practical applicability of QML to phishing detection remains unclear due to hardware limitations, experimental inconsistencies, and the absence of consolidated evidence across studies.
3. Methodology
To address the mentioned gap, a Systematic Literature Review (SLR) was conducted to synthesize existing studies on QML classifiers for phishing detection. The review adhered to PRISMA guidelines [28] to ensure methodological rigor and transparency. The study selection progression was summarized in a PRISMA flow diagram [29], presented in Figure 2.
The primary aim of this methodology is not only to identify relevant studies but also to systematically characterize their technical approaches and synthesize the evidence to assess the maturity and feasibility of QML in phishing detection.
This systematic review was designed to address five key research questions (RQs) that defined the scope of the study:
RQ1: What QML models have been applied for phishing detection?
RQ2: What feature encoding strategies have been used to represent phishing data in QML models?
RQ3: How do realistic quantum hardware constraints affect the performance and generalization of these models?
RQ4: What advantages and limitations have been reported for existing QML classifiers?
RQ5: What challenges and future research directions have been identified in the literature?
3.1. Search Strategy
First, we performed an extensive search across multiple electronic databases, including the ACM Digital Library, IEEE Xplore, MDPI, ScienceDirect, and SpringerLink. These sources were selected due to their broad and reputable coverage of computer science and cybersecurity research, which enabled access to relevant and high-quality studies. The search string was defined as shown in Table 1. The search was restricted to studies published within the past five years, between early January 2021 and early December 2025, to capture recent developments reflecting the evolution of both cybersecurity threats and QML.
3.2. Study Selection
The study selection process was carried out in three sequential stages. First, all records retrieved from the databases were combined (n = 19). Next, during the screening stage, titles, abstracts, and keywords were examined to exclude irrelevant studies; studies were excluded if they did not involve QML, did not address phishing detection, or were review articles, editorials, or theoretical papers without empirical evaluation (n = 12). Then, full-text articles were assessed for eligibility to confirm relevance and methodological adequacy; only peer-reviewed journal and conference papers written in English were included, and one article was excluded. The final selection resulted in six studies that met all inclusion criteria.
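The counts reported in the selection stages can be reconciled with a short arithmetic check. The sketch below assumes that "n = 12" denotes the number of records excluded at the screening stage; the variable names are introduced here for illustration.

```python
# Counts taken from the study selection description.
records_identified = 19      # all records retrieved from the databases
excluded_at_screening = 12   # assumption: "n = 12" counts records removed at screening
excluded_at_full_text = 1    # one article excluded during eligibility assessment

# Records that proceed to full-text assessment, then to final inclusion.
full_text_assessed = records_identified - excluded_at_screening
included = full_text_assessed - excluded_at_full_text
```

Under this reading, seven articles reach full-text assessment and six are included, which matches the final count stated above.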
3.3. Included
The final stage of selection resulted in the inclusion of six research articles, as shown in the lower section of Figure 2. The limited number of included studies reflected the early adoption of QML within the phishing detection domain. Nevertheless, all eligible studies were systematically analyzed to ensure comprehensive coverage of available evidence. Key details of the included studies, such as source databases, publication years, references, and citation counts, were reported in Table 2, providing a transparent overview of the selection outcomes.
3.4. Data Extraction and Synthesis
A structured data extraction process was conducted to systematically collect relevant information from each included study. The extracted data included the type of QML classifier, feature encoding strategy, dataset characteristics, evaluation metrics, reported performance, as well as the identified advantages and limitations of each approach. In addition, information on the platform used, such as quantum simulators or real quantum devices, was also recorded. The data extraction was performed independently by the first author and subsequently verified by the second author to ensure accuracy, consistency, and to minimize potential bias in the review process.
The synthesis was conducted using descriptive statistical analysis and thematic grouping. Frequency analysis was used to identify dominant QML classifiers and encoding strategies, while thematic synthesis was applied to categorize advantages, limitations, and research challenges.
3.5. Quality Assessments
The quality assessment of included studies was conducted using an adapted Newcastle–Ottawa Scale (NOS) [35]. The NOS was structured across three domains, selection, comparability, and outcome, with a maximum score of nine stars. The scale was adapted to suit empirical phishing detection studies using QML or quantum-inspired approaches, where quality depended strongly on dataset integrity, fair benchmarking, and transparent evaluation.
In the selection domain, we evaluated: (1) dataset relevance and representativeness, (2) labeling and ground-truth quality, (3) preprocessing transparency and leakage control, and (4) adequacy of the train–test design, including split strategies and imbalance handling. In the comparability domain, we evaluated: (5) baseline strength and fairness of comparison and (6) control of experimental confounders, including hyperparameter tuning, feature selection, encoding or feature-map settings, circuit or ansatz configuration, and training conditions. In the outcome domain, we evaluated: (7) appropriateness and completeness of outcome measures, (8) robustness and statistical reliability, and (9) reproducibility and generalizability.
Quality scores were independently assessed by two researchers. Studies scoring 7–9 stars were classified as high quality, 4–6 stars as moderate quality, and 1–3 stars as low quality, as summarized in Table 3.
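The banding rule above can be written as a small helper. This is a sketch: the function name is hypothetical, and a score of zero stars, which the stated bands do not cover, is treated here as low quality by assumption.

```python
def quality_band(stars: int) -> str:
    """Map an adapted NOS score (0-9 stars) to a quality band."""
    if not 0 <= stars <= 9:
        raise ValueError("NOS score must be between 0 and 9 stars")
    if stars >= 7:
        return "high"
    if stars >= 4:
        return "moderate"
    # 1-3 stars per the stated bands; 0 stars is an assumption of this sketch.
    return "low"


# Band for every possible score, e.g. for building a summary table.
bands = {stars: quality_band(stars) for stars in range(10)}
```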
4. Results
This section presents the synthesized findings from the reviewed studies on QML for phishing detection, organized by the research questions. It first identifies the QML models applied to phishing detection and groups them into four categories. It then examines the feature encoding strategies employed to map classical phishing features into quantum representations. Building on this, it analyzes how realistic deployment factors, including limited qubit counts, circuit depth, and hardware noise, influence the learning behavior, stability, and generalization of QML models. It next reports the advantages and disadvantages associated with each technique, as documented in the selected studies. Finally, it consolidates the current challenges that limit practical QML deployment and summarizes the future research directions reported in the literature.
4.1. RQ1: What QML Models Have Been Applied for Phishing Detection?
Figure 3 illustrates the occurrence frequency of QML classifiers applied to phishing detection across the reviewed studies. The results show that four main classifier categories were identified: QSVM/QSVC, VQC, QCNN, and QNN. Among these, QSVM/QSVC was the most frequently adopted approach, appearing in four out of six studies, followed by VQC and QCNN, each reported in three studies, while QNN was the least frequently used, appearing in two studies. This result indicates a clear research preference toward kernel-based and variational quantum classifiers.
QSVM/QSVC represents the quantum equivalent of the classical Support Vector Machine (SVM), where classical data are embedded into quantum states using quantum feature maps, such as ZZFeatureMap, and classification is performed using quantum kernel evaluation. QSVM enables improved class separability by projecting phishing URL features into a high-dimensional Hilbert space, particularly when phishing and legitimate URLs exhibit subtle differences in lexical and structural patterns. This capability is especially important in phishing detection, where attackers deliberately manipulate features to resemble legitimate websites. The high occurrence of QSVM/QSVC in Figure 3 reflects its strong theoretical foundation, simpler implementation within hybrid quantum–classical frameworks, and compatibility with current noisy intermediate-scale quantum (NISQ) hardware constraints.
VQC, identified in three studies, represents another widely explored approach. VQC combines quantum feature encoding with parameterized quantum circuits, known as variational ansatzes, whose parameters are optimized using classical optimizers to minimize classification error. This hybrid optimization process allows VQC to learn non-linear decision boundaries by leveraging quantum superposition and entanglement properties. In phishing detection, this enables the model to capture complex feature interactions and improve discrimination between malicious and benign URLs. The relatively high occurrence of VQC in Figure 3 indicates its flexibility and adaptability, particularly in hybrid quantum–classical learning environments.
QCNN, also appearing in three studies, extends quantum learning toward hierarchical feature extraction by incorporating convolution-like quantum circuit layers. These architectures are designed to learn structured feature representations through multiple quantum processing layers, similar to classical convolutional neural networks. In phishing detection, QCNN enables progressive transformation and abstraction of input features, potentially improving detection performance. However, despite its theoretical advantages, QCNN was less dominant compared to QSVM, likely due to its higher circuit complexity, increased sensitivity to quantum noise, and greater hardware resource requirements, which limit practical implementation under current NISQ constraints.
QNN, identified in two studies, represents hybrid quantum–classical neural architectures that combine classical neural network concepts with parameterized quantum circuits. In these models, phishing features are first encoded into quantum states using feature maps, and classification is performed using trainable quantum circuits. While QNN provides flexibility and expressive learning capability, its lower occurrence in Figure 3 suggests that its practical performance and scalability remain limited compared to kernel-based and variational approaches.
4.2. RQ2: What Feature Encoding Strategies Have Been Used to Represent Phishing Data in QML Models?
Figure 4 presents the occurrence frequency of feature encoding strategies used to map classical phishing features into quantum representations across the reviewed studies. The results reveal substantial variation in encoding adoption, reflecting trade-offs between representational expressiveness and hardware feasibility. Among the identified methods, ZZ Feature Map was the most frequently used encoding, appearing in three studies, followed by Z Feature Map, Angle Encoding, and Amplitude Encoding, each appearing in two studies. The remaining encoding strategies, including Quantum Random Access Coding (QRAC), Qudit-based Encoding, Pauli-feature maps, Rotation Angle Setting, Cascaded or Concatenated QRAC, Categorical and Binary Feature Embedding, and One-Hot Encoding, were each used in one study.
The ZZ Feature Map, identified as the most dominant encoding strategy, encodes classical phishing features using single-qubit rotation gates combined with entanglement operations to project data into high-dimensional quantum feature spaces. This encoding is particularly suitable for kernel-based classifiers such as QSVM, as it enables modeling of pairwise feature interactions through quantum entanglement. In phishing detection, this allows relationships between URL lexical features, such as character distributions and subdomain patterns, to be captured more effectively. However, because ZZ Feature Map requires one qubit per feature, its scalability is limited when applied to high-dimensional phishing datasets, which restricts its practical deployment under current NISQ hardware constraints.
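The role of the pairwise term in a ZZ-style feature map can be illustrated for two features on two qubits. The sketch below is a classical state-vector calculation under stated assumptions: a single repetition, Hadamard gates followed by a diagonal phase layer, single-feature phases 2xₖ, and a pairwise phase 2(π − x₀)(π − x₁) applied when the two qubits differ in parity. This phase convention follows a commonly used form and is an assumption here, not the configuration reported by the reviewed studies.

```python
import cmath
import math
from typing import List, Tuple


def zz_style_state(x0: float, x1: float) -> List[complex]:
    """Two-qubit state after Hadamards and one diagonal ZZ-style phase layer."""
    pair_phase = (math.pi - x0) * (math.pi - x1)  # assumed pairwise data map
    state = []
    for b1 in (0, 1):
        for b0 in (0, 1):
            # Single-qubit phases act when a bit is 1; the pairwise phase
            # acts when the two bits have odd parity (b0 XOR b1 = 1).
            phase = 2 * (x0 * b0 + x1 * b1 + pair_phase * (b0 ^ b1))
            state.append(0.5 * cmath.exp(1j * phase))
    return state


def zz_style_kernel(x: Tuple[float, float], y: Tuple[float, float]) -> float:
    """Fidelity kernel |<phi(x)|phi(y)>|^2 for the two-feature map above."""
    sx, sy = zz_style_state(*x), zz_style_state(*y)
    overlap = sum(a.conjugate() * b for a, b in zip(sx, sy))
    return abs(overlap) ** 2


# Two hypothetical, already-scaled feature vectors.
k_xy = zz_style_kernel((0.3, 1.1), (0.9, 0.4))
```

Because the pairwise phase depends on the product of the two features, the encoded state is in general entangled, which is how this family of maps represents feature interactions. The sketch also makes the scaling constraint visible: each additional feature adds one qubit and doubles the state-vector length.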
Z Feature Map encoding uses Hadamard and parameterized rotation gates to embed phishing features into quantum states. This encoding is particularly useful when feature interactions are less complex, as it avoids heavy entanglement and reduces circuit complexity, improving stability under quantum noise conditions.
Angle Encoding was identified in two studies and represents a simple and hardware-efficient encoding method, where phishing features are encoded directly into qubit rotation angles. This encoding is particularly suitable for variational quantum circuits because it enables efficient parameterized transformations while maintaining shallow circuit depth. Its lower hardware requirements make it more feasible for near-term quantum devices, although it provides lower representational capacity compared to amplitude encoding.
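A minimal sketch of angle encoding follows, assuming min-max scaling of each feature into [0, π] and one RY rotation per feature. The scaling range, the feature bounds, and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def scale_to_angle(value: float, lo: float, hi: float) -> float:
    """Min-max scale a feature into [0, pi]; out-of-range values are clipped."""
    if hi <= lo:
        raise ValueError("upper bound must exceed lower bound")
    clipped = min(max(value, lo), hi)
    return math.pi * (clipped - lo) / (hi - lo)


def angle_encode(
    features: Sequence[float], bounds: Sequence[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Return one single-qubit state (amp0, amp1) = RY(theta)|0> per feature."""
    qubits = []
    for value, (lo, hi) in zip(features, bounds):
        theta = scale_to_angle(value, lo, hi)
        qubits.append((math.cos(theta / 2), math.sin(theta / 2)))
    return qubits


# Hypothetical features: URL length in [0, 200] and subdomain depth in [0, 5].
encoded = angle_encode([44, 2], [(0, 200), (0, 5)])
```

With this scaling, the smallest feature value leaves the qubit in |0⟩ and the largest rotates it to |1⟩, so each feature consumes one qubit but only a single rotation gate.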
Amplitude Encoding represents another powerful encoding strategy, where phishing feature vectors are mapped directly into the amplitudes of quantum states. This encoding provides high representational efficiency because it allows exponential data compression, where n classical features can be represented using ⌈log₂ n⌉ qubits. This makes amplitude encoding theoretically attractive for high-dimensional phishing datasets. However, its implementation requires strict data normalization and complex quantum state preparation, which limits its practical usage under current hardware and simulator constraints.
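The normalization requirement and the logarithmic qubit count can be shown with a short sketch. It computes only the target amplitude vector; the state-preparation circuit, which is the costly part in practice, is not modeled. The function name and the zero-padding choice are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def amplitude_encode(features: Sequence[float]) -> Tuple[List[float], int]:
    """Pad to the next power of two and normalize to unit L2 norm.

    Returns the amplitude vector and the number of qubits it requires.
    """
    if len(features) == 0:
        raise ValueError("feature vector must not be empty")
    # Smallest n_qubits with 2**n_qubits >= len(features), and at least one qubit.
    n_qubits = max(1, (len(features) - 1).bit_length())
    padded = list(features) + [0.0] * (2 ** n_qubits - len(features))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0:
        raise ValueError("an all-zero vector cannot be normalized")
    return [v / norm for v in padded], n_qubits


# Five hypothetical features fit into three qubits (eight amplitudes).
amplitudes, n_qubits = amplitude_encode([3.0, 4.0, 0.0, 0.0, 12.0])
```

Normalization discards the overall magnitude of the feature vector, so two URLs whose features differ only by a common scale factor would map to the same state.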
Pauli-feature maps, another quantum kernel-based encoding strategy, were also applied to load phishing data into quantum states for classification. These encodings use combinations of Pauli rotation operators to transform classical features into quantum representations, enabling flexible construction of quantum kernels. Similar to ZZ Feature Map, Pauli-feature maps offer strong representational capability but face scalability challenges due to qubit requirements.
Quantum Random Access Coding (QRAC) and its Cascaded or Concatenated variants were introduced as compression-based encoding strategies to address qubit limitations. QRAC enables multiple classical bits to be encoded into fewer qubits while maintaining partial recoverability, allowing high-dimensional phishing features to be compressed into smaller quantum circuits. Cascaded QRAC extends this concept by grouping multiple QRAC encodings to achieve greater compression efficiency, with reported hardware compression improvements of up to 3.5 times. This approach significantly improves hardware feasibility and enables more realistic deployment scenarios, particularly when quantum hardware resources are limited.
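The idea of partial recoverability can be illustrated with the smallest textbook case, a (2,1)-QRAC that stores two bits in one qubit. The reviewed studies' exact QRAC variant is not specified here, so this sketch is an assumption chosen for clarity; it works directly with the Bloch vector rather than simulating a circuit.

```python
import math
from typing import Tuple


def qrac21_bloch(b1: int, b2: int) -> Tuple[float, float]:
    """Bloch-vector components (x, z) of the qubit encoding bits (b1, b2).

    Bit b1 sets the sign of the z component and bit b2 the sign of x.
    """
    s = 1 / math.sqrt(2)
    return ((-1) ** b2) * s, ((-1) ** b1) * s


def decode_success_probability(b1: int, b2: int, which: int) -> float:
    """Probability of correctly recovering bit 1 (Z basis) or bit 2 (X basis)."""
    if which not in (1, 2):
        raise ValueError("which must be 1 or 2")
    x, z = qrac21_bloch(b1, b2)
    component = z if which == 1 else x
    bit = b1 if which == 1 else b2
    # Outcome +1 decodes to bit 0 and outcome -1 to bit 1;
    # P(outcome = s) = (1 + s * component) / 2 for s in {+1, -1}.
    return (1 + ((-1) ** bit) * component) / 2


# Success probability is identical for every input pair and either bit.
p_success = decode_success_probability(1, 0, which=2)
```

Either bit can be recovered with probability cos²(π/8) ≈ 0.85, but not both with certainty, which is the trade-off between qubit savings and recoverability described above.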
Qudit-based Encoding extends conventional qubit encoding by utilizing higher-dimensional quantum units, known as qudits, to represent phishing data. This approach increases representational capacity and allows more nuanced encoding of complex phishing patterns, particularly linguistic and structural features of malicious URLs. However, because qudit-based quantum hardware remains limited, this encoding has been explored less frequently.
Rotation Angle Setting represents a specialized encoding approach where specific numerical phishing features, such as IP address values, are transformed into quantum rotation angles. This enables direct mapping of numerical data into quantum circuits without requiring additional transformation steps, improving encoding efficiency.
Categorical and Binary Feature Embedding converts phishing features into binary or categorical representations before encoding them into quantum states. This approach simplifies encoding complexity and enables compatibility with quantum circuits, particularly when phishing datasets include categorical attributes such as protocol type or domain category.
One-Hot Encoding was primarily used as a preprocessing step to convert categorical phishing features into binary vectors before quantum encoding. This method ensures compatibility between classical feature formats and quantum encoding circuits but does not directly provide quantum advantage.
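As a preprocessing illustration, the sketch below one-hot encodes a categorical attribute. The category list and function name are hypothetical and not taken from the reviewed datasets.

```python
from typing import List, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Hypothetical category list for a protocol-type attribute.
PROTOCOLS = ["http", "https", "ftp"]
vector = one_hot("https", PROTOCOLS)
```

If each resulting bit is later assigned its own qubit, a feature with k categories consumes k qubits, which is one reason compression-oriented encodings are attractive for categorical-heavy datasets.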
4.3. RQ3: How Do Realistic Quantum Hardware Constraints Affect the Performance and Generalization of These Models?
In this section, we distinguish hardware noise (device-level decoherence and gate/readout errors that reduce circuit fidelity), noise-like regularization effects (stochasticity from finite shots or training dynamics that may sometimes improve generalization), and adversarial perturbations (deliberate input manipulations designed to evade detection), because these factors influence QML learning behavior in fundamentally different ways.
Realistic constraints in the current Noisy Intermediate-Scale Quantum (NISQ) era strongly shape the learning behavior and performance of QML models for phishing detection. Limited qubit availability forces researchers to adopt efficient feature encoding strategies, such as QRAC, which compresses multiple classical features into a small number of qubits. This approach not only mitigates hardware scarcity but also enables faster training and evaluation cycles.
Despite these advantages, circuit depth remains a major bottleneck. Increasing the repetitions of feature maps or variational ansatzes to capture subtle linguistic patterns in URLs substantially raises execution wall time and amplifies error rates due to short qubit coherence times. In addition, hardware noise and quantum interference frequently cause measured state probabilities to diverge from ideal simulation results, leading to noticeable performance degradation on real quantum devices and motivating the use of error mitigation techniques.
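One source of divergence between measured and ideal probabilities can be reproduced without any hardware model: finite-shot sampling. The sketch below simulates only this statistical effect, not decoherence or gate errors; the probability value, shot count, and seed are arbitrary assumptions.

```python
import math
import random


def estimate_probability(p_true: float, shots: int, rng: random.Random) -> float:
    """Estimate an outcome probability from a finite number of measurement shots."""
    hits = sum(1 for _ in range(shots) if rng.random() < p_true)
    return hits / shots


def standard_error(p: float, shots: int) -> float:
    """Standard error of a binomial proportion estimated from `shots` samples."""
    return math.sqrt(p * (1 - p) / shots)


rng = random.Random(7)  # fixed seed so the sketch is repeatable
estimate = estimate_probability(0.3, 10_000, rng)
```

Because the standard error shrinks only as 1/√shots, halving the sampling error of a kernel entry or expectation value requires four times as many circuit executions, which compounds the wall-time cost of deeper circuits.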
Interestingly, several studies reported that intrinsic quantum noise can, in some cases, improve generalization by acting as a form of regularization. This effect helps models resist overfitting, particularly when exposed to adversarial manipulations such as URL obfuscation. Furthermore, while Quantum Support Vector Machines exhibit high stability and strong recall in high-dimensional Hilbert spaces, more complex architectures, such as Quantum Convolutional Neural Networks, currently struggle to generalize effectively due to the difficulty of tuning quantum parameters under noisy conditions.
4.4. RQ4: What Advantages and Limitations Have Been Reported for Existing QML Models?
The included studies were grouped into four QML model families, and the advantages and disadvantages reported for each are summarized in the following subsections.
4.4.1. QSVM/QSVC
This subsection reports the advantages and disadvantages of QSVM or QSVC for phishing detection.
Table 4 summarizes the QSVM/QSVC benefits and limitations across the reviewed studies. The reviewed studies reported that QSVM or QSVC improved class separability by using quantum kernels and feature maps to embed classical features into a high-dimensional Hilbert space, thereby strengthening discrimination between benign and phishing samples. Several studies reported higher predictive performance than classical SVM baselines on the same datasets, including accuracy values up to 92% compared with 85–89% for classical SVM in one evaluation, and improved recall that reduced missed phishing cases. One study reported stronger robustness under adversarial perturbations, where QSVM performance remained above 88% while classical models dropped below 75%. Another reported a reduction in training time of approximately 40% relative to selected classical baselines under the same experimental design. This figure refers to the optimizer or kernel-training runtime within the authors’ setup. It should not be confused with the substantially higher “resource cost” noted among the limitations, which reflects end-to-end simulation overhead (state-vector updates, shot sampling, and repeated circuit evaluations) rather than the cost of the classical SVM solver itself. For graph-based phishing detection, QSVM-based approaches were reported to yield fewer false negatives than graph convolutional baselines in the evaluated setting. In addition, QRAC encoding was reported to improve QSVM performance by approximately 3% compared with a ZZ feature map in one study.
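The kernel idea behind QSVM can be sketched without a quantum SDK. The example below assumes a simple angle encoding with one RY rotation per feature and no entanglement, for which the state overlap has a closed form. This is a deliberately simplified stand-in and not the feature map used in any reviewed study. The feature values and function names are hypothetical.

```python
import math

def angle_encoding_kernel(x, y):
    """Fidelity |<psi(x)|psi(y)>|^2 for a product state with one RY(x_i) per qubit."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have equal length")
    value = 1.0
    for xi, yi in zip(x, y):
        # Single-qubit overlap: <psi(xi)|psi(yi)> = cos((xi - yi) / 2).
        value *= math.cos((xi - yi) / 2.0) ** 2
    return value

def gram_matrix(samples):
    """Kernel matrix over all pairs of samples."""
    return [[angle_encoding_kernel(a, b) for b in samples] for a in samples]

# Hypothetical URL-derived features, already scaled to [0, pi].
samples = [[0.1, 2.9, 1.2], [0.2, 2.7, 1.0], [3.0, 0.3, 2.2]]
K = gram_matrix(samples)
for row in K:
    print([round(v, 4) for v in row])
```

A Gram matrix of this kind can be handed to a classical SVM solver that accepts precomputed kernels. Because this product-state kernel is cheap to evaluate classically, it offers no quantum advantage by itself. Entangling maps such as the ZZ feature map do not factorize in this way, which is why their kernels are estimated from circuit executions.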
The reviewed studies also reported several limitations. Implementation remained constrained by NISQ hardware noise and limited access to stable real-device execution. Multiple studies reported that simulation-based QSVM or QSVC incurred substantial computational cost, including an experiment where QSVM simulation required nearly 2000 times the resources of classical SVM. Several evaluations were conducted primarily on simulators rather than on physical quantum processors, which restricted operational validation and deployment relevance.
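One general source of simulation overhead is the memory needed for a dense state vector, which doubles with every added qubit. The sketch below computes this under the assumption of double-precision complex amplitudes (16 bytes each). It illustrates the scaling only and does not reconstruct the 2000-fold figure reported above.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector: 2**n complex amplitudes of 16 bytes each."""
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    return (2 ** num_qubits) * bytes_per_amplitude

for n in (10, 20, 30):
    print(n, statevector_bytes(n) / 2**30, "GiB")
```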
4.4.2. VQC
This subsection reports the advantages and disadvantages of VQC for phishing detection.
Table 5 summarizes the VQC benefits and limitations across the reviewed studies. The reviewed studies reported that VQC combined quantum feature maps with variational ansatzes and leveraged superposition and entanglement to support expressive decision boundaries in classification tasks. QRAC-VQC was reported to outperform a ZZ feature map by approximately 13% in one evaluation. QRAC-based VQC models were also reported to achieve high recall, which reduced false negatives and was particularly valuable for security-sensitive detection settings. One study reported that PhishVQC achieved a maximum macro-average F1-score of 0.89, representing a 22% improvement over earlier VQC-based results, and VQC was reported to achieve among the highest F1-scores across QML techniques on certain datasets, outperforming selected classical baselines such as SVM.
However, the reviewed studies reported that VQC performance often degraded on real quantum hardware due to noise, indicating the need for effective error mitigation. Studies also reported increasing computational burden as dataset size grew, where execution wall time increased substantially and certain ansatz designs, such as EfficientSU2, consistently required more execution time than alternatives such as RealAmplitudes. Current hardware constraints were reported to limit scaling to larger sample sizes and more complex feature representations.
4.4.3. QCNN
This subsection reports the advantages and disadvantages of QCNN for phishing detection.
Table 6 summarizes the key advantages and disadvantages of QCNN for phishing detection. The reviewed studies reported that QCNN supported hierarchical learning and feature extraction, suggesting potential for capturing complex dependencies among URL features. In some evaluations, the added complexity of QCNN was reported to yield slight improvements in phishing detection performance compared with QNN variants. QCNN was also reported to navigate high-dimensional feature spaces while maintaining a hybrid efficiency profile in comparison with fully quantum architectures.
The selected studies also reported limitations of QCNN. Its performance was described as sensitive to quantum noise, and its circuit parameters as difficult to tune under NISQ constraints. QCNN architectures were also reported to be less suited to purely numerical feature structures and were often designed with inductive biases that aligned more naturally with image-like or spatially structured data. In addition, QCNN circuit complexity was reported to slow convergence relative to simpler QNN models, and some studies reported that adding quantum layers did not consistently yield the expected gains and sometimes increased training difficulty and runtime.
4.4.4. QNN
This subsection reports the advantages and disadvantages of QNN for phishing detection.
Table 7 summarizes the advantages and limitations of QNN reported in the reviewed studies. The reviewed studies reported that QNN combined classical neural principles with parameterized quantum circuits within a hybrid quantum–classical architecture. One study reported that QNN performed strongly on legitimate-sample classification, suggesting potential utility for designing complementary detection strategies. However, the reviewed studies reported that QNN generally underperformed relative to classical neural networks and QSVC in phishing detection accuracy and class-specific sensitivity. The models were reported to learn more effectively on legitimate samples than on phishing samples, which increased the risk of misclassifying phishing instances as benign. In one evaluation, average phishing detection performance was reported to be approximately 70%.
4.5. RQ5: What Challenges and Future Research Directions Have Been Identified in the Literature?
This subsection synthesizes the findings reported in the reviewed studies to summarize the current challenges limiting the practical adoption of QML for phishing detection and the future research directions proposed to address them.
4.5.1. Current Challenges
Table 8 consolidates the current challenges reported across the reviewed studies. The reviewed studies reported that QML performance was strongly constrained by NISQ hardware limitations. Gate errors, decoherence, and measurement instability were repeatedly reported to degrade performance on real devices relative to simulator-based results. Limited qubit availability was also reported to restrict feature dimensionality and hinder scaling to real-world phishing datasets. Several studies reported high computational and simulation costs, where QML pipelines required substantially more resources than classical counterparts and execution wall time increased sharply as dataset size grew. Feature encoding was consistently reported as a bottleneck because inefficient feature maps increased qubit requirements and added overhead.
Many studies primarily relied on quantum simulators rather than physical hardware, limiting real-world validation. Circuit design sensitivity and ansatz selection were also reported to affect runtime and model stability, with certain configurations increasing execution time substantially. In addition, studies reported challenges in supporting large-scale or real-time phishing detection due to training and inference latency. Finally, the reviewed studies reported that the lack of standardized benchmarks and evaluation protocols limited fair comparison between QML approaches and classical ML baselines.
4.5.2. Future Research Directions
Table 9 presents the future research directions most frequently reported in the reviewed studies. The reviewed studies emphasized improving robustness on NISQ hardware through noise-aware modeling and error-mitigation techniques. They also highlighted the need for more efficient quantum feature encoding schemes to reduce qubit requirements and encoding overhead. Multiple studies called for scalable QML architectures capable of handling large, real-world phishing datasets and for increased adoption of hybrid quantum–classical pipelines to balance quantum expressiveness with classical efficiency. Extending evaluations from simulators to real quantum hardware through systematic benchmarking was repeatedly recommended.
Several studies also proposed adaptive or incremental QML frameworks to address evolving phishing tactics and concept drift. Circuit simplification and optimized ansatz design were reported as important for reducing execution time and training cost. In addition, studies consistently recommended developing standardized datasets and evaluation protocols to enable reproducible and fair comparisons with classical methods. Finally, the reviewed studies suggested expanding QML applications to broader phishing contexts, including multimodal and zero-shot settings, and improving explainability and interpretability to support trustworthy deployment in cybersecurity environments.
Future research should also emphasize standardized benchmarking practices across different quantum hardware platforms. This includes evaluating QML models on multiple quantum processors, comparing performance across varying noise profiles, and conducting cross-device validation. Studies should further report detailed hardware configurations, such as qubit connectivity, gate fidelities, calibration data, and measurement error rates, to support reproducibility and fair comparison. In addition, ablation studies examining the effects of circuit depth, connectivity constraints, and encoding strategies would provide deeper insight into the practical limitations and scalability of QML models.
5. Discussion
This discussion highlights the main findings and indicates that research on QML for phishing detection has concentrated on a small set of techniques, with QSVM/QSVC and VQC appearing most frequently, while QCNN and QNN remain less common and more experimental. Across these techniques, the results consistently suggest a trade-off between promising classification performance under controlled settings, often on simulators, and limited deployment readiness due to NISQ noise, qubit scarcity, encoding overhead, and high computational costs.
5.1. Interpreting the Landscape of QML Models Used for Phishing Detection
The results showed that the reviewed studies employed four main categories of QML techniques, namely QSVM or QSVC, VQC, QCNN, and QNN. QSVM or QSVC and VQC dominated the literature, a pattern that can be interpreted from a practicality perspective. In the reviewed studies, both techniques were implemented within hybrid quantum–classical pipelines, where classical preprocessing generated compact feature representations and quantum circuits were subsequently used either to compute kernel functions, as in QSVM or QSVC, or to implement parameterized decision functions, as in VQC. This design aligned well with how phishing datasets were represented, most commonly as URL-derived feature vectors and, in some cases, as transaction or interaction networks for node classification. QCNN was positioned as a model capable of hierarchical learning and implicit dimensionality reduction through layered parameterized quantum gates; however, it was reported to be less frequently applied due to data suitability issues and current hardware limitations.
Quantum feature maps embed classical input features into Hilbert spaces whose dimension grows exponentially with the number of qubits. This property is particularly relevant for phishing detection, where class boundaries are inherently complex because attackers deliberately imitate legitimate websites. From a theoretical and empirical standpoint, quantum kernels can exploit feature spaces that are inaccessible to conventional feature engineering methods, a capability that has been shown to improve class separability in practical learning scenarios [36]. These findings indicate that kernel-based and variational QML models currently align most closely with the structural properties of phishing data and the constraints of NISQ-era hardware.
In contrast, QNN approaches showed weaker phishing-class performance in the reviewed evaluations, which likely reduced their adoption relative to QSVM and VQC. Although QNN architectures offer conceptual flexibility as hybrid quantum–classical neural models, the reported results suggested limited benefits under current noise and scaling constraints.
Future research should focus on systematically comparing QSVM, VQC, and emerging hybrid architectures under standardized datasets and evaluation protocols. Further work should also explore noise-aware circuit design, qubit-efficient feature encodings, and adaptive learning mechanisms to improve robustness against evolving phishing strategies. As quantum hardware matures, future studies should extend experimental validation beyond simulators to real devices and investigate whether deeper circuits or hybrid scaling strategies can enable more expressive QML models without sacrificing practicality.
5.2. Advantages and Disadvantages for Each Technique
QSVM and QSVC demonstrate consistent advantages through the use of quantum kernels and feature maps that project phishing features into high-dimensional Hilbert spaces, thereby improving class separability. This is particularly important because phishing indicators are often weakly separable in the original feature space, especially when adversaries mimic legitimate patterns. As shown in Table 10, QSVM-based approaches achieved accuracy levels up to 92% in hybrid settings and perfect classification (100%) in controlled subsets, while precision and recall values reached as high as 1.00 in kernel-based configurations. In comparison, classical SVM baselines typically reported accuracy in the range of 85–89%, indicating that quantum feature mapping functions as a strong nonlinear transformation rather than a direct model replacement.
In addition, robustness under adversarial perturbations was maintained, with QSVM performance remaining above 88% while classical models dropped below 75%, suggesting reduced sensitivity to small feature manipulations such as character substitutions or subdomain obfuscation [37]. The influence of encoding strategies is also evident in Table 10, where QRAC-QSVM achieved higher recall (0.96) and F1-score (0.93) compared with other configurations, supporting the claim that embedding choice contributes significantly to performance gains [38].
However, these advantages are offset by practical limitations. The reported computational cost of quantum simulations can reach approximately 2000 times that of classical SVM, which raises concerns regarding scalability and deployment feasibility in real-time phishing detection systems. Furthermore, most results remain dependent on simulators rather than real quantum hardware, limiting external validity.
VQC approaches combine quantum feature maps with variational ansatz circuits optimized via classical methods, offering flexibility in model design. As reflected in Table 10, VQC models achieved strong recall (0.93) and competitive F1-scores (up to 0.89), particularly in QRAC-based configurations. The PhishVQC model demonstrated high precision (0.97) but relatively lower recall (0.81), indicating a tendency toward conservative classification. The use of macro-average F1-score is appropriate in this context due to class imbalance, where false negatives carry higher operational risk [39]. Moreover, QRAC-based encoding improved performance by approximately 13% compared with ZZ-based feature maps, reinforcing the importance of encoding strategies.
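The reason macro-averaging matters under class imbalance can be seen in a short calculation. The sketch below computes accuracy and macro-average F1 from a binary confusion matrix. The counts are hypothetical and are not drawn from Table 10 or any reviewed study.

```python
def f1(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1_binary(tp, fp, fn, tn):
    """Unweighted mean of the per-class F1-scores, with phishing as the positive class."""
    f1_phishing = f1(tp, fp, fn)
    # For the legitimate class the roles of the counts swap.
    f1_legitimate = f1(tn, fn, fp)
    return (f1_phishing + f1_legitimate) / 2.0

# Hypothetical imbalanced test set: 100 phishing and 900 legitimate URLs.
tp, fp, fn, tn = 70, 30, 30, 870
accuracy = (tp + tn) / (tp + fp + fn + tn)
macro = macro_f1_binary(tp, fp, fn, tn)
print(round(accuracy, 3), round(macro, 3))
```

With these counts the accuracy is 0.94 while the macro-average F1 is about 0.83, because the weaker phishing-class score (0.70) carries the same weight as the majority class.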
Despite these strengths, VQC models exhibit sensitivity to NISQ noise and increased computational overhead as dataset size grows. Ansatz selection also affects efficiency, with EfficientSU2 circuits incurring higher computational cost than RealAmplitudes, thereby constraining scalability and retraining frequency in dynamic threat environments [40].
QCNN models introduce a hierarchical structure capable of capturing feature interactions, which is theoretically suitable for compositional phishing patterns. However, empirical results in Table 10 show inconsistent performance, with accuracy as low as 0.65 and modest F1-scores (0.62). Even in improved configurations, QCNN achieved only 85.22% accuracy, which remains below QSVM performance. These findings suggest that QCNN architectures are sensitive to noise, require complex tuning, and are less suited to tabular phishing datasets due to their reliance on spatial inductive biases [41].
QNN-based approaches demonstrate moderate performance, achieving accuracy of 0.9107 on the PhishStorm dataset, but lack complete metric reporting. The available precision (~0.94) suggests reasonable classification capability, yet the absence of recall and F1-score limits interpretability. More importantly, existing results indicate a bias toward legitimate-class prediction, with phishing detection performance reported around 70%, which is insufficient for deployment in high-risk environments.
5.3. A Practical Performance vs. Readiness View
The results showed that many reported performance gains were obtained under constrained evaluation settings, predominantly using simulators, while performance on real quantum hardware was repeatedly found to be vulnerable to NISQ-related noise.
Figure 5 presents readiness versus reported model performance of QML for phishing detection. The observed gap between simulator-based results and real-device behavior reflects a persistent limitation of NISQ-era quantum computing. Methods evaluated under idealized or lightly noisy simulation conditions often did not translate into comparable performance on physical devices, where combined noise sources, including decoherence, gate errors, depolarization, and state preparation inaccuracies, progressively reduced fidelity as circuit depth and qubit count increased. In phishing detection, which requires reliable and reproducible operation, this discrepancy represents a significant obstacle to practical validation and deployment [42].
This interpretation aligns with the challenges identified in the reviewed studies, including NISQ noise, limited qubit availability, high simulation cost, poor scalability with increasing dataset size, encoding overhead, and heavy reliance on simulators. In Figure 5, the horizontal axis denotes reported model performance, while the vertical axis indicates deployment readiness. QSVM or QSVC and VQC demonstrated strong reported performance but low readiness due to simulation expense, hardware sensitivity, and NISQ constraints. QCNN and QNN appeared in the low-performance, low-readiness region, reflecting mixed or weak phishing-class results and convergence difficulties.
5.4. Current Challenges and Implications
The results showed that hardware constraints dominated the simulator-to-hardware realism gap. The most consistent limitation was NISQ noise, including gate errors, decoherence, and unstable measurements, which degraded performance on real devices relative to simulators. These noise sources operated across different timescales. Coherent errors accumulated systematically during circuit execution, while decoherence arose from interactions with the environment, and both effects were compounded by calibration drift that varied within and across days on the same device. In this setting, algorithm-level design alone did not close the simulator-to-hardware gap without hardware-aware compilation and targeted error mitigation. This gap matters because phishing detection requires reliable operation under real-world conditions. If simulator gains do not transfer, QML remains a research prototype rather than a deployable defense component.
The results also showed that qubit scarcity and feature dimensionality restricted scaling. Limited qubits constrained the number of features that could be encoded, which limited applicability to large, real-world phishing datasets. Although NISQ devices may provide roughly 50 to more than 100 qubits, effective use was reduced by connectivity constraints, uneven calibration quality, and location-dependent error rates. Two-qubit gates were also substantially noisier than single-qubit operations, creating a trade-off between modeling richer relationships through entanglement and preserving circuit fidelity through shallow designs [43]. This constraint is particularly important because strong classical phishing detectors often benefit from richer feature sets and larger training data. Under strict qubit budgets, QML must compress features or adopt more efficient encodings, both of which can introduce new failure modes.
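The entanglement versus fidelity trade-off can be approximated with a back-of-the-envelope model in which every gate fails independently. The sketch below uses this simplification, which ignores readout error, decoherence during idle time, and correlated noise. The error rates and per-layer gate counts are hypothetical values chosen only for illustration.

```python
def estimated_circuit_fidelity(n_1q, n_2q, err_1q, err_2q):
    """Rough success estimate assuming independent gate errors only."""
    return ((1.0 - err_1q) ** n_1q) * ((1.0 - err_2q) ** n_2q)

# Hypothetical error rates and per-layer gate counts.
ERR_1Q, ERR_2Q = 1e-4, 1e-2
GATES_1Q_PER_LAYER, GATES_2Q_PER_LAYER = 8, 7

for layers in (1, 2, 4, 8):
    fid = estimated_circuit_fidelity(
        layers * GATES_1Q_PER_LAYER, layers * GATES_2Q_PER_LAYER, ERR_1Q, ERR_2Q
    )
    print(layers, round(fid, 3))
```

Under these assumptions the two-qubit gates dominate the loss, so each additional entangling layer lowers the estimate far more than the single-qubit rotations do.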
A third challenge reported in the results was high computational and simulation cost, sometimes orders of magnitude higher than classical baselines, coupled with poor scalability as dataset size increased and execution wall time grew. This directly conflicts with operational phishing defense, where models must retrain frequently and score large volumes of URLs with low latency.
The results further showed that feature encoding created a persistent bottleneck. Inefficient feature maps increased qubit requirements and added circuit depth. Encoding choices, such as angle encoding, block encoding, QRAC, or entanglement-based schemes, directly affected both resource use and fidelity. No single encoding strategy was consistently optimal because performance depended on dataset properties, ansatz choice, and hardware factors such as native gates and connectivity. As a result, encoding was often optimized through manual tuning or automated search methods.
Circuit design and ansatz selection also remained sensitive. In variational circuits, ansatz design must balance expressivity with trainability. More expressive ansatzes, such as EfficientSU2, can represent richer functions but often increase training time and optimization difficulty, including barren plateau effects. Conversely, simpler circuits reduce cost but increase the risk of underfitting. Together, these findings indicate that current QML outcomes may be shaped as much by representational engineering, particularly the encoding–ansatz pairing, as by the nominal choice of classifier family (QSVM, VQC, or QNN). Comparative evidence suggested that once an encoding–ansatz combination was carefully tuned for a given task, performance differences between classifier families often narrowed. This implies that reported “quantum advantage” may reflect problem-specific engineering choices rather than general algorithmic dominance, and claims of broad superiority therefore require caution.
In addition, future QML studies should adopt more rigorous and transparent hardware-level reporting practices. Experimental results obtained on quantum hardware can be significantly influenced by device-specific characteristics, including qubit topology, gate error rates, readout fidelity, and calibration conditions. Without detailed reporting of these hardware parameters, it remains difficult to determine whether observed performance improvements reflect genuine algorithmic advantages or are artifacts of specific hardware configurations. Systematic evaluation across multiple quantum processors with different noise characteristics would provide stronger evidence of algorithm robustness and generalizability. Furthermore, repeated experiments over time are necessary to account for calibration drift and temporal variability in quantum device performance.
Finally, the results reported that the lack of standardized benchmarks and evaluation protocols limited fair comparison with classical ML. This is consequential because small methodological differences, such as train–test splits, preprocessing order, or feature selection leakage, can inflate performance, especially in simulator-based studies with small datasets. In addition, without shared benchmarking frameworks, reproducibility is weakened and “quantum advantage” claims become difficult to validate across independent research groups. Future work should therefore adopt standardized datasets and protocols, report simulator versus hardware settings transparently, and include reproducibility artifacts to support credible cross-study comparison.
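The preprocessing-order issue can be made concrete with feature scaling. The sketch below fits min-max statistics on the training split only and reuses them for the test split, so no test-set information reaches the fitted statistics. The feature values and function names are hypothetical.

```python
def fit_min_max(rows):
    """Per-feature (min, max) computed from the given rows only."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]

def apply_min_max(rows, stats):
    """Scale rows with previously fitted statistics; test values may fall outside [0, 1]."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, stats)
        ])
    return scaled

# Hypothetical features: [URL length, number of dots].
train = [[20, 1], [35, 2], [60, 4]]
test = [[90, 3]]

stats = fit_min_max(train)              # fitted on the training split only
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)
print(test_scaled)
```

Fitting the statistics on the combined data instead would let the test URL of length 90 redefine the scaling range, which is one of the subtle leakage paths that shared protocols are meant to rule out.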
5.5. Future Research Directions: A Coherent Roadmap
The results showed that the most emphasized direction was the development of noise-aware and error-mitigated QML models to improve reliability on real NISQ hardware. Error mitigation methods such as dynamical decoupling, circuit folding for zero-noise extrapolation, and noise-aware compilation were reported as promising for improving fidelity on physical devices. However, these approaches also introduced overhead through increased circuit depth, higher gate counts, or additional classical post-processing. The findings further indicated that integrating device-specific calibration data into compilation and optimization could improve performance, but this tailoring reduced portability across hardware platforms [44]. This trade-off is important because credible phishing defense requires robust performance on real devices, not only on simulators.
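The classical post-processing step of zero-noise extrapolation can be sketched as a curve fit. The example below assumes a linear noise model and fits expectation values measured at amplified noise levels, then reads off the value at scale factor zero. The measured values are hypothetical, and a linear fit is only one of several possible extrapolation choices.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def zero_noise_extrapolate(scale_factors, expectations):
    """Extrapolate measured expectation values to the zero-noise limit (scale factor 0)."""
    _, intercept = linear_fit(scale_factors, expectations)
    return intercept

# Hypothetical expectation values at noise scale factors 1, 3, 5
# (odd factors arise from folding a circuit U into U (U^dagger U)^k).
scales = [1.0, 3.0, 5.0]
measured = [0.80, 0.62, 0.44]
mitigated = zero_noise_extrapolate(scales, measured)
print(round(mitigated, 4))
```

The mitigated estimate requires three circuit executions, two of them deeper than the original, which illustrates the overhead noted above.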
The results also showed that designing more efficient quantum feature encoding schemes was repeatedly proposed to reduce qubit usage and encoding overhead. Recent work on trainable embeddings, evolutionary architecture search, and quantum data compression suggests that encoding efficiency can be improved through structured optimization. For example, QRAC-based embeddings were reported to increase accuracy while using fewer qubits, and quantum run-length encoding was reported to provide substantial resource gains for specific data structures [45]. The reported improvements associated with QRAC-based approaches in this review reinforced that encoding choices can materially affect recall and F1.
In addition, the findings indicated a strong need for scalable QML architectures that can handle large phishing datasets and for hybrid quantum–classical pipelines that balance quantum capability with classical efficiency. Hybrid designs, where quantum circuits support feature extraction or intermediate representation learning and classical models perform final classification, are more scalable because classical components can process large datasets without being limited by qubit budgets. In the near term, this approach appears the most practical because it acknowledges hardware constraints while still enabling targeted quantum components to be evaluated.
The results further showed that adaptive and incremental QML models were proposed to address concept drift and evolving phishing strategies. Phishing is inherently adversarial and non-stationary, as attackers continually alter URLs and social engineering tactics to evade detection. Drift-aware and adaptive learning frameworks are widely used in cybersecurity to monitor performance shifts and trigger retraining, yet such mechanisms were largely absent from the reviewed QML studies. This gap matters because deployment-ready phishing detection requires continuous adaptation, not one-off training.
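A drift-aware trigger of the kind described here can be very simple. The sketch below tracks phishing recall over a sliding window of confirmed phishing URLs and signals retraining when it falls below a threshold. The class name, window size, and threshold are hypothetical, and the monitor is independent of whether the underlying classifier is classical or quantum.

```python
from collections import deque

class RecallDriftMonitor:
    """Flags retraining when phishing recall over a sliding window drops below a threshold."""

    def __init__(self, window=200, min_recall=0.85):
        self.outcomes = deque(maxlen=window)  # 1 = phishing caught, 0 = phishing missed
        self.min_recall = min_recall

    def update(self, was_detected):
        """Record the outcome for one confirmed phishing URL."""
        self.outcomes.append(1 if was_detected else 0)

    def needs_retraining(self):
        # Wait for a full window so a few early misses do not trigger retraining.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return sum(self.outcomes) / len(self.outcomes) < self.min_recall

monitor = RecallDriftMonitor(window=5, min_recall=0.8)
for detected in [True, True, True, False, False]:
    monitor.update(detected)
print(monitor.needs_retraining())
```

For QML models the cost of each triggered retraining run is itself a constraint, given the execution times discussed earlier.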
Finally, this study emphasized the need for standardized datasets and evaluation protocols, systematic benchmarking on real quantum hardware, expansion to broader settings such as transaction networks and multimodal or zero-shot scenarios, and improved explainability to support trustworthy deployment. Together, these directions will move the field from proof-of-concept demonstrations toward reproducible and security-relevant research.
5.6. Implications for Interpreting the Current Evidence
The findings supported the interpretation that QSVM or QSVC and VQC were the most mature candidates for further research iteration because they consistently reported competitive metrics, including improvements in accuracy, recall, and F1, as well as robustness under perturbations and gains driven by encoding choices. These approaches also benefited from stronger theoretical foundations, widely available implementations in platforms such as Qiskit and PennyLane, and a broader empirical base than newer phishing-focused approaches such as QCNN or QNN. However, greater research maturity did not automatically imply operational readiness [46].
At the same time, the results did not support strong deployment claims. Simulator dependence, NISQ noise, qubit constraints, and cost and latency limitations were repeatedly reported as unresolved barriers. Overall, the current literature was more informative about where QML might add value than about when it will be ready to replace, or reliably outperform, well-optimized classical phishing detectors.
In the near term, QML should be treated as a complementary tool for narrowly scoped tasks, such as anomaly detection in structured network data or feature selection that captures complex interaction patterns, rather than as a full replacement for classical phishing detection systems. This targeted strategy will enable incremental validation on real quantum hardware, accumulation of operational evidence, and gradual expansion of scope as quantum devices improve and QML methods become more robust.
6. Limitations
While this early systematic review provided a structured synthesis of QML research in phishing detection, several limitations should be acknowledged when interpreting the findings. First, the evidence base remained small, as only six eligible primary studies were included. This constraint limited the strength of cross-study comparisons and reduced the ability to generalize observed trends beyond early-stage or proof-of-concept implementations.
Second, the search strategy and eligibility criteria introduced potential selection bias. The review was restricted to a predefined set of databases, a recent publication window, and English-language, peer-reviewed articles. As a result, relevant studies using alternative terminology, such as “malicious URL detection” without explicitly referencing “phishing,” as well as quantum-inspired approaches, technical reports, preprints, or industry evaluations, may not have been captured. In addition, some potentially relevant studies were excluded due to full-text access limitations, which further constrained coverage.
Finally, many of the reported results were obtained under simulation or highly controlled experimental conditions. Performance on real quantum hardware remains sensitive to NISQ noise and qubit constraints. Consequently, the conclusions of this review primarily reflect current experimental feasibility rather than readiness for deployment in real-world phishing detection systems.
7. Conclusions
This systematic review synthesizes the emerging literature on QML for phishing detection. Because the evidence base remains small (only six studies met the eligibility criteria) and study designs were heterogeneous, these conclusions should be interpreted as early evidence rather than definitive guidance. Reported performance and claimed advantages were study-specific and not always directly comparable due to differences in datasets, feature sets, encodings, backends, and evaluation protocols.
Across the included studies, four QML models were applied to phishing detection: QSVM/QSVC, VQC, QCNN, and QNN. QSVM/QSVC and VQC were evaluated most frequently and often reported competitive results under controlled or simulated settings. However, outcomes varied with feature encoding choices, circuit depth, ansatz design, and leakage control, and only limited evidence supported direct comparison against strong classical baselines.
Taken together, the current literature positions QML as an experimental component that can complement classical phishing pipelines, particularly as a representation mechanism that may introduce different inductive biases through high-dimensional Hilbert spaces. At this stage, the evidence does not support broad claims of superiority or deployment readiness.
This review contributes a structured synthesis that maps models, feature encoding strategies, reported advantages and disadvantages, and realistic constraints in the NISQ era, and it consolidates challenges and future directions into a practical roadmap. It also identifies recurrent reporting gaps that limit reproducibility, including incomplete dataset descriptions, inconsistent train-test splits, and limited disclosure of hyperparameter tuning and resource costs.
Future work should prioritize noise-aware training and inference on real quantum backends, qubit-efficient encodings tailored to phishing features, and evaluation on larger, temporally realistic datasets that reflect evolving attack behavior. Community benchmarks with shared splits, clear leakage controls, and transparent reporting of preprocessing and tuning should be established, and hybrid designs should justify when the quantum component adds measurable value beyond classical alternatives.
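To illustrate what a temporally realistic split with an explicit leakage control could look like, the following minimal sketch partitions labeled URLs at a cutoff date and then drops any test URL whose host already appears in the training set. The record layout (URL, label, first-seen date), the function names, and the choice of host-level matching are assumptions made for illustration; they are not drawn from any of the reviewed studies, and a shared benchmark might define leakage differently (for example, at the registered-domain or campaign level).

```python
from datetime import date
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host of a URL ("" if it cannot be parsed)."""
    return (urlsplit(url).hostname or "").lower()


def temporal_split(records, cutoff):
    """Split (url, label, first_seen) records at a cutoff date.

    Records first seen before the cutoff go to training; later records go
    to testing, minus any whose host was already seen in training.
    """
    train = [r for r in records if r[2] < cutoff]
    later = [r for r in records if r[2] >= cutoff]
    train_hosts = {host_of(r[0]) for r in train}
    # Leakage control (assumed policy): no host may occur in both sets.
    test = [r for r in later if host_of(r[0]) not in train_hosts]
    return train, test


# Hypothetical records using reserved ".test" hosts.
records = [
    ("http://login.example-bank.test/a", 1, date(2024, 1, 5)),
    ("http://shop.example.test/", 0, date(2024, 2, 1)),
    ("http://login.example-bank.test/b", 1, date(2024, 3, 10)),
    ("http://new-phish.test/x", 1, date(2024, 3, 12)),
]
train, test = temporal_split(records, date(2024, 3, 1))
```

In this toy example the third record falls after the cutoff but shares a host with a training record, so it is excluded from the test set. Reporting the cutoff date and the number of records removed by such a filter is one way to make the split reproducible.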
Future investigations should also move beyond reporting classification accuracy alone and incorporate practical performance metrics, including execution time, shot efficiency, and error-mitigation overhead. These factors directly influence the feasibility of real-world deployment and provide a more comprehensive assessment of algorithm efficiency. More transparent and systematic hardware-level evaluation will be essential to determine whether reported improvements represent true algorithmic progress or device-specific behavior.
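As a rough illustration of why shot counts matter for such reporting, the sketch below uses the standard sampling result that the mean of N independent ±1-valued measurement outcomes with true expectation μ has standard error sqrt((1 − μ²)/N). The function names are hypothetical, and the calculation assumes ideal, independent shots; it ignores hardware noise, readout error, and error-mitigation overhead, all of which would increase the cost on real NISQ devices.

```python
import math


def stderr_from_shots(expectation: float, shots: int) -> float:
    """Standard error of the sample mean of a +/-1-valued measurement.

    Assumes independent, noise-free shots (an idealization).
    """
    return math.sqrt((1.0 - expectation ** 2) / shots)


def shots_for_stderr(expectation: float, target: float) -> int:
    """Approximate number of shots needed to reach a target standard error."""
    return math.ceil((1.0 - expectation ** 2) / target ** 2)


# Quadrupling the shot budget only halves the sampling error.
err_10k = stderr_from_shots(0.0, 10_000)
err_40k = stderr_from_shots(0.0, 40_000)
```

Because the error shrinks only with the square root of the shot count, two studies reporting similar accuracy may differ substantially in measurement cost, which is one reason to report shots alongside accuracy.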
Until these evaluation practices are adopted consistently, cross-study comparisons should be treated cautiously. Deployment-oriented testbeds and open benchmarks will help translate progress from simulator-centric proofs of concept into evidence that is relevant for operational phishing defense.