1. Introduction
While cybersecurity tools grow increasingly sophisticated, phishing attacks persist and evolve by exploiting fundamental human impulses to trust, react quickly to authority, and respond to emotional triggers, making the human mind both our greatest asset and our most persistent vulnerability. These attacks leverage various vectors, such as malicious URLs (Uniform Resource Locators), deceptive screenshots, and misleading HTML (Hypertext Markup Language) text, to bypass traditional security measures and gain unauthorized access to personal and organizational data. Furthermore, these malicious activities have grown in sophistication, adapting to conventional detection methods and necessitating more robust countermeasures. According to the Anti-Phishing Working Group’s 4th Quarterly Phishing Activity Report 2024, 989,123 phishing attacks were observed, up from 877,536 in the second quarter [
1]. The report highlights several key trends, such as (i) attackers now sending Google Street View photos of victims’ homes in targeted emails, and (ii) an alarming new trend in extortion emails that include the recipient’s phone number and home address as part of the lure [
1], underscoring the growing sophistication and danger of phishing attacks.
While single-modality approaches, focusing on specific attack vectors such as URLs [
2,
3], web page screenshots [
4], and HTML content [
5,
6,
7], have shown promise, they often fall short in addressing the multifaceted nature of modern web page phishing techniques. The limitations of unimodal detection strategies have become increasingly apparent as cybercriminals employ more complex and diverse tactics to evade detection. For instance, URL-based methods may struggle with compromised domains or typosquatting. Similarly, screenshot analysis alone may fail to capture highly diverse layouts, whereas HTML content analysis may miss clues that could indicate malicious intent. In a similar vein, logo-focused approaches must rely on domain-checking mechanisms, since the mere presence of brand logos does not imply malicious intent. Today, however, the foremost goal and challenge of contemporary anti-phishing systems is recognizing zero-day attacks, in which a clever, previously unseen tactic is exploited that machine learning (ML) models have not encountered before, leaving a critical vulnerability.
According to Parcalabescu et al. [
8], a machine learning task is considered multimodal when the inputs or outputs are represented in different ways or consist of distinct types of fundamental units of information. In response to the above-mentioned challenges, multimodal approaches have emerged as a promising avenue for enhancing phishing detection in terms of accuracy and robustness. By integrating diverse data sources (e.g., URL, logo, screenshot, HTML content, favicon, etc.) and leveraging the complementary strengths of various representations, multimodal methods offer great potential for a more comprehensive and enriched understanding of phishing threats. This holistic approach aims to detect subtle patterns and correlations that may be imperceptible when examining a single modality in isolation. At this point, Lee et al. [
9] note that the synergistic potential of combining URL analysis, visual inspection, and structural content evaluation presents an opportunity to improve the effectiveness of anti-phishing systems. However, it is important to realize that multimodal approaches are imperfect due to several technical challenges (e.g., the high number of features and resultant high computation burden). For example, the method followed by Wang et al. [
10] requires collecting multiple types of features (domain registration, content analysis, behavioral patterns) that may not always be readily available or reliable, potentially limiting its practical applicability across diverse web environments.
Existing multimodal phishing detection approaches suffer from three fundamental limitations that constrain their practical deployment and generalization capability. First, methods such as Jail-Phish [
11], SenseInput [
12], and the approach by Yu et al. [
13] rely heavily on handcrafted features extracted from HTML structure, CSS properties, or visual elements, requiring extensive domain expertise and becoming brittle as attackers adapt their techniques. Second, several state-of-the-art systems depend on third-party services. For instance, Jail-Phish [
11] queries Google Search for domain verification, while recent LLM-based approaches like [
9] rely entirely on commercial APIs without custom model development, introducing latency, cost, and availability dependencies unsuitable for high-throughput security applications. Third, most existing multimodal methods [
11,
12,
14,
15] employ multi-stage pipelines where modalities are processed separately and combined through late fusion or model stacking, preventing the learning of joint cross-modal representations that could capture subtle interdependencies between URL patterns, visual deception, and semantic manipulation.
CrossPhire fundamentally differs from prior work through key methodological innovations. First, we introduce semantic-aware content extraction using markup-free text parsed from HTML and encoded with sentence transformers (MPNet), eliminating reliance on DOM structure analysis while capturing linguistic deception patterns. Unlike HTML-based approaches [
16,
17,
18] that process structural markup or methods that require English language content [
6,
19], our approach is language-independent through multilingual embeddings or efficient translation preprocessing. To the best of our knowledge, this is the first application of sentence transformers to markup-free web page text for phishing detection. Second, we propose an end-to-end joint training architecture where URL (via GramBeddings), visual (via fine-tuned CNNs), and semantic (via frozen sentence transformers) encoders are simultaneously optimized through a unified embedding space, enabling the model to learn cross-modal feature interactions rather than treating modalities as independent information sources. This contrasts with prior multi-stage approaches [
11,
12] and ensemble methods [
15] that combine separately-trained models. Our main contributions are:
A novel end-to-end multimodal architecture that jointly learns discriminative representations from URL syntax, visual appearance, and semantic content through unified embedding optimization, eliminating the multi-stage pipelines and handcrafted features that characterize prior multimodal approaches [
11,
12,
13].
The first application of sentence transformers to markup-free HTML text for phishing detection, enabling language-independent semantic analysis without reliance on DOM structure or third-party translation services during inference.
Two novel multimodal datasets: Phish360 (10,748 samples spanning 2020–2024 with rigorous duplicate elimination and diversity validation) and Phish360-Zeroday (1,080 samples from February 2025), addressing the temporal bias and quality issues we identified in existing benchmarks through comprehensive dataset analysis.
Rigorous cross-dataset evaluation demonstrating generalization capability across five datasets and temporal robustness on zero-day attacks, including comparative analysis of two fusion strategies (concatenation vs. mixture of experts) and modality dropout training for missing modality resilience.
A LIME-based explainability framework providing hierarchical explanations at both modality-level (quantifying URL/visual/semantic contributions) and token-level (highlighting specific characters, image regions, and phrases driving classification).
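As a lightweight illustration of the modality-level attribution idea (a simplified sketch, not the paper's actual LIME framework), the probe below follows the LIME recipe at modality granularity: binary masks keep or occlude whole modality embeddings, a stand-in model is queried for each perturbation, and a local linear surrogate fitted by least squares yields per-modality contribution estimates. The embeddings, weights, and scoring function here are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings and weights for one sample (all dimensions illustrative).
emb = {
    "url": rng.normal(size=64),
    "visual": rng.normal(size=64),
    "semantic": rng.normal(size=64),
}
w = {m: rng.normal(size=64) for m in emb}  # stand-in "model" weights

def model_score(mask):
    """Hypothetical phishing score with some modalities masked out (0 = occluded)."""
    z = sum(mask[i] * (w[m] @ emb[m]) for i, m in enumerate(emb))
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

# LIME-style probe: perturb binary modality masks, query the model,
# and fit a local linear surrogate whose coefficients act as contributions.
masks = rng.integers(0, 2, size=(200, 3)).astype(float)
scores = np.array([model_score(m) for m in masks])
X = np.hstack([masks, np.ones((200, 1))])  # add an intercept column
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
contributions = dict(zip(emb, coef[:3]))   # modality -> estimated contribution
```

The same surrogate-fitting idea extends to token- or region-level masks for finer-grained explanations.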
The rest of the paper is organized as follows:
Section 2 reviews related studies, and
Section 3 explains our motivation.
Section 4 presents the employed datasets, while
Section 5 details the proposed scheme.
Section 6 reports the findings and results,
Section 7 discusses the strengths and limitations of the proposed approach, and
Section 8 concludes the study.
2. Related Work
Phishing detection methods have evolved significantly, with various approaches classified into several distinct categories. Traditional research often groups these methods into (i) list-based, (ii) similarity-based, and (iii) machine learning-based approaches. However, we categorize the existing literature into five main categories based on the data modality used: (1) URL-based, (2) content-based, (3) vision-based, (4) bimodal, and (5) multimodal approaches. In this section, we review state-of-the-art anti-phishing approaches from recent years according to these categories. We summarize key studies, highlighting their main findings, datasets used, and known limitations. This structured overview aims to provide a clear understanding of the strengths and weaknesses of current phishing detection methods, setting the stage for our proposed approach.
2.1. URL-Based Phishing Detection
Early URL-based methods relied on blacklists, which were difficult to maintain due to the need for constant updates. Recent studies have shifted toward machine learning, extracting lexical and statistical features from URLs [
20,
21], such as length [
21], dots, or hyphens [
20]. Advanced techniques include automatic feature extraction using NLP methods like TF-IDF (term frequency-inverse document frequency) [
22,
23,
24], N-gram features, and embeddings [
3,
25,
26] for more accurate detection.
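To make the TF-IDF idea concrete, the following stdlib-only sketch computes TF-IDF weights over URL tokens; the tokenization scheme and toy URLs are illustrative assumptions, not the cited works' pipelines.

```python
import math
import re
from collections import Counter

def tokenize(url: str):
    """Split a URL into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())

def tfidf(urls):
    """Return one {token: tf-idf weight} dict per URL."""
    docs = [Counter(tokenize(u)) for u in urls]
    n = len(docs)
    df = Counter(t for d in docs for t in d)  # document frequency per token
    return [
        {t: (c / sum(d.values())) * math.log(n / df[t]) for t, c in d.items()}
        for d in docs
    ]

vecs = tfidf([
    "http://paypa1-secure.example/login",
    "https://www.paypal.com/signin",
    "http://update-account.example/verify",
])
```

Tokens that appear in every URL receive zero weight, while distinctive tokens (e.g., a typosquatted brand fragment) are up-weighted, which is why such features discriminate well in the cited studies.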
In 2019, Sahingoz et al. [
2] collected and published the EBBU2017 dataset with 73,000 URL samples and extracted 40 NLP-based features from URLs, achieving 97.98% accuracy using a Random Forest classifier. Rao et al. [
22] later developed the CatchPhish dataset, combining handcrafted and TF-IDF features to reach 96.67% accuracy with a similar model, validated across two benchmark datasets with accuracies up to 98.57%. Recently, Haynes et al. [
27] fine-tuned pre-trained transformers (BERT and ELECTRA) on URL data, achieving 96.3% accuracy, showing that transformers outperform traditional models with less training time.
Shirazi and Hayne [
28] introduced MobileBERT for phishing detection on mobile devices, achieving a runtime three times faster than its BERT-base counterpart while exceeding 97% accuracy. Jishnu and Arthi [
25] combined BERT embeddings with handcrafted features, achieving 97.32% accuracy on 200,000 URLs. In another study, Jishnu and Arthi [
26] used RoBERTa for feature extraction and LSTM for classification, reporting an accuracy of 97.14% on 300,000 URLs. Both studies highlighted strong performance using transformer-based models for phishing detection.
In a different vein, Bozkir et al. [
3] introduced
GramBeddings, a deep learning model for phishing detection using URL character-level n-grams. They collected a balanced dataset of 800K phishing and legitimate URLs to address the scarcity of public large-scale URL datasets. Their model, combining CNN, BiLSTM, and attention layers, achieved an accuracy of 98.27% on their dataset and outperformed other methods on seven public benchmarks with accuracies of at least 98.32%. URL-based schemes offer much faster processing times. They are, however, vulnerable to zero-day attacks, mainly due to a lack of prior knowledge, rapid attack evolution, and, more importantly, their limited context [
29].
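As an illustration of the character-level n-gram representation underlying such models (a simplified sketch, not the actual GramBeddings pipeline), a URL can be decomposed into overlapping n-grams that are mapped to integer ids suitable for an embedding layer; the vocabulary scheme and `max_len` are assumptions for this example.

```python
def char_ngrams(url: str, n: int = 3):
    """Overlapping character n-grams of a URL."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]

def build_vocab(urls, n=3):
    """Map every observed n-gram to an integer id (0 reserved for unknown/padding)."""
    vocab = {}
    for u in urls:
        for g in char_ngrams(u, n):
            vocab.setdefault(g, len(vocab) + 1)
    return vocab

def encode(url, vocab, n=3, max_len=20):
    """Fixed-length id sequence, truncated or zero-padded, ready for an embedding layer."""
    ids = [vocab.get(g, 0) for g in char_ngrams(url, n)][:max_len]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["http://a.com", "http://evil-a.com"])
seq = encode("http://a.com", vocab)
```

Because n-grams are purely character-based, no handcrafted lexical features or language assumptions are needed, which is one reason such encodings transfer well across datasets.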
2.2. Content-Based Phishing Detection
Content-based phishing detection involves analyzing websites’ textual content and HTML structure to detect phishing intentions or cues. Early approaches heavily relied on handcrafted features, which were partially effective but prone to obsolescence due to the dynamic nature of phishing attacks.
Among conventional works,
CANTINA, introduced by Zhang et al. [
19], uses the top five words from TF-IDF values for reverse search, classifying websites as legitimate if the domain appears in the first N results, achieving a 97% true positive rate. Despite its success,
CANTINA has limitations, including a reliance on the English language and high false positive rates. To address these issues, Xiang et al. later proposed
CANTINA+ [
6], which employs 15 features from the HTML Document Object Model (DOM) and third-party services. Similarly, Hou et al. [
5] developed a method for malicious content detection using dynamic HTML, extracting 17 features and reporting an accuracy of 96.14% through boosted decision trees. Nevertheless, hand-crafted features are prone to being bypassed by attackers, are expensive to extract, and require expert knowledge. Furthermore, as Opara et al. [
16] stated, they struggle to keep pace with the evolving nature of phishing attacks and fail to capture semantic patterns in textual content [
30].
To overcome these limitations, researchers are shifting towards automatic feature extraction. Towards this direction, Opara et al. [
16] introduced
HTMLPhish, which employs convolutional neural networks to extract embeddings from HTML content. They achieved a testing accuracy of 93% with a dataset of 25,000 samples. HTMLPhish, nonetheless, struggles with capturing semantic relationships and relies heavily on training data.
Ouyang et al. [
17] followed a different path by employing a graph neural network (GNN) approach, representing DOM tags as nodes and edges, achieving an accuracy of 95.5% on a large dataset. However, this method can be bypassed by cloning legitimate HTML structures. In a different vein, Benavides-Astudillo et al. [
31] used GloVe [
32] embeddings to capture semantic features in HTML. They achieved a mean accuracy of 97.39% with a Bidirectional Gated Recurrent Unit (BiGRU) on an imbalanced dataset, though the study was limited by a small validation sample size. Likewise, Çolhak et al. [
18] proposed an approach based on fusing CANINE and RoBERTa embeddings of the HTML content. The authors used both textual and numerical features extracted from the HTML. They reached an accuracy of 97.18% on their dataset and 89.58% on a benchmark dataset using a multilayer perceptron. However, the handcrafted features require expert domain knowledge and can be bypassed by attackers.
2.3. Vision-Based Phishing Detection
Vision-based methods in anti-phishing studies are motivated by the fact that phishing web pages mimic their legitimate counterparts in holistic appearance or specific visual elements, such as logos, layouts, favicons, and color schemes, to deceive users effectively. These schemes often use image classification, object detection, or similar vision models to detect any possible visual similarity between screenshots or smaller components of a phishing material and its legitimate counterpart.
Phoka and Suthaphan [
33] developed a phishing detection method using pre-trained CNNs on login page images for five brands. They implemented data augmentation through sub-image placement and achieved an accuracy of 97.1% using the Inception-ResNet-v1 model. From the perspective of logo similarity, Bozkir et al. [
34] introduced
LogoSENSE, which employs max-margin object detection (MMOD) for logo detection and extracts features using histogram of oriented gradients (HOG). Evaluated on 3060 training and 1979 testing samples, it achieved a precision of 93.50% and an F1-score of 85.02%. Addressing the lack of explainability in phishing detection systems, Lin et al. [
35] proposed Phishpedia, a two-step deep learning approach that detects logos and matches them with legitimate references. Using Faster R-CNN and a Siamese model with ResNet, Phishpedia obtained 89.2% precision, 87.1% recall, and 99.2% phishing identification rate on a dataset of 30,649 samples.
Wang et al. [
36] introduced a vision-based approach for recognizing phishing web pages using screenshots. Their deep learning-based approach combines local and global features by first locating and extracting the logo and then integrating it with the full screenshot. The authors evaluated their approach on two public datasets, PhishPedia and VisualPhish, achieving 95% and 85% accuracy, respectively.
2.4. Bimodal Approaches
In contrast to the single-modality schemes presented above, bimodal anti-phishing approaches combine two data modalities, typically either two textual modalities (URL and HTML content) or visual and textual data, to improve robustness in phishing detection.
Van Dooremaal et al. [
37] proposed a phishing detection method using HTML and images. They obtained an accuracy of 99.66% by combining visual and textual features, applying reverse image search for brand identification, and logistic regression for classification. Sánchez-Paniagua et al. [
38] introduced the PILWD-134K dataset and used 54 handcrafted features to reach an accuracy of 97.95% using LightGBM, though the approach requires significant manual effort. Liu et al. [
39] developed
PhishIntention, which identifies phishing intent using deep learning models by leveraging visual and HTML content, achieving an accuracy of 95% for credential-requiring page (CRP) detection and 93.3% for CRP transitions.
Vo Quang et al. [
40] proposed a deep learning approach that utilizes the URL and the HTML DOM structure, called
Shark-Eyes. The authors report an accuracy of 95.35% on a self-collected dataset and 92.55% on adversarial phishing samples. While Shark-Eyes demonstrates attention-based fusion of URL and DOM features, its reliance on DOM structure makes it vulnerable to HTML obfuscation attacks and limits its ability to capture semantic content. In contrast, CrossPhire’s use of markup-free text extraction avoids DOM-based vulnerabilities while preserving semantic meaning through sentence transformers.
Tong et al. [
41] proposed a new bimodal approach utilizing the raw URL and the sequence of HTML tags, encoding both modalities with ConvBERT combined with positional encoding. Likewise, Opara et al. [
30] introduced
WebPhish, a hybrid deep neural network utilizing raw URL and HTML content using CNNs. Their proposed network concatenates character-level embeddings from URLs and word embeddings from HTML. WebPhish outperformed its unimodal alternatives, achieving an accuracy of 98.1% on their self-collected dataset. However, WebPhish’s word-level HTML embeddings are language-specific and training-data dependent, limiting cross-lingual generalization. CrossPhire addresses this through multilingual sentence transformers (MPNet and XLM-RoBERTa) that provide language-independent semantic representations.
Lee et al. [
9] introduced a two-step phishing detection method based on large language models (LLMs) using the web page’s screenshot and HTML. The approach uses LLM prompts for brand identification and domain verification. The method achieved 90% precision and recall for a collection of 4480 samples. While leveraging LLMs offers strong contextual understanding, this approach’s complete reliance on third-party commercial APIs (GPT-4) introduces cost, latency, and availability constraints that limit practical deployment. In contrast, CrossPhire’s self-contained architecture enables offline operation and direct model optimization without external dependencies.
2.5. Multimodal Approaches
According to our taxonomy, multimodal phishing detection methods integrate more than two data modalities (e.g., URL, HTML, and images) to create comprehensive models that capture various aspects of phishing web pages. We review the latest multimodal approaches that combine the three main data modalities for phishing detection.
Rao and Pais [
11] introduced
Jail-Phish, a search engine-based system designed to address Phishing Sites Hosted on Compromised Servers (PSHCS). Jail-Phish classifies web pages by querying Google and comparing domain and title information with the top 10 search results. The system extracts features from the URL, CSS, JavaScript, and image files (e.g., logos or favicons) to calculate the Jaccard similarity between the query and result pages. Jail-Phish achieved an accuracy of 98.61% on a dataset of 11,500 samples. However, its reliance on Google Search API introduces latency, third-party dependency, and potential failures when phishing sites become indexed. CrossPhire’s self-contained feature extraction avoids these external dependencies while maintaining high accuracy.
Yu et al. [
13] developed a phishing detection method by combining URL, HTML, and image features. Their approach uses LSTM layers to process URL and HTML text data, while CNNs with CBAM attention extract image features. Several pre-processing steps are applied to the screenshot. The concatenated feature vectors from each modality are then fed into fully connected layers, achieving a 97.75% accuracy with a multilayer perceptron on a dataset of 6000 samples. The authors also experimented with combinations of the different modalities, resulting in slightly lower accuracies of ∼93% and ∼96%. While their CBAM attention mechanism enhances feature extraction, the multi-stage training process (separate modality encoders) prevents end-to-end joint optimization. CrossPhire’s unified training framework allows gradient flow across all modalities, enabling the discovery of cross-modal patterns unavailable in separately-trained architectures.
Lin et al. [
12] presented
SenseInput, a multimodal system leveraging URL, HTML, and screenshots. The system introduced nine new features, including statistical and sensitive input features, using LightGBM to achieve an F1-score of 98.48%. Despite strong performance, SenseInput relies on 22 handcrafted features requiring domain expertise and manual engineering, making it labor-intensive and potentially brittle to evolving attack patterns. In contrast, CrossPhire’s automatic feature extraction via deep learning eliminates manual feature engineering while maintaining adaptability to new phishing tactics. In a similar vein, Tan et al. [
14] extended
PhishWHO [
42] by incorporating both visual and textual identity features using logo and text extraction to identify phishing sites. They validated their model on two datasets (DS-1 and DS-2), achieving an accuracy of 98.60%, with the approach excelling at detecting phishing sites that use either textual or visual identities. However, the method’s dependence on reverse search engines and domain verification services introduces latency and requires continuous internet connectivity. CrossPhire’s offline-capable architecture provides deployment flexibility without compromising detection quality.
Zhou et al. [
15] proposed a multimodal phishing detection method integrating URL, text, and screenshot data. Their
MultiiRECG approach utilized model stacking and achieved an accuracy of 88.82%, outperforming unimodal methods in handling 11 phishing categories. While model stacking combines diverse classifiers, it requires multi-stage training and lacks end-to-end optimization, limiting the model’s ability to learn joint representations. Additionally, the reliance on handcrafted URL features introduces manual effort and potential fragility.
Li et al. [
43] proposed
KnowPhish, a novel brand knowledge base (BKB) incorporating around 20k targeted brands and a method for detecting phishing web pages that combines visual and textual modalities. Their approach, KnowPhish Detector, outperformed the baseline approaches, achieving accuracy of 92.49% for the TR-OP dataset. KnowPhish’s strength lies in its comprehensive brand knowledge base; however, maintaining and updating 20k brand profiles requires continuous manual curation. CrossPhire’s brand-agnostic approach using semantic text analysis generalizes beyond known brands without requiring brand-specific knowledge bases.
KnowPhish was then employed by Cao et al. [
44] in their proposed approach,
PhishAgent. PhishAgent uses both online and offline knowledge bases to detect the targeted brand using HTML and the web page’s logo and classify the web page by comparing the domains. PhishAgent achieved an average accuracy of 95.03% on three benchmark datasets. While PhishAgent demonstrates strong performance through its dual knowledge base strategy, inheriting KnowPhish’s brand database maintenance burden and requiring brand-specific training data limits scalability to emerging brands.
These studies highlight the increasing trend of integrating multimodality to improve detection accuracy and robustness against various phishing attacks.
2.6. Comparative Analysis of Multimodal Approaches
To systematically position CrossPhire within the landscape of multimodal phishing detection methods,
Table 1 presents a comprehensive comparison of architectural designs and fusion strategies across recent multimodal approaches. As demonstrated, CrossPhire distinguishes itself through four key architectural innovations: (1)
automatic semantic feature extraction via sentence transformers on markup-free text, eliminating the handcrafted features required by Jail-Phish [
11] and SenseInput [
12]; (2)
end-to-end joint training of all modality encoders through a unified loss function, contrasting with the multi-stage pipelines employed by Yu et al. [
13] and Zhou et al. [
15] where modalities are processed and optimized separately; (3)
minimal third-party dependencies for core functionality, unlike Lee et al. [
9] which requires GPT-4 API access and Tan et al. [
14] which relies on domain verification services; and (4)
language-independent semantic analysis through multilingual sentence transformers, addressing the English-only limitation observed in WebPhish [
30].
The fusion strategy comparison reveals a critical architectural distinction: while methods like Yu et al. [
13] employ late fusion by concatenating independently-learned representations, CrossPhire’s early fusion approach enables gradient flow through all modality pathways during backpropagation. This allows the network to discover complementary cross-modal patterns between URL syntax, visual appearance, and semantic content that would remain hidden in multi-stage architectures where each modality encoder is optimized in isolation.
3. Motivation
A review of existing anti-phishing studies reveals several shortcomings: lack of standardized datasets, reliance on manual feature engineering, language dependency (typically, English only), third-party service dependencies (search engines, WHOIS, domain age), and vulnerability to attackers who manipulate handcrafted features (e.g., HTTPS usage, URL length). Additionally, single-modality systems are easily bypassed through obfuscation tactics (embedding text in images, HTML obfuscation, link redirection), and vision-based approaches fail when phishing pages do not mimic known brands.
We hypothesize that most phishing web pages, regardless of context, contain malicious textual content that solicits private information from victims in varying forms, and that extracting this content requires careful
carving of the HTML. Under this hypothesis, if this phishing-related, markup-free textual content is detected, isolated from the surrounding redundant content, and mapped into a discriminative embedding vector in a high-dimensional space, then we will be able to (1) obtain the semantic vector space of phishing intentions, (2) utilize it in anti-phishing, and (3) evaluate how this space differs from that of legitimate content. The choice of markup-free text extraction is based on the theoretical distinction between the content and the form of web pages. Michailidou et al. [
45] distinguish explicitly between ‘content on the website’ and ‘the website’s form with respect to user interface, navigation and structure’ as independent factors affecting page perception. Crucially, Deng and Poole (2012) [
46] demonstrate that previous research has only considered web page elements as isolated aesthetic factors and lacks a coherent theoretical framework, suggesting that markup elements contribute to visual aesthetics rather than semantic understanding. Michailidou et al. [
45], moreover, define visual complexity as arising from ‘the quantity of objects, clutter, openness, symmetry, organisation and variety of colors’. Including DOM structure would therefore introduce visual complexity features into a task that requires semantic content analysis. Since phishing detection fundamentally relies on identifying deceptive linguistic patterns in textual content rather than aesthetic or navigational properties, DOM-aware modelling would introduce noise from form-related features that are unrelated to the semantic deception signals we seek to capture. Additionally, theoretical frameworks for web page aesthetics (order and complexity) have been found to capture the main influence of the environment on people’s aesthetic experience [
46], emphasising that structure influences aesthetic judgement rather than content comprehension. For phishing classification, where the goal is to detect semantic and linguistic deception, markup-free text provides a cleaner signal by removing these aesthetic-oriented structural features.
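A minimal sketch of the markup-free extraction step, using Python's stdlib HTMLParser; the paper's exact parser and cleaning rules are not specified here, so treat the skip-list and whitespace handling as illustrative assumptions.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def markup_free_text(html: str) -> str:
    """Strip all markup, returning only the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><style>body{}</style></head><body>"
        "<h1>Verify your account</h1><script>x=1;</script>"
        "<p>Enter your password below.</p></body></html>")
text = markup_free_text(page)
```

In CrossPhire's pipeline a string like this would then be embedded with a sentence transformer (an MPNet checkpoint, per the paper); the specific checkpoint and any translation preprocessing are outside this sketch.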
Last but not least, single modality detection systems, which rely on only one source of information, are also easier for attackers to bypass through various obfuscation tactics, such as embedding text in images, link redirection, or HTML obfuscation. This narrow scope also leaves systems vulnerable to zero-day attacks, where newly developed tactics exploit new vulnerabilities to evade detection. Although multimodality has demonstrated effectiveness in improving detection, it remains underutilized.
To mitigate the problems listed above, we introduce a new multimodal anti-phishing approach, called CrossPhire, wrapped in an end-to-end deep learning model that leverages three sources of information: (i) the URL, (ii) the main textual content, and (iii) the web page screenshot. Sharing ideas with other multimodal anti-phishing studies at the fundamental level, we aim to capture the essence of phishing intention from different perspectives. To achieve this efficiently and effectively, we build a three-branched deep neural network architecture that produces highly discriminative embeddings for each modality and train it jointly in an end-to-end fashion. Consequently, motivated by this idea and the problems mentioned above, this study aims to deliver a modern, end-to-end trainable network-in-network architecture that analyzes web pages comprehensively to uncover different aspects of phishing evidence, without using any third-party service or handcrafted features. Moreover, we introduce and make public a new multimodal anti-phishing dataset, carefully collected and curated to ensure diversity and completeness.
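One robustness ingredient evaluated later is modality dropout: during training, whole modality embeddings are randomly zeroed so the fused model remains usable when a modality is missing at inference time. The sketch below is a hedged illustration of that idea (the dropout rate, the one-survivor guarantee, and the embedding shapes are assumptions, not the paper's exact scheme).

```python
import numpy as np

def modality_dropout(embs, p_drop=0.3, rng=None, train=True):
    """Randomly zero out whole modality embeddings during training.

    `embs` maps modality name -> embedding vector. At least one modality
    is always kept so the fused representation never collapses to zeros."""
    rng = rng or np.random.default_rng()
    if not train:
        return dict(embs)  # identity at inference time
    names = list(embs)
    keep = {m: rng.random() >= p_drop for m in names}
    if not any(keep.values()):  # guarantee one survivor
        keep[rng.choice(names)] = True
    return {m: (e if keep[m] else np.zeros_like(e)) for m, e in embs.items()}

embs = {"url": np.ones(4), "visual": np.ones(4), "semantic": np.ones(4)}
# With p_drop=1.0 every modality is dropped, so exactly one is force-kept.
out = modality_dropout(embs, p_drop=1.0, rng=np.random.default_rng(0))
```

Training with such perturbations encourages each branch to carry stand-alone signal, which is what the missing-modality experiments later probe.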
4. Datasets
This section presents a comprehensive overview of the datasets used in our study, beginning with the limitations of existing collections that motivated a new dataset, followed by the details of our Phish360 dataset and the publicly available benchmark datasets used for comparison.
Despite the long history of phishing attacks, the anti-phishing research community still lacks a standardized benchmark dataset comparable to ImageNet in computer vision. This has led to fragmented research using datasets collected across different time periods, making direct comparisons between methods challenging. Recent years have seen improvements with larger datasets [
3,
26] and multimodal collections incorporating HTML markup, screenshots, and URLs [
37,
38,
39,
47]. However, our initial analysis revealed persistent issues including screenshot rendering errors [
47], incomplete and missing content [
37,
39], and duplicate samples [
38].
To address these limitations, we developed Phish360, a new multimodal dataset designed to ensure (i) uniqueness of samples, (ii) diversity in URL features and targeted brands, and (iii) standardized, high-quality screenshots. This dataset forms the foundation for our experimental evaluations, alongside four publicly available benchmark datasets that enable comprehensive performance comparison.
4.1. A New Multimodal Dataset: Phish360
We introduce Phish360, a new multimodal anti-phishing dataset designed to ensure sample uniqueness, diversity, and standardized screenshot quality. To collect data, we developed a custom Java-based multi-threaded crawler with two modes: (1) seed mode for discovering new pages from seed URLs, and (2) list mode for crawling from URL lists. The crawler uses Selenium to render pages at 1280 × 960 pixels and saves them as .png files with embedded HTML and URL metadata. We also developed PhishBoring, a GUI tool for visual inspection and quality control (
Figure 1).
Legitimate samples were collected by seeding from Alexa's top 100 pages, accessing over 50,000 web pages. Phishing samples were collected weekly between 2020 and 2024 from PhishTank and OpenPhish. After manual inspection and duplicate removal using PhishBoring and Duplicate Image Finder, the final dataset contains 10,748 unique samples (6416 legitimate, 4332 phishing), covering the largest time span among the benchmarking datasets.
As shown in
Figure 2, the dataset is organized into TrainVal (80%) and Test (20%) folders. Each sample is stored separately following the naming convention
Pxxxx_brand for phishing and
Lxxxx_legitimate for legitimate samples, including URL, HTML, screenshot, and label.txt files. The Phish360 dataset and codebase presented in this study are available for academic use at
https://web.cs.hacettepe.edu.tr/~selman/phish360-dataset/.
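As an illustration of the naming convention above, the following minimal Python sketch (our own illustration, not part of the released codebase) maps sample folder names such as P0001_paypal to class labels:

```python
import re

# Hypothetical sketch: parse Phish360-style sample folder names such as
# "P0001_paypal" (phishing, targeting PayPal) or "L0042_legitimate".
# The regular expression is our own illustration of the convention
# described in the text, not the dataset's actual loader.
NAME_RE = re.compile(r"^(?P<cls>[PL])(?P<idx>\d+)_(?P<brand>.+)$")

def parse_sample_name(name):
    """Return (label, index, brand); label is 1 for phishing, 0 for legitimate."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized sample name: {name!r}")
    label = 1 if m.group("cls") == "P" else 0
    return label, int(m.group("idx")), m.group("brand")

print(parse_sample_name("P0001_paypal"))      # (1, 1, 'paypal')
print(parse_sample_name("L0042_legitimate"))  # (0, 42, 'legitimate')
```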
4.2. Benchmarking Datasets
To assess the effectiveness of our proposed methodology, we evaluate it using four publicly available multimodal datasets: PWD2016 [
47], PhishIntention [
39], PILWD134K [
38], and VanNL126K [
37]. These datasets span different time periods and consist of URL, HTML content, and screenshots. We summarize the data collection information in
Table 2.
PWD2016: The Phishing Website Dataset (PWD2016) was collected in 2016 [
47], containing 30,000 website samples in total, equally divided into phishing and legitimate classes. The phishing samples were collected from PhishTank, and to ensure the inclusion of less popular legitimate websites, the authors collected samples from DMOZ, BOTW, and Alexa.
PhishIntention: The PhishIntention dataset consists of 29,496 phishing samples, 25,400 legitimate and 3049 misleading legitimate samples [
39]. The legitimate samples were collected from Alexa, while the phishing cases were collected from OpenPhish between 2019 and 2020. Misleading legitimate samples refer to sign-in/login pages of legitimate websites, including but not limited to popular social media platforms (e.g., LinkedIn, Facebook, or Google). In this work, we merge the legitimate and misleading legitimate samples into a single legitimate class to create a more challenging environment, since many of the phishing samples involve similar login pages.
PILWD-134K: The Phishing Index Login Website Dataset (PILWD-134K) consists of 133,928 samples, constituting the largest benchmarking data in our experiment, divided equally into phishing and legitimate samples [
38]. The legitimate samples were collected from Quantcast Top Sites and Majestic List, while the phishing samples were collected from PhishTank between 2019 and 2020.
VanNL126K: This dataset has an imbalanced class distribution, with 125,938 phishing and 25,938 legitimate samples [
37]. According to the authors, the legitimate samples were collected from the DMOZ directory, whereas the phishing cases were collected from PhishTank, OpenPhish, and PhishStats between September and December 2019.
4.3. Evaluation of Datasets
We perform an in-depth analysis of the datasets, starting with evaluating URL samples to assess diversity, then examining other data modalities using preprocessed Parquet files for comprehensive analysis.
4.3.1. URL Analysis
This section examines the distinctions between legitimate and phishing URLs across various datasets to identify trends and concealed patterns. Thus, we begin by obtaining histograms of URL lengths to examine the overall distribution and frequency of values. As illustrated in
Figure 3, there are discernible discrepancies in URL lengths between phishing and legitimate samples across a range of datasets. For instance, in the PWD2016 dataset, legitimate URLs are significantly shorter than those in other datasets. Upon further examination, it was discovered that all legitimate URL samples in PWD2016 are restricted to just the domain and top-level domain (e.g., facebook.com). This could introduce bias in URL-based detection methods. Similarly, datasets such as PhishIntention and VanNL126k demonstrate a higher prevalence of shorter legitimate URLs, whereas phishing URLs tend to be longer due to the incorporation of additional path and query parameters employed by attackers to direct victims to malicious landing pages.
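The length statistics behind such histograms can be reproduced with a few lines of standard-library Python; the URLs below are toy examples, not dataset samples:

```python
from statistics import mean, median

# Illustrative sketch: per-class URL length summaries of the kind plotted in
# the histogram analysis. The two lists are toy stand-ins for the datasets.
phishing = ["http://secure-login.example-bank.com/verify?acct=123&tok=abc"]
legitimate = ["https://facebook.com", "https://wikipedia.org"]

def length_stats(urls):
    lengths = [len(u) for u in urls]
    return {"n": len(lengths), "mean": mean(lengths), "median": median(lengths),
            "min": min(lengths), "max": max(lengths)}

print(length_stats(phishing))
print(length_stats(legitimate))
```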
To evaluate diversity, we calculated the proportion of unique URL components (domains, TLDs, FLDs, subdomains).
Table 3 and
Table 4 show phishing and legitimate metrics. Legitimate samples demonstrate greater diversity than phishing samples. Phish360 exhibits the highest uniqueness (73.63% unique domains and 28.69% subdomains) by spacing collection over extended periods to minimize near-duplicate URLs from the same domain.
Table 3 reveals a relatively low percentage of unique phishing URLs in PWD2016, with a considerable number of duplicate samples. This high number of exact duplicates poses a significant risk of data leakage in machine learning: when the training and testing sets overlap, models appear overly optimistic and fail to generalize. Minimizing duplicate URLs is therefore crucial to prevent biased results and to improve model performance on new data. This issue deserves particular attention because a unimodal URL-based approach would be especially susceptible to data leakage if duplicate URL samples were not minimized.
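A uniqueness metric of this kind can be sketched with the standard library alone; a production version would use a public-suffix-aware extractor (e.g., tldextract, which we name here only as an assumption), while urlparse serves as a simple approximation:

```python
from urllib.parse import urlparse

# Simplified sketch of the uniqueness metric: the fraction of distinct values
# of some URL component (here, the hostname) among all URLs.
def unique_ratio(urls, key):
    values = [key(u) for u in urls]
    return len(set(values)) / len(values)

urls = [
    "http://login.example.com/a",
    "http://login.example.com/b",   # same host: a duplicate for this metric
    "https://shop.example.org/",
]
host = lambda u: urlparse(u).hostname
print(f"unique hostnames: {unique_ratio(urls, host):.2%}")  # 2 of 3
```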
4.3.2. Content and Screenshot Analysis
We analyzed HTML code and extracted text using Trafilatura (TF) and BeautifulSoup (BS), which yield superior quality compared to html2text and lxml.
We compare the percentages of unique HTML and BS text for phishing and legitimate samples. As shown in
Figure 4, phishing samples exhibit reduced uniqueness due to HTML code reuse across disparate URLs, with benchmark datasets ranging from 16% to 61%. In contrast, our Phish360 dataset exhibits a markedly higher degree of uniqueness in phishing HTML samples, reaching 96.63%. This high uniqueness can be attributed to three key factors: manual sample verification, source diversification, and the collection of samples over extended periods, which collectively enhanced both the uniqueness and diversity of the phishing samples.
A potential issue inherent to single modality anti-phishing datasets, particularly those comprising screenshots or HTML content, is the risk of data leakage due to duplicate samples. Conversely, multimodal datasets that integrate diverse information sources can alleviate this issue by enhancing overall variability. To evaluate the degree of uniqueness, we consider each sample as a triplet comprising: the URL, the HTML code, and a hash of the screenshot image. The SHA-256 hash function was employed to generate unique identifiers for each web page. As illustrated in
Figure 5, the triplet approach resulted in a notable increase in the percentage of unique samples in the benchmark datasets, from approximately 20% to 80–90%, even when individual components such as URLs or HTML were duplicated. Furthermore, during curation, Phish360 underwent multiple rounds of human visual inspection and Duplicate Image Finder checks to eliminate samples with even very similar screenshots.
Figure 5 demonstrates that representing data as triplets of URL, extracted textual content (BS or TF), and image hash significantly improves uniqueness metrics. The percentages in subfigures (a) and (b) of
Figure 5 are calculated by dividing the number of unique triplets by the number of valid samples (those having a valid URL, HTML, and image).
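The triplet-uniqueness check can be sketched as follows; the per-component hashing scheme below is our own simplification of the SHA-256 identifier described above:

```python
import hashlib

# Sketch of the triplet-uniqueness check: each sample is identified by a
# SHA-256 digest over its (URL, extracted text, screenshot bytes) triplet,
# so a sample counts as a duplicate only if all three components repeat.
def triplet_id(url: str, text: str, screenshot: bytes) -> str:
    h = hashlib.sha256()
    for part in (url.encode(), text.encode(), screenshot):
        h.update(hashlib.sha256(part).digest())  # hash each part, then combine
    return h.hexdigest()

samples = [
    ("http://a.com", "pay now", b"img1"),
    ("http://a.com", "pay now", b"img1"),  # exact duplicate triplet
    ("http://a.com", "pay now", b"img2"),  # same URL/text, new screenshot
]
ids = {triplet_id(*s) for s in samples}
print(f"{len(ids)} unique of {len(samples)}")  # 2 unique of 3
```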
An important diversity measure often overlooked within anti-phishing datasets is linguistic composition. Diversifying languages is essential given the multilingual nature of phishing websites, which target more than just English-speaking users. Recent APWG phishing activity trend reports have shown significant increases in attacks across various countries, underscoring the necessity for datasets to encompass comprehensive language varieties. As shown in
Figure 6, we incorporated several widely spoken world languages in Phish360, including English, German, French, Spanish, and Portuguese. The Phish360 collection process ensured diversity by including 30 languages for phishing samples and 27 for legitimate ones. Employing the
langdetect 1.0.9 Python package for language detection on extracted plain text,
Figure 6 demonstrates the language distribution of Phish360 samples for both classes compared to benchmark datasets, illustrating the superior linguistic diversity of our dataset.
Upon examining the screenshot images within the four benchmark datasets, we found that, contrary to expectations of consistent dimensions across all screenshots, the resolutions vary:
PhishIntention: A total of 38% of screenshot image sizes are 1920 × 1080 pixels, 19% are 1366 × 768 pixels, and the remaining images vary in dimensions.
PWD2016: The image sizes are distributed as follows: 6% are 510 × 1330 pixels, 2% are 18 × 18 pixels, and the remaining images vary in size.
PILWD-134K: The majority of screenshot image sizes, constituting 68%, are 1906 × 922. A smaller proportion, 22%, are sized at 1853 × 922. The remaining images are distributed across 26 different dimensions.
VanNL126k: All screenshot images are 1280 × 768.
In contrast to benchmark datasets, 99.8% of screenshots in the Phish360 dataset have a consistent resolution of 1280 × 960 pixels. It is worth noting that we also identified screenshots of inactive websites, as well as entirely white or black images, within the benchmark datasets.
5. Methodology
This section details the CrossPhire framework for robust phishing detection. We first provide details of data preprocessing and feature encoding for each modality, followed by an introduction to the neural architecture.
Figure 7 overviews the end-to-end workflow. Next, we explain the evaluation strategy.
5.1. Data Preprocessing
In this subsection, we outline the preprocessing steps for our datasets, explaining how we extract raw data and obtain the required representations. We process the multimodal data in a column-oriented format using Pandas data frames for efficient storage and retrieval. Since handling all datasets in a single large data frame would be memory-intensive, we store each dataset in a separate data frame and add derived columns for subsequent analysis. The rationale behind using data frames is to avoid reprocessing data from scratch, enabling fast experimentation.
Using the Parquet format, we can load only the required columns into memory, optimizing performance when working with large datasets. This step is crucial for handling multimodal datasets with extensive textual observations.
We initially extracted plain text from the HTML source code using five Python parser libraries:
Trafilatura [
48],
BeautifulSoup [
49],
html2text [
50],
lxml [
51], and
html_text [
52]. Each library produced slightly different outputs due to variations in its text extraction process. Note that, in preliminary experiments not reported here, we found that the most discriminative text (i.e., the most successful parsing) was obtained with BeautifulSoup and Trafilatura; hence, the experiments in the remainder of this paper use these two parsers.
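As a rough, standard-library-only approximation of what such markup-free text extraction does (BeautifulSoup's get_text offers a more robust equivalent), consider:

```python
from html.parser import HTMLParser

# Stdlib approximation of markup-free text extraction. The study itself used
# BeautifulSoup and Trafilatura; this sketch mimics the basic behavior while
# skipping <script>/<style> contents.
class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

html = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Sign in</h1><p>Enter your password</p></body></html>")
print(extract_text(html))  # Sign in Enter your password
```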
Table 5 provides a detailed breakdown of each data frame column, showcasing a sample from Phish360 dataset and illustrating differences across the obtained text contents.
Our dataset includes three key data modalities: URLs, HTML source code, and web page screenshots. We ensured that each sample contains all three modalities and removed those with missing or corrupted files.
Table 6 summarizes the missing or invalid files in each dataset. The table includes details on missing files, files with naming issues, and files with invalid extensions or HTML encoding errors.
We processed all multimodal datasets in the same way and saved the resulting data frames into separate Parquet files. Notably, we stored each dataset’s data frame into two Parquet files, separating the samples classwise: (1) phishing and (2) legitimate.
5.2. Encoding the Modalities
In this subsection, we present the details of our feature encoders devoted to each modality.
5.2.1. URL
URL strings are a commonly employed source of information in the phishing detection literature, since they contain rich wording patterns that can be extracted through several NLP methods, ranging from conventional TF-IDF to recent transformer-based approaches. In this study, to obtain URL-based representations, we employ GramBeddings [
3]. In this algorithmic selection, we first benchmarked GramBeddings with other well-known URL-based models, namely URLNet [
53] and URLTran [
54]. We further assessed the models' generalization capability on unseen data through cross-dataset experiments, in which we trained each model on one dataset and tested it on another.
The GramBeddings model combines character-level (unigram) and specific n-gram features (4, 5, and 6) to create embeddings that capture contextual nuances in the URL. In the GramBeddings pipeline, URLs are tokenized, with each character encoded numerically. Character-level embeddings are padded or truncated to a fixed length of 128. For n-gram analysis, chi-square-based feature selection is applied to address the curse of dimensionality problem, and identical sub-networks are used to produce independent feature vectors. Each sub-network includes convolutional and BiLSTM layers, enhanced by an attention mechanism to highlight crucial features [
3]. The resultant 1024-dimensional embedding vector is further condensed to 256 dimensions, ready for integration with CrossPhire’s multimodal architecture.
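The front end of this pipeline can be sketched as follows; the byte-level vocabulary is a toy assumption, and the chi-square n-gram selection and the network layers themselves are omitted:

```python
# Illustrative sketch of the GramBeddings input pipeline described above:
# character-level numeric encoding padded/truncated to length 128, plus
# n-gram extraction for n in {4, 5, 6}. The vocabulary is a toy byte-level
# stand-in; the original model also applies chi-square n-gram selection.
MAX_LEN = 128

def encode_chars(url: str, max_len: int = MAX_LEN):
    ids = [min(ord(c), 255) for c in url][:max_len]  # toy byte-level vocab
    return ids + [0] * (max_len - len(ids))          # 0 = padding id

def ngrams(url: str, n: int):
    return [url[i:i + n] for i in range(len(url) - n + 1)]

url = "http://login.example.com/verify"
encoded = encode_chars(url)
print(len(encoded))        # 128
print(ngrams(url, 4)[:3])  # ['http', 'ttp:', 'tp:/']
```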
5.2.2. Screenshot
In CrossPhire, we utilize web page screenshots to capture visual cues that help distinguish phishing web pages from legitimate ones, as attackers often design phishing materials to mimic legitimate websites visually. For image classification and dense embedding generation, we employ ResNet [
55] and DenseNet [
56], two CNN architectures that have achieved remarkable results. ResNet is more oriented to detecting local features, whereas DenseNet considers both local and global relationships through dense skip connections. We selected ResNet50 and DenseNet121 for their balance of effectiveness and runtime complexity.
Residual Networks (ResNet): ResNet utilizes residual blocks with skip connections to address the vanishing gradient problem, allowing gradients to flow more effectively across layers [
55]. The fundamental building block is defined as
$$ y = \mathcal{F}(x, \{W_i\}) + x, $$
where $x$ and $y$ denote the input and output of the block, and $\mathcal{F}(x, \{W_i\})$ represents the residual mapping to be learned.
Densely Connected Networks (DenseNet): DenseNet enhances feature reuse and gradient flow by connecting each layer to every other layer within a dense block [
56]. The output of layer $\ell$ is given by
$$ x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]), $$
where $[x_0, x_1, \ldots, x_{\ell-1}]$ denotes the concatenation of the feature maps from the previous layers [56].
In our experiments, we evaluated both architectures to select the model that best enhances CrossPhire’s visual analysis capabilities.
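The two skip-connection styles can be contrasted in a few lines of numpy; F and H below are toy stand-ins for the learned mappings, not the actual ResNet/DenseNet layers:

```python
import numpy as np

# Minimal numpy illustration of the two skip-connection styles discussed
# above: ResNet ADDS the residual mapping to its input, while DenseNet
# CONCATENATES all previous feature maps before applying its layer.
rng = np.random.default_rng(0)

def residual_block(x, F):
    return F(x) + x  # ResNet: y = F(x) + x

def dense_layer(prev_features, H):
    # DenseNet: x_l = H_l([x_0, ..., x_{l-1}])
    return H(np.concatenate(prev_features, axis=-1))

x = rng.standard_normal(8)
F = lambda v: 0.1 * v                 # toy residual mapping
y = residual_block(x, F)
print(bool(np.allclose(y, 1.1 * x)))  # True

feats = [rng.standard_normal(4) for _ in range(3)]
H = lambda v: v.mean(keepdims=True)   # toy layer producing one feature
out = dense_layer(feats, H)
print(out.shape)                      # (1,)
```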
5.2.3. Text
While previous studies [
16,
57] have utilized HTML source code, they often overlooked the potential of using markup-free text, which leads to the loss of semantic relationships. We address this gap by extracting markup-free text from HTML using BeautifulSoup [
49] and Trafilatura [
48] to capture the malicious and misleading context that characterizes phishing attempts, as such content maintains contextual connections with documented attack patterns [
58].
To represent this text, we employ sentence transformer models that convert the extracted content into numerical embeddings capturing context and semantic meaning. Unlike traditional word embeddings, sentence transformers leverage transformer architectures and self-attention mechanisms to produce embeddings that preserve word order, context, and meaning within the entire paragraph [
59]. To the best of our knowledge, this is the first application of sentence transformers on markup-free text for phishing web page detection.
In this study, we leveraged the sentence transformer ‘all-mpnet-base-v2’ based on MPNet [
60], which combines masked language modeling (MLM) from BERT and permuted language modeling (PLM) from XLNet to inherit the advantages of both models. The training objective optimizes
$$ \mathbb{E}_{z \in \mathcal{Z}_n} \sum_{t=c+1}^{n} \log P\big(x_{z_t} \mid x_{z_{<t}}, M_{z_{>c}}; \theta\big), $$
where MPNet leverages the masked tokens $M_{z_{>c}}$ as inputs, conditioning on the preceding tokens $x_{z_{<t}}$ to model token interdependencies and positional information effectively [
60]. The sentence embedding is obtained through mean-pooling,
$$ e = \frac{1}{N} \sum_{i=1}^{N} h_i, $$
where $h_i$ represents the $i$-th token embedding after the transformer layers, producing a 768-dimensional vector that captures context and semantic meaning.
Given the fact that many legitimate and phishing web pages are often non-English, we made sure our approach is language-independent, supporting non-English text processing. Here we apply two methods: first, a multilingual ST model trained on 100 languages [
61] to generate comparable embeddings across languages, and second, translating non-English text to English before applying a monolingual ST model. This dual approach allows us to compare the representation quality of monolingual and multilingual language models through MPNet [
60] for English-only text processing and XLM-RoBERTa [
61] for multilingual contents respectively. Unlike the situation in our vision compartment having a trainable image classifier, we use the ST models as fixed encoders without any supervised fine-tuning. Both MPNet and XLM-R produce fixed-size 768-dimensional vectors for the input text covering up to 512 tokens.
5.3. Proposed Approach: CrossPhire
CrossPhire is a multimodal deep learning model designed for phishing detection, leveraging web page data in three forms: the raw URL, screenshot image, and extracted text content. Unlike traditional phishing detection methods, which often involve extensive manual feature engineering, CrossPhire performs automatic feature extraction on these raw data modalities, minimizing preprocessing efforts. Specifically, CrossPhire is unique in extracting markup-free textual content from HTML.
As previously stated, we hypothesize that carving and vectorizing the web page’s information-rich textual content can reveal distinguishing characteristics that effectively differentiate phishing behavior from legitimacy. In this regard, we employed Trafilatura and BeautifulSoup packages.
CrossPhire integrates three specialized sub-neural networks: (i) GramBeddings for URL analysis [
3], (ii) fine-tuned image models (ResNet and DenseNet) for visual feature extraction, and (iii) Sentence Transformer models for carved textual content encoding. Together, these modules fuse insights from each modality to capture both textual and visual patterns, enabling a binary classification into “phishing” (1) or “legitimate” (0). A detailed illustration of CrossPhire’s components is shown in
Figure 7.
The GramBeddings approach embeds the URL string into a 256-dimensional vector through Equations (5)–(7), in which each selected n-gram channel is encoded and its BiLSTM module produces the corresponding feature vector. Each of the four n-gram encoding channels is further equipped with a ZhangAttention layer, as given in Equations (10)–(12); each attention layer has its own weight and bias parameters and produces its own context vector. Finally, the overall URL embedding is generated by the concatenation given in Equation (13), followed by dense layers with non-linear activations, as presented in Equation (14).
Following the extraction of deep features from the URL, image, and textual content, each encoder produces an embedding vector for its respective data modality. Once the embeddings are generated, several dense hidden layers reduce the dimensions of these vectors. We then merge the reduced feature vectors (URL: 256-d, image: 512-d, text: 16-d) by concatenating them into a single 784-dimensional vector, as given in Equation (
15). Finally, we apply fully connected (FC) layers and a classification layer. Using the Adam optimizer, we jointly train the URL- and screenshot-related sub-networks, whereas the weights of the ST compartment are kept frozen.
The ST models are maintained in a frozen state (gradients not updated during training) following established best practice in sentence transformer applications [
59]. This design choice is theoretically motivated by the substantial scale mismatch between pre-training (1B+ sentence pairs) and our fine-tuning datasets (max 107K samples), which is a four-order-of-magnitude difference that typically leads to catastrophic forgetting where task-specific optimization degrades the general semantic capabilities that make pre-trained transformers valuable [
60,
62]. Moreover, phishing detection requires recognizing generalizable linguistic deception patterns (urgency cues, authority exploitation, credential solicitation) rather than dataset-specific lexical artifacts, making the preservation of broad semantic knowledge preferable to task-specific adaptation. As shown in the results, this architectural decision enables our model to achieve strong cross-dataset generalization while maintaining computational efficiency, as the frozen ST serves as a fixed semantic feature extractor while the URL encoder (GramBeddings) and vision encoder (ResNet/DenseNet) components are jointly trained to learn task-specific multimodal fusion patterns.
Here, the fused vector lies in $\mathbb{R}^{784}$, and the final classification is performed through Equations (16)–(18).
We used the binary cross-entropy loss to optimize the whole network in the joint-training procedure.
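The fusion step can be sketched in numpy as follows; the random vectors and the toy sigmoid head stand in for the trained encoders and FC layers:

```python
import numpy as np

# Sketch of the late-fusion step: the three modality embeddings (URL 256-d,
# image 512-d, text reduced to 16-d) are concatenated into one 784-d vector
# before the classification head. All vectors are random placeholders for
# the real encoder outputs.
rng = np.random.default_rng(7)
url_emb = rng.standard_normal(256)
img_emb = rng.standard_normal(512)
text_emb = rng.standard_normal(16)

fused = np.concatenate([url_emb, img_emb, text_emb])
print(fused.shape)  # (784,)

# Toy dense + sigmoid head standing in for the FC/classification layers.
W, b = rng.standard_normal(784), 0.0
p_phishing = 1.0 / (1.0 + np.exp(-(fused @ W + b)))
print(bool(0.0 <= p_phishing <= 1.0))  # True
```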
5.4. Evaluation Strategy
As previously stated, our scheme was evaluated on four benchmark datasets in addition to Phish360 and Phish360-Zeroday, which we collected ourselves. We first evaluate our network on each dataset individually (same-dataset assessment) to understand its effectiveness on each dataset.
To further assess CrossPhire’s generalization capabilities on previously unseen data, a series of cross-dataset evaluations were conducted as well. In this context, CrossPhire is trained on the training portion of one dataset and subsequently evaluated on the testing portion of another. This technique enables an evaluation of CrossPhire’s resilience to a range of phishing techniques.
As CrossPhire is constituted of discrete components, single-modality experiments are also conducted to evaluate the performance of each component in isolation. The objective of these experiments is to identify the optimal model for each modality, ensuring their optimal integration into CrossPhire. Furthermore, cross-dataset experiments are conducted for the single-modality models to assess their generalizability across different datasets. Once the optimal unimodal models have been identified, they are integrated into CrossPhire, and its performance results across all datasets are presented, including both same-dataset and cross-dataset evaluations.
Ultimately, after identifying the optimal configuration for CrossPhire, we evaluate its performance in comparison to established baseline methods that utilize the same benchmark datasets. Given that our solution is multimodal, we also assess its efficacy in comparison to a prominent state-of-the-art text-image model, CLIP [
63].
6. Experimental Setup and Results
In this section, we first introduce our experimental setup. We then provide an extensive set of assessments devoted to the sole use of each modality, followed by the assessment of CrossPhire utilizing all modalities in a joint training scheme. Afterwards, we compare our proposed scheme with other models in the literature. Finally, we assess our model on zero-day attacks and compare it with another multimodal approach, CLIP.
6.1. Experimental Setup
Our methodology was implemented in Python 3 on the Keras platform. To guarantee computational reproducibility and facilitate the reuse of CrossPhire, we made our codebase and datasets publicly accessible to the research community. During training, CrossPhire was configured with a batch size of 32, and the Adam optimizer was employed with an initial learning rate of 0.001. A cosine annealing schedule dynamically adjusted the learning rate throughout training, and the binary cross-entropy loss served as the training objective.
The number of training epochs varied depending on the specific experimental conditions. In instances where the training and testing data were derived from a single source (same-dataset evaluations), CrossPhire was trained for 20 epochs. In the case of cross-dataset evaluations, where the training and testing data were derived from disparate sources, the CrossPhire model was trained for 30 epochs. Another crucial element is the ratio of training to testing data. While there is no definitive rule for splitting datasets, the majority of anti-phishing research employs a train-test ratio of 80%–20% or 70%–30%, as mentioned by Adane and Beyene [
64]. During our evaluation experiments, we adopted an 80%–20% training-to-testing split, setting the
random_state to 42 to ensure consistent data splits. Throughout the experiments, we employed standard metrics, including accuracy, precision, recall, and F1 score.
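The seeded split and the four reported metrics can be sketched with the standard library; random.Random(42) below is a stdlib stand-in for the library-specific splitter configured with random_state=42:

```python
import random

# Sketch of the evaluation protocol: a reproducible 80%-20% split and the
# accuracy/precision/recall/F1 metrics reported in the experiments.
def split_80_20(samples, seed=42):
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

def metrics(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

train, test = split_80_20(list(range(100)))
print(len(train), len(test))                # 80 20
print(metrics([1, 1, 0, 0], [1, 0, 0, 1]))  # (0.5, 0.5, 0.5, 0.5)
```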
6.2. Assessment and Selection of Modality Encoders
In this phase, a series of experiments is conducted to select the best encoder for each compartment of CrossPhire through a comprehensive assessment. To validate the single-modal encoders, we employ benchmarking datasets and experiment with each model to assess its generalization capability. Based on the findings of our literature review, we identified various models that align well with our goal. It should be noted that we consider (i) effectiveness and (ii) efficiency at the same time, yielding a mobile-compatible model. Thus, we decided to evaluate (1) GramBeddings [
3], (2) URLNet [
53] and (3) URLTran [
54] as candidate architectures for URL-based classification component. For the context of visual classification, we chose pre-trained ResNet50 and DenseNet121 architectures. Similarly, we preferred pre-trained Sentence Transformers MPNet [
60] (English-only) and XLM-RoBERTa [
61] (multilingual) for content embedding generation. This setup also allows us to compare the efficacy of embeddings derived by the monolingual MPNet from English translations produced by a proprietary translation service (i.e., Google Translate) with those produced directly by the multilingual XLM-RoBERTa encoder across multiple languages.
6.2.1. URL Only-Based Assessment
As previously stated, phishing attackers continuously refine their tactics and develop new methods to bypass detection systems. Consequently, phishing URL samples from different years reflect a range of evolving tactics. To address this, we evaluated three different candidate URL models not only through the same-dataset mode but also by assessing their generalization performance in cross-dataset regime.
Table 7 presents a detailed comparison of the selected models, showcasing both same- and cross-dataset results. The same 80%–20% train-test split ratio was applied to ensure consistency across all experiments, with the seed value set to 42. The optimal performance for each experiment is highlighted in bold. The results demonstrate that GramBeddings exhibits superior performance compared to other URL models, achieving the highest test accuracy in eight out of 16 experiments. In light of these findings, the GramBeddings model has been integrated into CrossPhire for URL processing.
6.2.2. Screenshot Only-Based Assessment
As with the URL-based assessment, a series of experiments was conducted to select between the ResNet50 and DenseNet121 models. In these experiments, the models were fine-tuned on screenshot images from the benchmark datasets for phishing classification. The objective was to leverage pre-trained weights learned from approximately 14.2 million ImageNet images to enhance the models' phishing detection capabilities. To ensure a fair comparison, the train-test splits and seed values were kept consistent across all experiments. The models were then fine-tuned on the training sets of the benchmark datasets and evaluated on the respective test sets. However, cross-dataset experimentation was not feasible due to the substantial GPU hours required and the availability of only a single GPU.
Table 8 presents the results for both models across all datasets. In addition to the accuracy of the tests, the F1 scores are included, which integrate both precision and recall to provide a more comprehensive performance metric. While the performance of both models is nearly identical on certain datasets, DenseNet121 exhibits a slight advantage over ResNet50. Consequently, both models were included in CrossPhire’s evaluation.
6.2.3. Text Only Content Assessment
This section presents the results of unimodal experiments using the extracted textual content. Two distinct categories of extracted texts were subjected to experimentation: one derived from Trafilatura (TF) and another from BeautifulSoup (BS). In contrast with the preceding unimodal experiments that involved the fine-tuning of neural networks, we utilize conventional machine learning (ML) models for text classification. This is because we employ sentence transformers (pre-trained language models) by maintaining their pre-trained weights in a frozen state and obtaining the sentence or paragraph embedding of the given textual input. Therefore, we have selected Support Vector Machine (SVM), XGBoost, and CatBoost classifiers, as they are widely used and effective in text classification tasks. The models were run using their default hyperparameter values.
Given the existence of two distinct categories of extracted texts, four principal experiments were conducted: two for each text type (BS and TF) and two for the purpose of comparing the performance of monolingual (MPNet) and multilingual (XLM-R) sentence transformers. First, the original multilingual BS and TF texts were evaluated. As illustrated in
Figure 8, the BS text exhibited greater discriminative features, resulting in higher classification accuracy scores across all the ML models tested.
In light of the enhanced performance demonstrated by the use of BS text, we conducted a comparative analysis of the efficacy of monolingual and multilingual sentence transformers, employing both the original and English-translated versions of the BS text.
Table 9 compares the performance of multilingual and monolingual transformers using both the original and translated text. As evidenced in
Table 9, the results demonstrate that when the translated BS text is used, MPNet exhibits enhanced accuracy compared to the original multilingual text processed by the XLM-R model.
These findings underscore the significance of selecting an appropriate input text parser, language, and sentence transformer and their influence on overall performance. However, from our perspective, the most significant outcome of these findings is the considerable potential of token-limited (e.g., 512) texts in phishing classification, which we had previously hypothesized to be effective. The results clearly indicate that markup-free phishing and legitimate web page contents exhibit different semantics that can be mapped to separable manifolds in hyperdimensional space via sentence transformers. The second most significant finding is that BeautifulSoup extracts text fragments with greater discriminatory power than Trafilatura, which challenges our initial assumptions (see
Figure 8). Thirdly, the indispensable contribution of the monolingual MPNet prompts us to question the effectiveness of multilingual sentence transformers. The results presented in
Table 9 indicate that a model trained in a more resource-rich language yields better discriminatory performance.
6.3. Assessment of Multimodal CrossPhire
Following the initial experiments to justify the design considerations for each compartment, we continue to evaluate the multimodal performance of our approach using all datasets. We present the results obtained from the same-dataset experiments, where the model is trained and validated on the same dataset. Next, we examine the cross-dataset experiments, where CrossPhire is trained on one dataset and validated on others, to assess its generalization capability and robustness. Moreover, we also compared two techniques of modality integration: (1) vector concatenation and (2) mixture of experts.
6.3.1. Evaluation in the Same-Dataset Regime
In this phase, we introduce CrossPhire’s performance assessment in the
same dataset regime. Throughout the experiments, CrossPhire integrates GramBeddings for URL analysis, ResNet and DenseNet for screenshot interpretation, and the monolingual MPNet sentence transformer for encoding translated BeautifulSoup content. As demonstrated in
Table 10, CrossPhire exhibited remarkable performance, with accuracy scores ranging from 97.71% on the Phish360 dataset to a perfect 100% on the PWD2016 dataset. These findings clearly underscore the advantage of a multimodal approach over reliance on a single modality within the same-dataset regime. Excluding the Phish360 dataset, the choice between ResNet and DenseNet does not have a substantial impact on the overall performance metrics. It is also noteworthy that using the XLM-R model with multilingual content yielded slightly lower accuracy scores (mean drop: 0.38 percentage points, s.d.: 0.21). Due to space constraints, we do not report the results of multimodal modeling through XLM-R.
6.3.2. Evaluation in the Cross-Dataset Regime
In this stage, the robustness of CrossPhire is assessed in a more challenging environment through cross-dataset evaluations, in which the model is trained on a portion of one dataset and tested on a separate portion of a different dataset. The objective is to expose the neural architecture to data that it has not previously encountered, so as to assess its capacity for generalization. This evaluation scheme is a further contribution of the study, as none of the related works covered conducted experiments of a similar nature. During these evaluations, GramBeddings was utilized for URL embeddings, ResNet and DenseNet for screenshot-based feature extraction, and the MPNet sentence transformer to generate embeddings from translated BeautifulSoup text.
As demonstrated in
Table 11, CrossPhire frequently exhibits robustness across a range of experiments, underscoring its efficacy in detecting phishing web pages from a multitude of sources. It is noteworthy that training on Phish360, the smallest multimodal dataset comprising approximately 8600 training samples, yields impressive results. The CrossPhire model, when trained on the Phish360 dataset, demonstrates an accuracy of 94.66% on the PhishIntention dataset. This performance surpasses that of CrossPhire instances trained on larger datasets, such as PILWD-134K and VanNL126K. Furthermore, CrossPhire trained on Phish360 achieved over 90% accuracy on all datasets except PILWD-134K, whose test split contains approximately 26,800 samples, nearly three times the size of Phish360's training set.
Conversely, models trained on PWD2016 demonstrated a conspicuous inability to generalize, exhibiting accuracy scores as low as 40% when evaluated on other datasets. We hypothesize that this outcome can be attributed to three primary factors:
The incompatibility of screenshots;
The simplicity of the URLs in the PWD2016 dataset;
The distribution differences related to the collection period.
The model that was trained on PhishIntention achieved an accuracy of 99.69% when evaluated on PWD2016. In addition, the model that was trained on Phish360 demonstrated superior performance in comparison to the model that was trained on PhishIntention and evaluated on the PILWD-134K dataset. This model achieved an accuracy of 85.74%, which is the highest among models trained on disparate datasets. The largest dataset, PILWD-134K, demonstrated robust cross-dataset performance, with accuracy scores of 93.17% on VanNL126K, 96.27% on PhishIntention, 93.7% on PWD2016, and 93.38% on Phish360 for models trained on it.
6.3.3. Fusion of Modalities via Mixture of Experts
Apart from the joint-learning-based final feature vector concatenation, we tested the Mixture of Experts (MoE) model, an architecture that enables massive scaling of neural networks while maintaining computational efficiency through sparse activation patterns. First introduced by [
65], MoE architectures consist of multiple specialized sub-networks called “experts”, coordinated by a learned gating mechanism that routes inputs to the most relevant experts for each example. The gating mechanism is the architectural cornerstone, using softmax functions to compute probabilistic expert assignments based on input characteristics. Expert specialization emerges naturally through competitive learning, whereby different experts develop domain-specific knowledge for distinct input regions or task types. Moreover, the
expert choice routing [
66] revolutionized the field by inverting the traditional paradigm—experts select top-k tokens rather than tokens selecting experts, achieving perfect load balancing and 2x improvement in training convergence. Recent open-source models like Mixtral 8x7B [
67] democratized MoE technology, delivering 6x faster inference than comparable dense models while matching GPT-3.5 performance with only 12.9 billion active parameters from 46.7 billion total. MoE models are also capable of handling missing modalities. In our problem domain, it may not be possible to capture a screenshot or extract meaningful textual content. The MoE approach, thus, provides an inherent solution to the missing modality problem.
For this reason, in this study, we tested two fusion models: (1) vector concatenation and (2) MoE. In our MoE implementation, the model uses projection layers to map the three modalities (URL, image, and content) into a common expert dimension space. It also incorporates learnable default embeddings to handle missing modalities, which directly addresses robustness issues. Before expert routing occurs, cross-modal attention layers allow each modality to attend to the others, creating enhanced feature representations that capture intermodal dependencies. The architecture replaces conventional MLP experts with TransformerExpert networks, which contain multi-head self-attention and feed-forward blocks, allowing for more sophisticated feature processing within each expert. Instead of simple linear gating, our model uses an attention-based routing mechanism that computes expert selection through multi-head attention between a global representation and learnable expert embeddings. The result is a hierarchical attention design with multiple attention layers throughout the pipeline: cross-modal attention for feature enhancement, gating attention for expert routing, and final output attention for result refinement. We also included an optional load-balancing loss, while comprehensive dropout and layer normalization ensure stable training and encourage uniform expert utilization. During MoE model training, we found that using four MoE heads produced the best results. In summary, our MoE scheme is theoretically sound, combining the advantages of sparse expert computation, attention-driven multimodal fusion, and resilience to missing modalities.
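The core routing idea can be sketched in PyTorch as follows. This simplified snippet uses illustrative dimensions, plain MLP experts, and a linear gate, and omits the cross-modal attention, TransformerExpert blocks, and load-balancing loss of the full model; it shows only sparse top-k expert routing with learnable default embeddings for missing modalities:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalMoE(nn.Module):
    """Sparse MoE fusion over URL/image/content features (simplified sketch)."""
    def __init__(self, dims=None, d_model=64, n_experts=4, top_k=2):
        super().__init__()
        dims = dims or {"url": 128, "image": 256, "content": 768}  # illustrative sizes
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Learnable default embeddings substitute for missing modalities.
        self.defaults = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d_model)) for m in dims})
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(3 * d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_experts)])
        self.gate = nn.Linear(3 * d_model, n_experts)  # linear gate (sketch only)
        self.head = nn.Linear(d_model, 2)              # phishing vs. legitimate
        self.top_k = top_k

    def forward(self, inputs):
        batch = next(v.shape[0] for v in inputs.values() if v is not None)
        feats = []
        for m in self.proj:  # missing modality -> learned default embedding
            if inputs.get(m) is None:
                feats.append(self.defaults[m].expand(batch, -1))
            else:
                feats.append(self.proj[m](inputs[m]))
        x = torch.cat(feats, dim=-1)
        topv, topi = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(topv, dim=-1)  # sparse routing: only top-k experts fire
        out = x.new_zeros(batch, self.head.in_features)
        for b in range(batch):
            for slot in range(self.top_k):
                out[b] += weights[b, slot] * self.experts[topi[b, slot]](x[b])
        return self.head(out)

moe = MultimodalMoE()
batch = {"url": torch.randn(4, 128), "image": None, "content": torch.randn(4, 768)}
logits = moe(batch)  # works even with the screenshot modality missing
```

Because absent inputs are mapped to learned defaults rather than zeros, the gate can still route such samples to the experts best suited to the remaining modalities.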
We measured the performance of the MoE in both the same- and cross-dataset configurations. However, as the results given in
Table 12 and
Table 13 show, the MoE approach underperforms compared to the simple joint learning of concatenated embeddings. This outcome may be related to the fact that the concatenation operation provides full access to information, whereas the MoE creates information bottlenecks through its gated routing mechanisms. Further, the mathematical foundation explains this phenomenon: concatenation operates in the complete joint feature space, $z = [z_{\mathrm{URL}}; z_{\mathrm{img}}; z_{\mathrm{content}}]$, where all modalities contribute simultaneously to every prediction, while MoE's gated output, $\hat{z} = \sum_{k} g_k(x)\, E_k(x)$ with sparse gating weights $g_k$, only allows the selected experts to contribute, potentially missing crucial cross-modal dependencies essential for the model. The “unimodal bias” problem identified by [
68] shows that complex fusion architectures can rely too heavily on dominant modalities while ignoring the rest. However, this supports the advantage of concatenation in our problem domain, where all modality information must be preserved. Similarly, biomedical multimodal research [
69] confirms that early fusion strategies (concatenation) learn joint representations directly, without relying on marginal representations. This enables the model to capture complex interactions among modalities more effectively than late fusion approaches. Regarding our anti-phishing detection scheme that combines URL, image, and text modalities, concatenation preserves all three types of information simultaneously. However, MoE architectures risk losing critical cross-modal security indicators through selective expert activation, which explains our empirical findings.
6.3.4. Benefits of Multimodality
This sub-section addresses a critical question regarding multimodality:
Does integrating multiple information sources improve phishing detection performance? To investigate this, we compare the performance of CrossPhire with that of its unimodal compartments, as shown in
Figure 9: (a) GramBeddings for URLs, (b) ResNet50 for screenshots, and (c) MPNet for translated BS text contents.
At first glance, the unimodal models, the URL-only models in particular, seem to demonstrate satisfactory performance on most datasets. However, it is imperative to acknowledge that, except for Phish360, the datasets span a limited collection period. The exceptional performance of URL-only detection on these datasets (94.1–99.5%) suggests that these collections contain traditional or period-specific phishing attacks that exhibit distinguishable URL patterns. On the other hand, the dramatic performance drop of URL-based detection on Phish360 (82.7% vs. 99.0% on PhishIntention) illuminates a critical shift in phishing attack sophistication. Phish360's longer collection period likely captures more recent phishing campaigns, whose URLs may include (but are not limited to):
Legitimate hosting services (e.g., GitHub pages, cloud platforms) with clean URL structures;
URL shortening services that obfuscate the actual destination;
Domain-spoofing techniques that closely mimic legitimate URLs;
Compromised legitimate websites hosting phishing content.
This phenomenon—
dataset temporal bias—can also be observed in
Table 7 when cross-dataset performance scores are investigated. Nonetheless, when
Figure 9 is reinvestigated, it can be posited that enhanced accuracy in results can be achieved when other modalities, such as content, assume a more prominent role. From the perspective of machine learning, this could be indicative of a more gradual shift in the
language utilized by attackers. This is an expected outcome, since phishing web pages often mimic their legitimate counterparts, making the content less controllable by attackers than the URL. Nevertheless, neither the content nor the screenshot is immune to the perpetual evolution of phishing attacks.
From this point of view, though it incurs a greater computational burden, we conclude that multimodal anti-phishing mechanisms, when correctly implemented, are advantageous for the following reasons:
Reducing single-point-of-failure risk: When URL-based features become less discriminative (as in Phish360), visual and textual modalities maintain detection capability;
Capturing complementary attack vectors: Modern phishing often combines legitimate URLs with deceptive visual design and persuasive content;
Providing detection resilience: As attackers adapt to evade one modality, the system maintains effectiveness through alternative information channels.
Our experiments demonstrate that incorporating multiple data modalities enhances detection accuracy. This enhancement is most pronounced on Phish360, where the challenge of dataset temporal bias is most significant among the benchmarked datasets. The extant research provides substantial support for this finding, indicating that the combination of distinct data modalities (hybridisation) not only enhances overall detection accuracy but also yields more resilient anti-phishing solutions [
13,
27,
30,
70].
6.4. Comparative Study
In this subsection, we compare CrossPhire's performance with that of the studies presenting the aforementioned datasets. Although we initially intended to conduct a method-based comparison with these studies, this proved infeasible owing to the absence of a codebase for the respective works. We therefore proceeded with a dataset-based benchmarking strategy: CrossPhire was trained on the original form of the benchmarking datasets from the related studies, and the findings are reported accordingly.
As can be seen from
Table 14, CrossPhire outperforms the presented approaches in terms of accuracy, except for the method proposed by Van Dooremaal et al. [
37]. While their approach achieved a slightly higher accuracy of 99.66% compared to our 99.42%, it should be noted that they took 2000 samples from the dataset without disclosing their data selection criteria. As a result, their approach was validated on only 600 samples, whereas our method was trained and validated on the entire dataset, which consists of approximately 126,000 samples, as shown in
Table 14.
To enrich the comparative study, we also investigated the feasibility and effectiveness of the Contrastive Language-Image Pretraining (CLIP) model by fine-tuning it for phishing detection on all our benchmarking datasets. Originally developed by Radford et al. [
63], CLIP is a multimodal model that associates images with textual descriptions through joint training of a text and image encoder in a shared embedding space. Although not specifically trained on ImageNet, CLIP has demonstrated impressive zero-shot performance, contributing to its popularity among researchers.
To enhance the detection of phishing, we have refined CLIP by incorporating screenshots and translated BS texts from our datasets, augmented it with a two-layer MLP comprising 512 neurons each, and optimized it using an Adam optimizer with a 0.0001 learning rate over 15 epochs. As illustrated in
Table 15, CrossPhire demonstrated better performance compared to CLIP across all employed datasets, with the most pronounced accuracy difference observed on the Phish360 dataset. The accuracy achieved by CLIP was 96.22%, while CrossPhire reached 99.21%. These results demonstrate the robustness of CrossPhire, which consistently outperformed CLIP across datasets, thereby underscoring its effectiveness in phishing web page detection.
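The fine-tuning setup can be sketched as follows. CLIP feature extraction is indicated in comments, the concatenation of image and text features into the MLP head is our assumption about the wiring, and random tensors stand in for CLIP outputs so the snippet is self-contained:

```python
import torch
import torch.nn as nn

# In the actual experiment, features come from a CLIP backbone, e.g.:
#   image_feat = clip_model.encode_image(screenshots)     # (B, 512) for ViT-B/32
#   text_feat  = clip_model.encode_text(tokenized_bs)     # (B, 512)
# Random tensors stand in for those features here.
B = 8
image_feat, text_feat = torch.randn(B, 512), torch.randn(B, 512)

head = nn.Sequential(                 # two-layer MLP, 512 neurons each
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 2),                # phishing vs. legitimate
)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)  # lr = 0.0001 as in the text

# One illustrative training step (random labels stand in for ground truth).
logits = head(torch.cat([image_feat, text_feat], dim=-1))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (B,)))
loss.backward()
optimizer.step()
```

In the full experiment this step runs for 15 epochs over each benchmarking dataset, with screenshots and translated BS texts as the two CLIP inputs.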
To assess whether performance differences represent statistically significant improvements rather than random variation, we conducted two-proportion Z-tests. Importantly, our evaluation used the exact same test samples as the baseline methods, enabling direct comparison without confounding factors from different data splits.
Table 16 presents results with 95% confidence intervals.
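The statistical procedure can be sketched with standard-library Python; the sample counts below are illustrative only, not the paper's actual figures:

```python
from math import sqrt, erf

def two_proportion_ztest(correct_a, correct_b, n):
    """Pooled two-proportion Z-test for two classifiers scored on the same
    n test samples; returns (z, two-sided p-value)."""
    p_a, p_b = correct_a / n, correct_b / n
    p_pool = (correct_a + correct_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # Phi(|z|) via erf
    return z, p_value

def wald_ci95(correct, n):
    """95% Wald confidence interval for an accuracy estimate."""
    p = correct / n
    half = 1.96 * sqrt(p * (1 - p) / n)
    return p - half, p + half

# Illustrative counts (not the paper's actual figures):
z, p = two_proportion_ztest(correct_a=9771, correct_b=9531, n=10000)
lo, hi = wald_ci95(9771, 10000)
print(f"z = {z:.2f}, p = {p:.4g}, CI = [{lo:.4f}, {hi:.4f}]")
```

The pooled form is appropriate here because both systems are evaluated on identical test sets of equal size.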
Results reveal that among published multimodal baselines (
Table 14), only PWD2016 shows statistically significant improvement (
p < 0.001, 2.40% gain). The 0.14% improvement on PILWD-134K is not statistically significant (
p = 0.301), while PhishIntention and VanNL126k show no significant differences from baselines (
p > 0.05), with CrossPhire performing slightly worse. In contrast, comparisons with CLIP demonstrate statistically significant improvements across all five datasets (
p < 0.05), with particularly strong results on PILWD-134K (
p < 0.001, 1.64% gain) and VanNL126k (
p < 0.001, 1.44% gain).
These results indicate that, while our domain-specific architectural design provides meaningful advantages over general-purpose multimodal models such as CLIP, it does not consistently outperform specialised prior phishing detection methods when evaluated on the same dataset. Overall, our scheme achieves competitive or superior results compared with the other benchmarked approaches.
6.5. Robustness Against Zero-Day Attacks
The identification of phishing sites remains a critical cybersecurity challenge, particularly in the context of zero-day attacks that exploit previously unknown vulnerabilities. Conventional identification schemes that rely on blacklists or static features have proven ineffective against sophisticated phishing campaigns. These campaigns employ various evasion techniques, including (a) dynamic HTML content loading, (b) URL generation algorithms, and (c) compromised domains. For instance, attackers frequently make slight modifications to legitimate website addresses, a practice known as typosquatting, whereby similar-looking characters are used (e.g., replacing “o” with “0”). In a similar vein, implementing redirect chains through compromised legitimate web pages is another prevalent technique. Attackers' tactics also shift continually over time. These evolving tactics underscore the necessity for robust and adaptive detection systems that remain effective over time.
To assess the robustness of our approach against zero-day attacks, we initiated a comprehensive data collection effort in February 2025. In this context, ‘zero-day’ refers to the temporal gap between the training data (2020–2024 for Phish360) and the testing data (February 2025). This simulates real-world scenarios in which models encounter phishing campaigns with novel tactics that were not seen during training. The dataset was created by randomly sampling 3012 web pages in February 2025. Phishing samples were sourced from PhishTank’s verified submissions, while legitimate samples were randomly selected from active websites (including authentication pages to increase the difficulty level). To ensure uniqueness and novelty, we implemented a rigorous two-stage filtering process. First, we performed intra-dataset deduplication using the Duplicate Image Finder tool to identify exact screenshot matches, as well as conducting a domain-level analysis to eliminate URL duplicates within the 3012 samples. Second, we conducted cross-dataset novelty verification by comparing the remaining samples against the Phish360 dataset. This involved URL similarity matching, screenshot comparison and domain overlap analysis, with the aim of excluding any samples that were present in, or structurally similar to, the training data. This process reduced the collection to 1080 unique samples (540 legitimate and 540 phishing), which were then processed using our PhishBoring application to capture all three modalities (URL, HTML and a screenshot at a resolution of 1280 × 960).
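The intra-dataset deduplication stage can be illustrated with a simplified sketch; byte-level screenshot hashing and naive domain normalization stand in here for the Duplicate Image Finder tool and the full domain-level analysis:

```python
import hashlib
from urllib.parse import urlparse

def dedup(samples):
    """Keep one sample per exact screenshot and per registered domain.
    samples: list of (url, screenshot_bytes) pairs."""
    seen_hashes, seen_domains, unique = set(), set(), []
    for url, screenshot_bytes in samples:
        h = hashlib.sha256(screenshot_bytes).hexdigest()
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if h in seen_hashes or domain in seen_domains:
            continue  # exact screenshot duplicate or repeated domain
        seen_hashes.add(h)
        seen_domains.add(domain)
        unique.append((url, screenshot_bytes))
    return unique

samples = [
    ("https://example-bank.com/", b"pixels-A"),
    ("https://www.example-bank.com/verify", b"pixels-B"),  # same domain -> dropped
    ("https://another-site.net/", b"pixels-A"),            # same screenshot -> dropped
    ("https://unique-site.org/", b"pixels-C"),
]
print(len(dedup(samples)))  # → 2
```

The real pipeline additionally performs cross-dataset novelty checks against Phish360 (URL similarity, screenshot comparison, and domain overlap) before a sample is admitted.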
According to the results, CrossPhire trained on PILWD-134K performs on par with CLIP, with an accuracy of 88%. Similarly, our model trained on PhishIntention achieves an accuracy of 85.44%, with CLIP outperforming it in almost all epochs. Phish360, the most recent of the datasets, enables our model to be the most robust against the zero-day attack dataset, with a maximum accuracy of 94.51%. In contrast, our model trained on VanNL126K surprisingly lags behind the CLIP and ResNet50 models, achieving a maximum accuracy of 78.1%.
Inspection of the results led to a number of observations. First, as expected, the similarity between the distributions of training and test samples plays a crucial role in prediction performance. The collection period of the Phish360 dataset ended in 2024, whereas PILWD-134K and PhishIntention cover 2019 to 2020, and the samples in VanNL126K span a very short period, from September to December 2019. We can thus observe the dramatic effect of the historical gap between data samples—dataset temporal bias. Secondly, CrossPhire utilizes three sources of information, while CLIP benefits from two and ResNet50 uses only screenshots. It can be concluded that the predictive capability obtained by concatenating embeddings from three different modalities requires the continual introduction of new samples to keep the model robust. The well-known domain-shift problem manifests in this context, as attackers identify novel evasion techniques and exhibit emerging trends. Consequently, the enhanced predictive capability stemming from modality richness entails regularly introducing novel trends into the model, resulting in a trade-off between modality richness and update interval.
6.6. Handling Missing Modality Problem
The inherent vulnerability of concatenation-based multimodal architectures to missing or corrupted input modalities requires regularization techniques that promote robust feature learning across individual modality pathways. Recent literature has demonstrated that modality dropout serves as a regularization mechanism and a technique that enhances robustness. It prevents models from overreliance on specific modality combinations that may not generalize to real-world deployment scenarios [
71,
72]. Additionally, missing modality training has been shown to create ensemble-like internal effects, whereby models learn multiple expert pathways that can make informed decisions with different modality combinations [
73]. Given these theoretical foundations and the practical necessity of handling incomplete multimodal data in phishing detection scenarios, we implemented a structured modality dropout approach operating during both the training and validation phases to enhance model robustness and generalization capability.
As detailed in Algorithm 1, our modality dropout algorithm operates on three modalities: content, URL, and visual. To ensure independence between training and validation patterns, we used different random seeds (base seed for training, base seed + 1000 for validation) while maintaining reproducibility across experimental runs. Using a seeded random number generator (RNG), the algorithm pre-generates reproducible missing modality patterns, selecting a specified percentage that we call the
Missing Modality Sampling Ratio—MMSR (ranging from 0% to 50%) of training and/or testing samples. For instance, if MMSR is set to 50% for the test data, half of the test data gets affected. For each affected sample, the algorithm randomly selects one of the following six combinations: (1) keep screenshot + URL (drop content), (2) keep content + URL (drop screenshot), (3) keep content + screenshot (drop URL), (4) keep URL only (drop content + screenshot), (5) keep screenshot only (drop content + URL), and (6) keep content only (drop screenshot + URL). The masking is implemented by zeroing out the corresponding feature vectors at the input level, effectively simulating real-world scenarios where certain data sources may be corrupted, unavailable, or unreliable. Available modalities retain their original representations. This approach ensures consistent missing patterns across training epochs while maintaining the architectural integrity of the multimodal fusion layer.
| Algorithm 1 Modality Dropout Training and Validation |
1: Input: Dataset D = {x_i}, i = 1, …, N
2: Input: Missing rate r (MMSR), seed s
3: Input: Dataset type t ∈ {train, validation}
4: offset ← 0 if t = train, else 1000
5: Initialize RNG with s + offset
6: n_miss ← ⌊r · N⌋
7: M ← RandomChoice({1, …, N}, n_miss, replace = False)
8: Define modality combinations:
9: C ← {sshot_content, URL_content, URL_sshot, content_only, sshot_only, URL_only}
10: for i ← 1 to N do
11:   if i ∈ M then
12:     c ← RandomChoice(C)
13:     Apply masking based on combination c:
14:     if c drops URL then
15:       x_i^URL ← 0
16:     end if
17:     if c drops screenshot then
18:       x_i^sshot ← 0
19:     end if
20:     if c drops content then
21:       x_i^content ← 0
22:     end if
23:   end if
24: end for
25: Return: Modified dataset D with modality dropout applied
|
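A compact NumPy implementation of this masking step might look as follows (feature dimensions are illustrative; in CrossPhire the zeroing is applied to the per-modality feature vectors at the fusion input):

```python
import numpy as np

COMBOS = [  # the six keep/drop combinations: each set lists the modalities to DROP
    {"content"}, {"screenshot"}, {"url"},                                  # drop one
    {"content", "screenshot"}, {"content", "url"}, {"screenshot", "url"},  # drop two
]

def apply_modality_dropout(X, mmsr, base_seed, split="train"):
    """X: dict modality -> (N, d) feature arrays. Zeroes out modalities for
    an MMSR fraction of samples, with split-dependent seeding (base seed for
    training, base seed + 1000 for validation) as described in the text."""
    rng = np.random.default_rng(base_seed if split == "train" else base_seed + 1000)
    N = next(iter(X.values())).shape[0]
    affected = rng.choice(N, size=int(mmsr * N), replace=False)
    X = {m: v.copy() for m, v in X.items()}  # leave the caller's arrays intact
    for i in affected:
        for m in COMBOS[rng.integers(len(COMBOS))]:
            X[m][i] = 0.0  # mask by zeroing the feature vector
    return X

data = {"url": np.ones((100, 8)), "screenshot": np.ones((100, 16)), "content": np.ones((100, 32))}
masked = apply_modality_dropout(data, mmsr=0.5, base_seed=42, split="validation")
```

Because every combination keeps at least one modality, no sample is ever fully blanked, mirroring the real-world case where at least one data source is available.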
Table 17 introduces experiments with three different target datasets. Note that the first row of each group in
Table 17 shows the performance drop when the test data has missing values. As can be seen from
Table 17, our empirical validation reveals significant benefits from applying modality dropout in two different configurations. First, modality dropout applied to the training data yields remarkable recovery capabilities under missing modality conditions: these models maintain substantially higher performance than their baseline counterparts (first rows) across the three tested MMSR values (10%: light, 25%: moderate, and 50%: severe), and higher MMSR values during training yielded better generalization. Second, models trained with an MMSR of 10% or 25% consistently outperformed their original baselines even when all modalities were available, achieving an accuracy improvement of 0.36% (e.g., from 97.71% to 98.32%). This phenomenon aligns with the theoretical expectation that modality dropout serves as an effective regularization mechanism, preventing overfitting to spurious intermodal correlations while encouraging more generalizable feature representations within each modality pathway. The improvement suggests that forcing individual modalities to become more discriminative through periodic isolation results in stronger collaborative decision-making than training with all modalities always present. In effect, this creates an ensemble-like internal architecture in which multiple expert pathways contribute to the final prediction.
6.7. Explainability
In response to the urgent demand for transparent decision-making in cybersecurity applications, we have developed a thorough Local Interpretable Model-Agnostic Explanations (LIME) framework [
74] tailored for our multimodal anti-phishing neural network that fuses different information sources via concatenation. Our approach employs a hierarchical explanation method operating at two levels. First, we quantify the contribution of each input modality (URL, visual screenshot, and HTML content) by replacing modalities with neutral counterparts and measuring the resulting prediction variance. Second, we generate fine-grained explanations within modalities using tailored perturbation strategies. For URL analysis, we apply character-level perturbations through random alphanumeric substitution at 1–5 positions per URL to identify suspicious character patterns and domain components. For visual analysis, we use grid-based segmentation (16 × 16 spatial partitioning with 256 segments) to highlight image regions that contribute to phishing classification, such as deceptive login forms or fraudulent logos. Moreover, for textual content, we use token-level masking to identify phrases that are semantically important and trigger phishing detection. Each explanation method generates modality-specific perturbation sets: 120 perturbations for URL and content analysis, and 40 perturbations for image analysis to balance computational efficiency with explanation quality. Then, it fits a local linear surrogate model to approximate the complex neural network’s decision boundary and extracts feature importance coefficients.
Technically speaking, our LIME implementation employs a hierarchical perturbation-based explanation framework that operates through local linear approximation around individual instances. Given an input instance $x$ and a complex multimodal classifier $f$, we first quantify modality-level contributions by systematic ablation: $\Delta_m = f(x) - f(x_{\setminus m})$, where $x_{\setminus m}$ represents the instance with modality $m$ replaced by domain-neutral counterparts (neutral URL: “https://www.example.com/index.html”, neutral content: generic business text, neutral image: synthetic white background with generic web elements). For intra-modality explanations, we generate $N$ perturbations $\{x'_j\}_{j=1}^{N}$ around $x$ using modality-specific strategies: (1) URL perturbations via character-level random substitution, where $k \in \{1, \dots, 5\}$ positions are modified with alphanumeric replacements; (2) image perturbations through grid-based segmentation using $16 \times 16$ spatial partitioning (grid_size = 14 pixels, yielding ∼256 segments) with binary occlusion masks $m_j \in \{0, 1\}^{256}$; (3) content perturbations via LIME’s default token-level masking strategy. For each modality, we fit a local linear surrogate model $g$ using scikit-learn’s LinearRegression that minimizes the locality-weighted loss $\mathcal{L}(f, g, \pi_x) = \sum_j \pi_x(x'_j)\,\bigl(f(x'_j) - g(z'_j)\bigr)^2$, where $\pi_x$ represents LIME’s exponential locality kernel. The image prediction function returns class probability vectors $[p_{\text{legit}}, p_{\text{phish}}]$ to satisfy LIME’s binary classification requirements, while URL and content explanations utilize $N = 120$ perturbations each with feature selection limited to the most influential components, and image explanations use $N = 40$ perturbations for computational efficiency. The final explanation coefficient vector $w$ provides feature importance rankings, where positive weights indicate phishing-supportive features and negative weights indicate legitimacy-supportive features, enabling practitioners to identify specific visual regions (via grid segment masks), URL character positions (via perturbation vectors), and content phrases (via token importance scores) that drive classification decisions.
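To illustrate the URL branch of this framework, the following self-contained sketch perturbs URL characters, queries a toy stand-in classifier (the real explainer queries the multimodal network), and fits a locality-weighted linear surrogate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

ALPHANUM = "abcdefghijklmnopqrstuvwxyz0123456789"

def black_box(url):
    """Toy stand-in classifier: flags the zero-for-'o' homoglyph.
    The real explainer queries the multimodal network's URL pathway."""
    return 1.0 if "0" in url else 0.0

def explain_url(url, n_perturb=120, max_subst=5, seed=0):
    """Character-level LIME: perturb 1-5 positions, query the model,
    fit a locality-weighted linear surrogate over 'character kept' flags."""
    rng = np.random.default_rng(seed)
    Z, y, w = [], [], []
    for _ in range(n_perturb):
        chars = list(url)
        pos = rng.choice(len(url), size=rng.integers(1, max_subst + 1), replace=False)
        for p in pos:
            chars[p] = ALPHANUM[rng.integers(len(ALPHANUM))]
        kept = np.array([a == b for a, b in zip(url, chars)], dtype=float)
        Z.append(kept)
        y.append(black_box("".join(chars)))
        w.append(np.exp(-((len(url) - kept.sum()) ** 2) / 25.0))  # locality kernel
    surrogate = LinearRegression().fit(np.array(Z), np.array(y),
                                       sample_weight=np.array(w))
    return surrogate.coef_  # positive weight -> phishing-supportive character

url = "https://g00gle-login.example/"
coef = explain_url(url)
top = int(np.argmax(coef))
print(f"most phishing-supportive position: {top} ({url[top]!r})")
```

The image and content branches follow the same pattern, with occlusion masks and token masks, respectively, replacing the character substitutions.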
Based on the techniques mentioned above, we provided a GUI-based Python application shown in
Figure 10. This approach enables cybersecurity practitioners to understand the specific evidence within each input modality that drove the final classification, facilitating trust, validation, and actionable threat intelligence in real-world phishing detection scenarios.
6.8. Runtime Analysis
The comprehensive experiments above have shown the effectiveness of this approach. Moreover, excluding the duration of data preparation (i.e., BS content extraction and screenshot capture), CrossPhire runs in real time at 0.08 s per inference on a computer equipped with an NVIDIA RTX 3080 Ti mobile GPU with 16 GB of memory, a 12th-generation Intel i9 CPU, and 32 GB of system memory. Data preparation for the mentioned processes takes around 1.5 s in a Google Chrome browser engine.
To assess the engineering feasibility of CrossPhire for real-world deployment, we conducted comprehensive runtime profiling on an NVIDIA RTX 3080 Ti mobile GPU (16 GB VRAM) with Intel i9-12900H CPU and 32 GB system RAM.
Table 18 presents the component-wise latency analysis for single-sample inference. The results reveal that data preparation (1.5 s, 95% of total time) dominates the computational pipeline, while neural network inference requires only 80 ms (5%). Specifically, screenshot capture via Selenium WebDriver accounts for 75.9% of total latency, followed by HTML retrieval and BeautifulSoup parsing (19.0%). Among the neural components, ResNet50 vision encoding (45 ms) represents the most computationally intensive operation, whereas GramBeddings URL analysis (12 ms) and MPNet content encoding (18 ms) impose negligible overhead. This bottleneck is inherent to any multimodal phishing detection system requiring web page rendering and cannot be eliminated through model optimization alone. However, the neural inference component can be substantially accelerated through batch processing and quantization techniques.
For server-side deployment scenarios where URLs are processed asynchronously, batch processing significantly improves throughput.
Table 19 demonstrates the scalability characteristics of CrossPhire’s neural inference pipeline across varying batch sizes. With a batch size of 32, the system achieves 55.2 samples per second on a single GPU, translating to a theoretical capacity of 4.76 million classifications per day. GPU memory consumption scales linearly from 2.1 GB (batch = 1) to 12.4 GB (batch = 32), remaining well within the constraints of modern consumer-grade GPUs. The model’s memory footprint is 144 MB in FP32 format.
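The daily-capacity figure follows from simple arithmetic on the measured batch-32 throughput; a minimal sketch:

```python
# Sketch: back-of-the-envelope throughput projection for server-side batch
# inference, using the batch-32 throughput reported in Table 19.
SAMPLES_PER_SECOND = 55.2          # measured at batch size 32 on one GPU
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

daily_capacity = SAMPLES_PER_SECOND * SECONDS_PER_DAY
# 4,769,280 classifications/day, i.e. the ~4.76 million figure quoted above

per_batch_latency = 32 / SAMPLES_PER_SECOND
# ≈ 0.58 s to process one 32-sample batch
```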
Although our current implementation achieves real-time performance at 0.08 s per inference, this is a non-optimized baseline that can be improved significantly with modern acceleration techniques. TensorRT optimization with INT8 post-training quantization can reduce inference time to 0.02–0.05 s while maintaining 98–99% of the original model’s accuracy. Quantization would also shrink our 36M-parameter model from 144 MB to approximately 36 MB, making mobile deployment feasible on modern smartphones.
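The size reduction is pure bookkeeping on bytes per weight; a minimal sketch of the arithmetic behind the 144 MB → 36 MB claim (assuming all 36 M parameters are quantized):

```python
# Sketch: model-size arithmetic for post-training quantization. FP32 stores
# each weight in 4 bytes; INT8 stores each weight in 1 byte.
N_PARAMS = 36_000_000

def model_size_mb(n_params, bytes_per_weight):
    """Model size in (decimal) megabytes for a given weight precision."""
    return n_params * bytes_per_weight / 1_000_000

fp32_size = model_size_mb(N_PARAMS, 4)   # 144.0 MB, matching the FP32 footprint
int8_size = model_size_mb(N_PARAMS, 1)   # 36.0 MB after INT8 quantization
```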
A critical dependency in our current implementation is the Google Translate API for processing non-English content with the monolingual MPNet encoder. Translation requests introduce an additional 150–300 ms network latency and incur costs of approximately $20 per million characters ($0.50 per 1000 translated web pages). To mitigate this dependency, we offer two alternatives: (1) the multilingual XLM-RoBERTa encoder eliminates translation requirements at the cost of 2.7% average accuracy reduction and (2) language detection with selective translation processes only non-English content, reducing translation volume by approximately 62% based on our dataset language distribution.
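The selective-translation alternative amounts to a gate placed in front of the API call. The sketch below illustrates the idea; `looks_english` is a crude ASCII-ratio stand-in for a proper language-identification library (e.g., fastText or langdetect), and `translate` is a hypothetical wrapper around the translation API — both are our illustrative assumptions, not the paper’s implementation:

```python
# Sketch: selective translation — only pages detected as non-English are sent
# to the paid translation API, so English pages incur no latency or cost.

def looks_english(text: str, threshold: float = 0.9) -> bool:
    """Heuristic placeholder: treat mostly-ASCII-letter text as English."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True  # nothing to translate
    ascii_letters = sum(1 for c in letters if c.isascii())
    return ascii_letters / len(letters) >= threshold

def prepare_content(page_text: str, translate) -> str:
    """Translate only when needed; English pages skip the API entirely."""
    if looks_english(page_text):
        return page_text
    return translate(page_text)
```

In this scheme, the per-page translation latency and cost are paid only for the non-English fraction of the traffic.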
7. Discussion
While we have taken extensive measures to ensure sample quality and diversity (
Section 4.3), several methodological choices introduce potential biases that users of this dataset should consider. Our phishing samples were collected from PhishTank and OpenPhish, which are user-reported platforms that may overrepresent easily detected phishing campaigns. For legitimate samples, we used Alexa’s top 100 websites as seed URLs for random crawling, accessing over 50,000 web pages by following links from these initial seeds. While this seeding approach reduces direct bias toward only popular websites, it may still underrepresent certain categories of legitimate sites, particularly small business websites, regional services, and non-English content that phishing attacks often target. Additionally, our screenshot rendering configuration at 1280 × 960 resolution represents desktop-oriented phishing and may not fully capture mobile phishing attacks, which constitute an increasing proportion of real-world threats. The weekly collection schedule over 2020–2024 provides substantial temporal diversity but may introduce seasonal biases or miss short-lived phishing campaigns. These biases reflect inherent trade-offs in dataset construction between quality, diversity, and resource constraints. We argue that transparent documentation of these limitations, combined with our extensive comparative analysis demonstrating Phish360’s superior uniqueness and diversity metrics (
Section 4.3), enables informed usage while still representing a substantial improvement over existing benchmarks.
Our empirical evaluation reveals that simple embedding concatenation consistently outperforms mixture of experts (MoE) approaches across all experimental configurations (see
Table 12 and
Table 13). This finding can be attributed to the fundamental difference in information processing: concatenation preserves complete multimodal information in the joint feature space while MoE’s gated routing creates information bottlenecks through selective expert activation. In anti-phishing detection, where subtle cross-modal dependencies between URL, visual, and textual features are critical, the complete information preservation offered by concatenation proves more effective than the specialized but potentially incomplete representations generated by expert routing mechanisms.
One might argue that there is a lack of cross-dataset performance evaluations for pure vision models. While we acknowledge this limitation, we would argue that the absence of such evaluations does not undermine our core conclusions for several reasons. Firstly, both ResNet50 and DenseNet121 are pre-trained on ImageNet and have well-documented cross-domain generalization properties [
55,
56]. Our fine-tuning approach inherits these robust transfer learning capabilities, particularly with regard to extracting universal visual features (such as logos, form layouts, and UI components), which remain consistent across phishing datasets collected from different periods and sources. Secondly, as shown in
Table 8, both architectures achieve remarkably similar performance across all five temporally and geographically diverse datasets.
The standard training scheme we applied has some shortcomings, such as an inability to handle missing modalities. Owing to attackers’ intentions, some phishing web pages cannot be parsed to obtain useful semantic content; this often occurs when attackers replace text regions with images. We therefore applied the Modality Dropout technique to mitigate this problem and increase robustness. This training scheme yields empirical improvements of up to 7.6% in model robustness and provides some resilience against missing modalities. Nevertheless, the technique has its own limitations. Modality dropout assumes random missingness patterns during training, which may not reflect the systematic or adversarial nature of missing modalities in actual phishing attacks, where attackers deliberately obscure specific information types. In addition, the stochastic nature of modality dropout during training may lead to inconsistent convergence and requires longer training periods to achieve stable performance across all possible modality combinations.
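For illustration, a minimal pure-Python sketch of the modality dropout idea (the real pipeline applies this to tensors before concatenation; the `p=0.2` default and the at-least-one-modality guard are illustrative assumptions):

```python
# Sketch: Modality Dropout — during training, each modality's embedding is
# independently zeroed with probability p, so the fused classifier learns to
# cope with missing inputs (e.g., unparsable HTML content).
import random

def modality_dropout(embeddings, p=0.2, rng=random):
    """Zero out each modality's vector with probability p (training only).
    Guarantees at least one modality survives."""
    dropped = [rng.random() < p for _ in embeddings]
    if all(dropped):                      # never drop every modality at once
        dropped[rng.randrange(len(embeddings))] = False
    return [[0.0] * len(e) if d else list(e)
            for e, d in zip(embeddings, dropped)]

rng = random.Random(0)
url_e, vis_e, txt_e = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
augmented = modality_dropout([url_e, vis_e, txt_e], p=0.5, rng=rng)
```

At inference time the dropout is disabled; a genuinely missing modality is simply zero-filled, matching the patterns seen during training.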
The proposed system does not rely on third-party features such as domain checking. Although such additional features could be useful, we avoided them to achieve a real-time, self-contained, end-to-end neural architecture. Further, our initial idea was to rely purely on a multilingual sentence transformer such as XLM-RoBERTa for the sake of simplicity and speed. However, the slight loss of accuracy led us to integrate an online translator API such as Google Translate, making the use of a third-party service indispensable. This decline highlights the main problem of multilingual sentence transformers: the lack of training data in non-English languages. Although much of today’s software participates in a Software-as-a-Service (SaaS) ecosystem, relying on a third-party system inevitably causes short delays and makes our solution vulnerable to API failures. We therefore believe that CrossPhire can be equipped with a better offline multilingual sentence transformer in the future. At this point, one may question the use of current open- and closed-source state-of-the-art large language models (LLMs), such as Anthropic’s Claude, OpenAI’s GPT, or Meta’s Llama models. However, these models require high-end GPUs or incur token-based costs, which add up when analyzing millions of suspicious web pages. Conversely, our objective is to offer a cost-effective (and potentially cost-free) approach that is replicable and can be utilized by researchers, students, and industry.
In this study, the superior performance of vector concatenation over the mixture of experts can be attributed to the unique characteristics of the anti-phishing problem domain. The three modalities employed in this work exhibit high complementarity rather than redundancy: each provides distinct, non-overlapping information that is crucial for accurate classification. URL features reveal technical deception indicators, such as suspicious domains and redirects; visual features capture design mimicry and visual social engineering tactics; and textual content exposes linguistic manipulation and semantic deception strategies. Effective phishing detection requires simultaneous access to all three modalities to identify subtle cross-modal dependencies. For example, legitimate-looking visual designs might be paired with suspicious URL patterns, or trustworthy textual content might mask underlying technical deception. Concatenation-based joint learning preserves complete information flow across all three modalities, enabling the model to learn these critical cross-modal interactions. In contrast, the expert routing mechanism of MoE may inadvertently create information bottlenecks by specializing in individual modalities, causing it to miss essential inter-modal security indicators. In security domains where multiple information sources are highly complementary and each contributes unique evidence for threat detection, these findings suggest that architectural simplicity through complete information preservation can be more effective than complex routing mechanisms designed for scenarios with redundant or competing modalities.
Although CrossPhire achieves high accuracy on the same dataset (97.96–100%), we caution against interpreting these results as evidence that the task is trivial. Our cross-dataset experiments (see
Table 11) reveal a 30–51% drop in accuracy under distribution shift, and our zero-day evaluations (
Section 6.5) demonstrate a 10–16% degradation, showing that there is a genuine challenge in realistic deployment scenarios. The high same-dataset scores partially reflect the dataset-specific biases that we documented in
Section 4.3, such as URL length artifacts in PWD2016, temporal homogeneity in datasets with a short collection window, and HTML duplication enabling memorization. We emphasize that cross-dataset and zero-day performance metrics are more reliable indicators of real-world effectiveness than same-dataset scores. The strength of the multimodal architecture lies not in solving an easy problem, but in maintaining robustness when individual modalities become unreliable, which is a critical requirement for adversarial domains such as phishing detection, where attackers continuously adapt.
It is well known that phishers evolve their techniques over time, resulting in zero-day attacks. There are many ML-based anti-phishing studies in the literature, using various features. As reported by Ariyadasa et al. [
75], the main shortcoming of these methods is their inability to remain robust against these new techniques. In this study, we also aimed to address this problem by cross-firing three modalities. Evaluation on the zero-day attack dataset reveals that CrossPhire achieves moderate to good performance. Nevertheless, we must acknowledge that continual data updates remain imperative; the performance scores obtained with the latest Phish360 and the largest PILWD-134K datasets clearly demonstrate this phenomenon. As a solution, Ariyadasa et al. [
75] proposed integrating deep learning with reinforcement learning to keep the system on the safe side. However, their model still requires constant updates from known resources such as PhishTank. Ideally, the ultimate goal of the anti-phishing community is to devise a mechanism that remains robust even after several years without retraining. In this regard, we posit that self-supervised learning has the potential to contribute to this problem domain in the future. For instance, vision transformers trained as
Masked Siamese Networks acquire useful representations through contrastive learning, applying controllable levels of perturbation to images and thereby obviating the need for reconstruction errors. We contend that, if adapted properly to URLs, textual content, or images, this concept could help emulate adversarial attacks and future trends, reducing the dependence on the most recent data to some extent.
8. Conclusions
In this study, a new neural network was designed and implemented to provide an efficient and effective method for classifying phishing and legitimate web pages. The method draws on three modalities: URLs, textual content, and web page screenshots. In our evaluations, the proposed model demonstrated superior performance compared with prior studies, and extensive experimentation on multiple datasets confirmed the efficacy of the multimodal perspective in identifying phishing websites. Additionally, the semantics of the primary content of web pages was found to be discriminative within the problem domain. The experiments conducted on zero-day attack datasets underscore the importance of diversity and the incorporation of contemporary trends when training machine learning models, thereby ensuring their resilience. Although the proposed model produces advantageous prediction results in zero-day benchmarks, we hypothesize that incorporating self-supervised methods involving contrastive learning, together with minor perturbations in both text and image space, has the potential to enhance the model’s generalizability.