Article

CrossPhire: Benefiting Multimodality for Robust Phishing Web Page Identification

by Ahmad Hani Abdalla Almakhamreh 1,† and Ahmet Selman Bozkir 2,*,†
1 Institute of Graduate School, Hacettepe University, Ankara 06800, Turkey
2 Department of Computer Engineering, Hacettepe University, Ankara 06800, Turkey
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2026, 16(2), 751; https://doi.org/10.3390/app16020751
Submission received: 27 November 2025 / Revised: 4 January 2026 / Accepted: 9 January 2026 / Published: 11 January 2026
(This article belongs to the Special Issue AI-Driven Image and Signal Processing)

Abstract

Phishing attacks continue to evolve and exploit fundamental human impulses, such as trust and the need for a rapid response, as well as emotional triggers. This makes the human mind both a valuable asset and a significant vulnerability. The proliferation of zero-day vulnerabilities has been identified as a significant exacerbating factor in this threat landscape. To address these evolving challenges, we introduce CrossPhire: a multimodal deep learning framework with an end-to-end architecture that captures semantic and visual cues from multiple data modalities, while also providing methodological insights for anti-phishing multimodal learning. First, we demonstrate that markup-free semantic text encoding captures linguistic deception patterns more effectively than DOM-based approaches, achieving 96–97% accuracy using textual content alone and providing the strongest single-modality signal through sentence transformers applied to HTML text stripped of structural markup. Second, through controlled comparison of fusion strategies, we show that simple concatenation outperforms a sophisticated Mixture-of-Experts gating mechanism by 0.5–10% when modalities provide complementary, non-redundant security evidence. We validate these insights through rigorous experimentation on five datasets, achieving competitive same-dataset performance (97.96–100%) while demonstrating promising cross-dataset generalization (85–96% accuracy under distribution shift). Additionally, we contribute Phish360, a rigorously curated multimodal benchmark with 10,748 samples addressing quality issues in existing datasets (96.63% unique phishing HTML vs. 16–61% in prior benchmarks), and provide LIME-based explainability tools that decompose predictions into modality-specific contributions. The rapid inference time (0.08 s) and high accuracy results position CrossPhire as a promising solution in the fight against phishing attacks.

1. Introduction

While cybersecurity tools grow increasingly sophisticated, phishing attacks persist and evolve by exploiting our fundamental human impulses to trust, react quickly to authority, and respond to emotional triggers, making the human mind both our greatest asset and most persistent vulnerability. These attacks exploit various entities, such as malicious URLs (Uniform Resource Locators), deceptive screenshots, and misleading HTML (Hypertext Markup Language) text, to bypass traditional security measures and gain unauthorized access to personal and organizational data. These malicious activities, furthermore, have evolved in sophistication, adapting to conventional detection methods and necessitating more robust countermeasures. According to the Anti-Phishing Working Group’s 4th Quarterly Phishing Activity Report 2024, 989,123 phishing attacks were observed, up from 877,536 in the second quarter [1]. The report highlights several key trends, such as (i) attackers now sending Google Street View photos of victims’ homes in targeted emails, and (ii) an alarming new trend in extortion emails that include the recipient’s phone number and home address as part of the lure [1], underscoring the level of sophistication and danger of modern phishing.
While single-modality approaches, focusing on specific attack vectors such as URLs [2,3], web page screenshots [4], and HTML content [5,6,7], have shown promise, they often fall short in addressing the multifaceted nature of modern web page phishing techniques. The limitations of unimodal detection strategies have become increasingly apparent as cybercriminals employ more complex and diverse tactics to evade detection. For instance, URL-based methods may struggle with compromised domains or typosquatting. Similarly, screenshot analysis alone may fail to capture highly diverse layouts, whereas HTML content analysis may miss clues that could indicate malicious intent. In a similar vein, logo-focused approaches have to utilize domain-checking mechanisms since the existence of brand logos does not always imply malicious intent. Nevertheless, the foremost goal and greatest challenge of contemporary anti-phishing systems today is recognizing zero-day attacks, in which attackers exploit clever, previously unseen tactics that machine learning (ML) models have not encountered before.
According to Parcalabescu et al. [8], a machine learning task is considered multimodal when the inputs or outputs are represented in different ways or consist of distinct types of fundamental units of information. In response to the above-mentioned challenges, multimodal approaches have emerged as a promising avenue for enhancing phishing detection in terms of accuracy and robustness. By integrating diverse data sources (e.g., URL, logo, screenshot, HTML content, favicon, etc.) and leveraging the complementary strengths of various representations, multimodal methods offer great potential for a more comprehensive and enriched understanding of phishing threats. This holistic approach aims to detect subtle patterns and correlations that may be imperceptible when examining a single modality in isolation. At this point, Lee et al. [9] note that the synergistic potential of combining URL analysis, visual inspection, and structural content evaluation presents an opportunity to improve the effectiveness of anti-phishing systems. However, it is important to realize that multimodal approaches are imperfect due to several technical challenges (e.g., the high number of features and the resulting computational burden). For example, the method followed by Wang et al. [10] requires collecting multiple types of features (domain registration, content analysis, behavioral patterns) that may not always be readily available or reliable, potentially limiting its practical applicability across diverse web environments.
Existing multimodal phishing detection approaches suffer from three fundamental limitations that constrain their practical deployment and generalization capability. First, methods such as Jail-Phish [11], SenseInput [12], and the approach by Yu et al. [13] rely heavily on handcrafted features extracted from HTML structure, CSS properties, or visual elements, requiring extensive domain expertise and becoming brittle as attackers adapt their techniques. Second, several state-of-the-art systems depend on third-party services. For instance, Jail-Phish [11] queries Google Search for domain verification, while recent LLM-based approaches such as that of Lee et al. [9] rely entirely on commercial APIs without custom model development, introducing latency, cost, and availability dependencies unsuitable for high-throughput security applications. Third, most existing multimodal methods [11,12,14,15] employ multi-stage pipelines where modalities are processed separately and combined through late fusion or model stacking, preventing the learning of joint cross-modal representations that could capture subtle interdependencies between URL patterns, visual deception, and semantic manipulation.
CrossPhire fundamentally differs from prior work through the following methodological innovations. First, we introduce semantic-aware content extraction using markup-free text parsed from HTML and encoded with sentence transformers (MPNet), eliminating reliance on DOM structure analysis while capturing linguistic deception patterns. Unlike HTML-based approaches [16,17,18] that process structural markup or methods that require English language content [6,19], our approach is language-independent through multilingual embeddings or efficient translation preprocessing. To the best of our knowledge, this is the first application of sentence transformers to markup-free web page text for phishing detection. Second, we propose an end-to-end joint training architecture where URL (via GramBeddings), visual (via fine-tuned CNNs), and semantic (via frozen sentence transformers) encoders are simultaneously optimized through a unified embedding space, enabling the model to learn cross-modal feature interactions rather than treating modalities as independent information sources. This contrasts with prior multi-stage approaches [11,12] and ensemble methods [15] that combine separately-trained models. Our main contributions are:
  • A novel end-to-end multimodal architecture that jointly learns discriminative representations from URL syntax, visual appearance, and semantic content through unified embedding optimization, eliminating the multi-stage pipelines and handcrafted features that characterize prior multimodal approaches [11,12,13].
  • The first application of sentence transformers to markup-free HTML text for phishing detection, enabling language-independent semantic analysis without reliance on DOM structure or third-party translation services during inference.
  • Two novel multimodal datasets: Phish360 (10,748 samples spanning 2020–2024 with rigorous duplicate elimination and diversity validation) and Phish360-Zeroday (1080 samples from February 2025), addressing the temporal bias and quality issues we identified in existing benchmarks through comprehensive dataset analysis.
  • Rigorous cross-dataset evaluation demonstrating generalization capability across five datasets and temporal robustness on zero-day attacks, including comparative analysis of two fusion strategies (concatenation vs. mixture of experts) and modality dropout training for missing modality resilience.
  • A LIME-based explainability framework providing hierarchical explanations at both modality-level (quantifying URL/visual/semantic contributions) and token-level (highlighting specific characters, image regions, and phrases driving classification).
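One of the contributions above is modality dropout training for missing-modality resilience. As a rough illustration of the general technique (a generic sketch, not the authors' implementation; the function name, drop probability, and embedding sizes are ours):

```python
import random

def modality_dropout(url_emb, vis_emb, txt_emb, p=0.2, rng=None):
    """Randomly zero out whole modality embeddings during training so the
    fused classifier learns to tolerate missing modalities at inference."""
    rng = rng or random.Random()
    embs = [list(url_emb), list(vis_emb), list(txt_emb)]
    # Independently drop each modality with probability p.
    dropped = [[0.0] * len(e) if rng.random() < p else e for e in embs]
    # Never drop everything: keep at least one modality informative.
    if all(all(x == 0.0 for x in e) for e in dropped):
        keep = rng.randrange(3)
        dropped[keep] = embs[keep]
    return dropped
```

During training this is applied per sample, forcing the fused representation to remain discriminative even when, say, a screenshot cannot be rendered at inference time.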
The rest of the paper is organized as follows: Section 2 introduces relevant related studies, whereas Section 3 explains our motivation. Next, Section 4 presents the employed datasets. Section 5 explains the details of the proposed scheme. Similarly, Section 6 presents the findings and results. Section 7 discusses the pros and cons of the proposed approach, while Section 8 concludes the study.

2. Related Work

Phishing detection methods have evolved significantly, with various approaches classified into several distinct categories. Traditional research often groups these methods into (i) list-based, (ii) similarity-based, and (iii) machine learning-based approaches. However, we categorize the existing literature into five main categories based on the data modality used: (1) URL-based, (2) content-based, (3) vision-based, (4) bimodal, and (5) multimodal approaches. In this section, we review state-of-the-art anti-phishing approaches from recent years according to these categories. We summarize key studies, highlighting their main findings, datasets used, and known limitations. This structured overview aims to provide a clear understanding of the strengths and weaknesses of current phishing detection methods, setting the stage for our proposed approach.

2.1. URL-Based Phishing Detection

Early URL-based methods relied on blacklists, which were difficult to maintain due to the need for constant updates. Recent studies have shifted toward machine learning, extracting lexical and statistical features from URLs [20,21], such as length [21], dots, or hyphens [20]. Advanced techniques include automatic feature extraction using NLP methods like TF-IDF (term frequency-inverse document frequency) [22,23,24], N-gram features, and embeddings [3,25,26] for more accurate detection.
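As an illustration, the kinds of lexical URL features cited above can be computed in a few lines (the feature set below is a generic example for exposition, not the exact set used by any cited study):

```python
from urllib.parse import urlparse

def lexical_url_features(url: str) -> dict:
    """Compute simple lexical features of the kind used by classical
    URL-based phishing classifiers (lengths, separator counts, scheme)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "has_at_symbol": "@" in url,
        "subdomain_depth": max(host.count(".") - 1, 0),
        "uses_https": parsed.scheme == "https",
    }

feats = lexical_url_features("http://paypal-secure.example.com/login?id=123")
```

Such feature vectors are then fed to classifiers like Random Forests; the hyphenated brand-lookalike host above is exactly the pattern these features aim to expose.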
In 2019, Sahingoz et al. [2] collected and published the EBBU2017 dataset with 73,000 URL samples and extracted 40 NLP-based features from URLs, achieving 97.98% accuracy using a Random Forest classifier. Rao et al. [22] later developed the CatchPhish dataset, combining handcrafted and TF-IDF features to reach 96.67% accuracy with a similar model, validated across two benchmark datasets with accuracies up to 98.57%. Recently, Haynes et al. [27] fine-tuned pre-trained transformers (BERT and ELECTRA) on URL data, achieving 96.3% accuracy, showing that transformers outperform traditional models with less training time.
Shirazi and Hayne [28] introduced MobileBERT for phishing detection on mobile devices, achieving runtime performance three times faster than its BERT-base counterpart while exceeding 97% accuracy. Jishnu and Arthi [25] combined BERT embeddings with handcrafted features, achieving 97.32% accuracy on 200,000 URLs. In another study, Jishnu and Arthi [26] used RoBERTa for feature extraction and LSTM for classification, reporting an accuracy of 97.14% on 300,000 URLs. Both studies highlighted strong performance using transformer-based models for phishing detection.
In a different vein, Bozkir et al. [3] introduced GramBeddings, a deep learning model for phishing detection using URL character-level n-grams. They collected a balanced dataset involving 800K phishing and legitimate URLs to address the scarcity of large-scale public URL datasets. Their model, combining CNN, BiLSTM, and attention layers, achieved an accuracy of 98.27% on their dataset and outperformed other methods on seven public benchmarks with accuracies of at least 98.32%. URL-based schemes offer much faster processing times; they are, however, vulnerable to zero-day attacks, mainly due to a lack of prior knowledge, rapid evolution, and, more importantly, limited context [29].
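The character-level n-gram input representation used by models such as GramBeddings can be sketched as follows (a simplified illustration of the input units only; the actual model embeds these n-grams and processes them with CNN, BiLSTM, and attention layers):

```python
def char_ngrams(url: str, n_values=(3, 4, 5)):
    """Enumerate character-level n-grams of a URL, the basic units a
    model like GramBeddings maps to learned embedding vectors."""
    grams = []
    for n in n_values:
        grams.extend(url[i:i + n] for i in range(len(url) - n + 1))
    return grams

grams = char_ngrams("login.example.com", n_values=(3,))
```

Because n-grams capture substrings rather than whole tokens, they remain informative for obfuscated or typosquatted hostnames that word-level features would miss.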

2.2. Content-Based Phishing Detection

Content-based phishing detection involves analyzing websites’ textual content and HTML structure to detect phishing intentions or cues. Early approaches heavily relied on handcrafted features, which were partially effective but prone to obsolescence due to the dynamic nature of phishing attacks.
Among conventional works, CANTINA, introduced by Zhang et al. [19], uses the top five words from TF-IDF values for reverse search, classifying websites as legitimate if the domain appears in the first N results, achieving a 97% true positive rate. Despite its success, CANTINA has limitations, including a reliance on the English language and high false positive rates. To address these issues, Xiang et al. later proposed CANTINA+ [6], which employs 15 features from the HTML Document Object Model (DOM) and third-party services. Similarly, Hou et al. [5] developed a method for malicious content detection using dynamic HTML, extracting 17 features and reporting an accuracy of 96.14% through boosted decision trees. Nevertheless, hand-crafted features are prone to being bypassed by attackers, are expensive to extract, and require expert knowledge. Furthermore, as Opara et al. [16] stated, they struggle to keep pace with the evolving nature of phishing attacks and fail to capture semantic patterns in textual content [30].
To overcome these limitations, researchers are shifting towards automatic feature extraction. Towards this direction, Opara et al. [16] introduced HTMLPhish, which employs convolutional neural networks to extract embeddings from HTML content. They achieved a testing accuracy of 93% with a dataset of 25,000 samples. HTMLPhish, nonetheless, struggles with capturing semantic relationships and relies heavily on training data.
Ouyang et al. [17] followed a different path by employing a graph neural network (GNN) approach, representing DOM tags as nodes and edges, achieving an accuracy of 95.5% on a large dataset. However, this method can be bypassed by cloning legitimate HTML structures. In a different vein, Benavides-Astudillo et al. [31] used GloVe [32] embeddings to capture semantic features in HTML. They achieved a mean accuracy of 97.39% with a Bidirectional Gated Recurrent Unit (BiGRU) on an imbalanced dataset, though it was limited by a small validation sample size. Likewise, Çolhak et al. [18] proposed an approach based on fusing CANINE and RoBERTa embeddings of the HTML content. The authors used both textual and numerical features extracted from the HTML content. They reached an accuracy of 97.18% on their dataset and 89.58% on a benchmark dataset using a multilayer perceptron. However, the handcrafted features require expert domain knowledge and can be bypassed by attackers.

2.3. Vision-Based Phishing Detection

Vision-based methods in anti-phishing studies are motivated by the fact that phishing web pages mimic their legitimate counterparts in holistic appearance or specific visual elements, such as logos, layouts, favicons, and color schemes, to deceive users effectively. These schemes often use image classification, object detection, or similar vision models to detect any possible visual similarity between screenshots or smaller components of a phishing material and its legitimate counterpart.
Phoka and Suthaphan [33] developed a phishing detection method using pre-trained CNNs on login page images for five brands. They implemented data augmentation through sub-image placement and achieved accuracy of 97.1% by applying the Inception-ResNet-v1 model. From the perspective of logo similarity, Bozkir et al. [34] introduced LogoSENSE, which employs max-margin object detection (MMOD) for logo detection and extracts features using histogram of oriented gradients (HOG). Evaluated on 3060 training and 1979 testing samples, it achieved a precision of 93.50% and 85.02% F1-score. Addressing the unexplainable classification results of phishing detection systems, Lin et al. [35] proposed Phishpedia, a two-step deep learning approach that detects logos and matches them with legitimate references. Using Faster R-CNN and a Siamese model with ResNet, Phishpedia obtained 89.2% precision, 87.1% recall, and 99.2% phishing identification rate on a dataset of 30,649 samples.
Wang et al. [36] introduced a vision-based approach for recognizing phishing web pages using screenshots. Their deep learning-based approach combines local and global features by first locating and extracting the logo and then integrating it with the full screenshot. The authors evaluated their approach on two public datasets, PhishPedia and VisualPhish, achieving 95% and 85% accuracy, respectively.

2.4. Bimodal Approaches

In contrast to the single-modal schemes presented above, bimodal anti-phishing approaches combine two data modalities, typically the two textual modalities (URL and HTML content) or a visual and a textual one, to improve robustness in phishing detection.
Van Dooremaal et al. [37] proposed a phishing detection method using HTML and images. They obtained an accuracy of 99.66% by combining visual and textual features, applying reverse image search for brand identification, and logistic regression for classification. Sánchez-Paniagua et al. [38] introduced the PILWD-134K dataset and used 54 handcrafted features to reach an accuracy of 97.95% using LightGBM, though the approach requires significant manual effort. Liu et al. [39] developed PhishIntention, which identifies phishing intent using deep learning models by leveraging visual and HTML content, achieving an accuracy of 95% for credential-requiring page (CRP) detection and 93.3% for CRP transitions.
Vo Quang et al. [40] proposed Shark-Eyes, a deep learning approach that utilizes the URL and the HTML DOM structure. The authors report an accuracy of 95.35% on a self-collected dataset and 92.55% on adversarial phishing samples. While Shark-Eyes demonstrates attention-based fusion of URL and DOM features, its reliance on DOM structure makes it vulnerable to HTML obfuscation attacks and limits its ability to capture semantic content. In contrast, CrossPhire’s use of markup-free text extraction avoids DOM-based vulnerabilities while preserving semantic meaning through sentence transformers.
Tong et al. [41] proposed a new bimodal approach utilizing the raw URL and the sequence of HTML tags, applying ConvBERT with positional encoding to both the URL and HTML inputs. Likewise, Opara et al. [30] introduced WebPhish, a hybrid deep neural network utilizing the raw URL and HTML content with CNNs. Their proposed network concatenates character-level embeddings from URLs and word embeddings from HTML. WebPhish outperformed its unimodal alternatives, achieving an accuracy of 98.1% on their self-collected dataset. However, WebPhish’s word-level HTML embeddings are language-specific and training-data dependent, limiting cross-lingual generalization. CrossPhire addresses this through multilingual sentence transformers (MPNet and XLM-RoBERTa) that provide language-independent semantic representations.
Lee et al. [9] introduced a two-step phishing detection method based on large language models (LLMs) using the web page’s screenshot and HTML. The approach uses LLM prompts for brand identification and domain verification. The method achieved 90% precision and recall for a collection of 4480 samples. While leveraging LLMs offers strong contextual understanding, this approach’s complete reliance on third-party commercial APIs (GPT-4) introduces cost, latency, and availability constraints that limit practical deployment. In contrast, CrossPhire’s self-contained architecture enables offline operation and direct model optimization without external dependencies.

2.5. Multimodal Approaches

According to our taxonomy, multimodal phishing detection methods integrate more than two data modalities (e.g., URL, HTML, and images) to create comprehensive models that capture various aspects of phishing web pages. We review the latest multimodal approaches that combine the three main data modalities for phishing detection.
Rao and Pais [11] introduced Jail-Phish, a search engine-based system designed to address Phishing Sites Hosted on Compromised Servers (PSHCS). Jail-Phish classifies web pages by querying Google and comparing domain and title information with the top 10 search results. The system extracts features from the URL, CSS, JavaScript, and image files (e.g., logos or favicons) to calculate the Jaccard similarity between the query and result pages. Jail-Phish achieved an accuracy of 98.61% on a dataset of 11,500 samples. However, its reliance on Google Search API introduces latency, third-party dependency, and potential failures when phishing sites become indexed. CrossPhire’s self-contained feature extraction avoids these external dependencies while maintaining high accuracy.
Yu et al. [13] developed a phishing detection method by combining URL, HTML, and image features. Their approach uses LSTM layers to process URL and HTML text data, while CNNs with CBAM attention extract image features. Several pre-processing steps are applied to the screenshot. The concatenated feature vectors from each modality are then fed into fully connected layers, achieving a 97.75% accuracy with a multilayer perceptron on a dataset of 6000 samples. The authors also experimented with combinations of the different modalities, resulting in slightly lower accuracies of ∼93% and ∼96%. While their CBAM attention mechanism enhances feature extraction, the multi-stage training process (separate modality encoders) prevents end-to-end joint optimization. CrossPhire’s unified training framework allows gradient flow across all modalities, enabling the discovery of cross-modal patterns unavailable in separately-trained architectures.
Lin et al. [12] presented SenseInput, a multimodal system leveraging URL, HTML, and screenshots. The system introduced nine new features, including statistical and sensitive input features, using LightGBM to achieve an F1-score of 98.48%. Despite strong performance, SenseInput relies on 22 handcrafted features requiring domain expertise and manual engineering, making it labor-intensive and potentially brittle to evolving attack patterns. In contrast, CrossPhire’s automatic feature extraction via deep learning eliminates manual feature engineering while maintaining adaptability to new phishing tactics. In a similar vein, Tan et al. [14] extended PhishWHO [42] by incorporating both visual and textual identity features using logo and text extraction to identify phishing sites. They validated their model on two datasets (DS-1 and DS-2), achieving an accuracy of 98.60%, with the approach excelling at detecting phishing sites that use either textual or visual identities. However, the method’s dependence on reverse search engines and domain verification services introduces latency and requires continuous internet connectivity. CrossPhire’s offline-capable architecture provides deployment flexibility without compromising detection quality.
Zhou et al. [15] proposed a multimodal phishing detection method integrating URL, text, and screenshot data. Their MultiiRECG approach utilized model stacking and achieved an accuracy of 88.82%, outperforming unimodal methods in handling 11 phishing categories. While model stacking combines diverse classifiers, it requires multi-stage training and lacks end-to-end optimization, limiting the model’s ability to learn joint representations. Additionally, the reliance on handcrafted URL features introduces manual effort and potential fragility.
Li et al. [43] proposed KnowPhish, a novel brand knowledge base (BKB) incorporating around 20k targeted brands and a method for detecting phishing web pages that combines visual and textual modalities. Their approach, KnowPhish Detector, outperformed the baseline approaches, achieving accuracy of 92.49% for the TR-OP dataset. KnowPhish’s strength lies in its comprehensive brand knowledge base; however, maintaining and updating 20k brand profiles requires continuous manual curation. CrossPhire’s brand-agnostic approach using semantic text analysis generalizes beyond known brands without requiring brand-specific knowledge bases.
KnowPhish was then employed by Cao et al. [44] in their proposed approach, PhishAgent. PhishAgent uses both online and offline knowledge bases to detect the targeted brand using HTML and the web page’s logo and classify the web page by comparing the domains. PhishAgent achieved an average accuracy of 95.03% on three benchmark datasets. While PhishAgent demonstrates strong performance through its dual knowledge base strategy, it inherits KnowPhish’s brand-database maintenance burden, and its need for brand-specific training data limits scalability to emerging brands.
These studies highlight the increasing trend of integrating multimodality to improve detection accuracy and robustness against various phishing attacks.

2.6. Comparative Analysis of Multimodal Approaches

To systematically position CrossPhire within the landscape of multimodal phishing detection methods, Table 1 presents a comprehensive comparison of architectural designs and fusion strategies across recent multimodal approaches. As demonstrated, CrossPhire distinguishes itself through four key architectural innovations: (1) automatic semantic feature extraction via sentence transformers on markup-free text, eliminating the handcrafted features required by Jail-Phish [11] and SenseInput [12]; (2) end-to-end joint training of all modality encoders through a unified loss function, contrasting with the multi-stage pipelines employed by Yu et al. [13] and Zhou et al. [15] where modalities are processed and optimized separately; (3) minimal third-party dependencies for core functionality, unlike Lee et al. [9] which requires GPT-4 API access and Tan et al. [14] which relies on domain verification services; and (4) language-independent semantic analysis through multilingual sentence transformers, addressing the English-only limitation observed in WebPhish [30].
The fusion strategy comparison reveals a critical architectural distinction: while methods like Yu et al. [13] employ late fusion by concatenating independently-learned representations, CrossPhire’s early fusion approach enables gradient flow through all modality pathways during backpropagation. This allows the network to discover complementary cross-modal patterns between URL syntax, visual appearance, and semantic content that would remain hidden in multi-stage architectures where each modality encoder is optimized in isolation.
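A minimal pure-Python sketch of the early-fusion idea (illustrative only; in the real architecture the fused vector feeds a trained classifier head, the weights are learned, and gradients reach all three modality encoders during backpropagation):

```python
def early_fusion_score(url_emb, vis_emb, txt_emb, weights, bias=0.0):
    """Early fusion: concatenate per-modality embeddings into one joint
    vector and score it with a single shared linear head. Because the head
    consumes the joint vector, its training signal reaches every encoder."""
    fused = list(url_emb) + list(vis_emb) + list(txt_emb)
    if len(weights) != len(fused):
        raise ValueError("head weight size must match fused dimension")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return fused, logit

fused, logit = early_fusion_score([0.5, -1.0], [2.0], [0.25],
                                  weights=[1.0, 0.5, 0.1, 4.0])
```

Late fusion, by contrast, would score each modality with its own separately trained head and only combine the resulting decisions, which is exactly what prevents joint cross-modal representation learning.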

3. Motivation

A review of existing anti-phishing studies reveals several shortcomings: lack of standardized datasets, reliance on manual feature engineering, language dependency (typically, English only), third-party service dependencies (search engines, WHOIS, domain age), and vulnerability to attackers who manipulate handcrafted features (e.g., HTTPS, URL length). Additionally, single modality systems are easily bypassed through obfuscation tactics (embedding text in images, HTML obfuscation, link redirection), and vision-based approaches fail when phishing pages do not mimic known brands.
We hypothesize that most phishing web pages, regardless of context, contain malicious textual content that demands private information from victims in varying forms, and that extracting it requires careful carving of the HTML. According to our idea, if this phishing-related, markup-free textual content is detected, separated from the other redundant content, and mapped into a discriminative embedding vector in a high-dimensional space, then we will be able to (1) obtain the semantic vector space of phishing intentions, (2) utilize it in anti-phishing, and (3) evaluate how this space differs from that of legitimate content. The choice of markup-free text extraction is based on the theoretical distinction between the content and the form of web pages. Michailidou et al. [45] distinguish explicitly between ‘content on the website’ and ‘the website’s form with respect to user interface, navigation and structure’ as independent factors affecting page perception. Crucially, Deng and Poole [46] demonstrate that previous research has only considered web page elements as isolated aesthetic factors and lacks a coherent theoretical framework, suggesting that markup elements contribute to visual aesthetics rather than semantic understanding. Michailidou et al. [45], moreover, define visual complexity as arising from ‘the quantity of objects, clutter, openness, symmetry, organisation and variety of colors’. Including DOM structure would therefore introduce visual complexity features into a task that requires semantic content analysis. In short, since phishing detection fundamentally relies on identifying deceptive linguistic patterns in textual content rather than aesthetic or navigational properties, DOM-aware modelling would introduce noise from form-related features that are unrelated to the semantic deception signals we seek to capture.
Additionally, theoretical frameworks for web page aesthetics (order and complexity) have been found to capture the main influence of the environment on people’s aesthetic experience [46], emphasising that structure influences aesthetic judgement rather than content comprehension. For phishing classification, where the goal is to detect semantic and linguistic deception, markup-free text provides a cleaner signal by removing these aesthetic-oriented structural features.
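The markup-free extraction idea can be approximated with the Python standard library alone (a sketch of the general technique, not the paper's exact carving procedure): keep only human-visible text while discarding tag structure and non-visible script/style content.

```python
from html.parser import HTMLParser

class MarkupFreeExtractor(HTMLParser):
    """Collect only human-visible text, skipping script/style/noscript."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def markup_free_text(html: str) -> str:
    parser = MarkupFreeExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>h1{color:red}</style>"
        "<script>var x=1;</script></head>"
        "<body><h1>Verify your account</h1>"
        "<p>Enter your password now.</p></body></html>")
text = markup_free_text(page)
```

The resulting string ("Verify your account Enter your password now.") is what a sentence transformer would then encode, leaving all structural markup behind.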
Last but not least, single modality detection systems, which rely on only one source of information, are also easier for attackers to bypass through various obfuscation tactics, such as embedding text in images, link redirection, or HTML obfuscation. This narrow scope also leaves systems vulnerable to zero-day attacks, where newly developed tactics exploit new vulnerabilities to evade detection. Although multimodality has demonstrated effectiveness in improving detection, it remains underutilized.
To mitigate the problems listed above, we introduce a new multimodal anti-phishing approach, CrossPhire, wrapped in an end-to-end deep learning model that leverages three sources of information: (i) the URL, (ii) the main textual content, and (iii) the web page screenshot. Sharing ideas with other multimodal anti-phishing studies at the fundamental level, we aim to capture the essence of phishing intention from different perspectives. To achieve this efficiently and effectively, we build a three-branched deep neural network architecture that produces highly discriminative embeddings for each modality and train it jointly in an end-to-end fashion. This study thus delivers a modern, end-to-end trainable network-in-network architecture that analyzes web pages comprehensively to uncover different aspects of phishing evidence, without relying on any third-party service or hand-crafted features. Moreover, we introduce and release a new multimodal anti-phishing dataset that was carefully collected and curated for diversity and completeness.

4. Datasets

This section begins with the motivation behind our dataset effort. Next, we introduce our new multimodal dataset, Phish360, followed by an overview of the four publicly available benchmark datasets used in our study. Lastly, we present an in-depth quality and diversity evaluation of all datasets.
Despite the long history of phishing attacks, the anti-phishing research community still lacks a standardized benchmark dataset comparable to ImageNet in computer vision. This has led to fragmented research using datasets collected across different time periods, making direct comparisons between methods challenging. Recent years have seen improvements with larger datasets [3,26] and multimodal collections incorporating HTML markup, screenshots, and URLs [37,38,39,47]. However, our initial analysis revealed persistent issues including screenshot rendering errors [47], incomplete and missing content [37,39], and duplicate samples [38].
To address these limitations, we developed Phish360, a new multimodal dataset designed to ensure (i) uniqueness of samples, (ii) diversity in URL features and targeted brands, and (iii) standardized, high-quality screenshots. This dataset forms the foundation for our experimental evaluations, alongside four publicly available benchmark datasets that enable comprehensive performance comparison.

4.1. A New Multimodal Dataset: Phish360

We introduce Phish360, a new multimodal anti-phishing dataset designed to ensure sample uniqueness, diversity, and standardized screenshot quality. To collect data, we developed a custom Java-based multi-threaded crawler with two modes: (1) seed mode for discovering new pages from seed URLs, and (2) list mode for crawling from URL lists. The crawler uses Selenium to render pages at 1280 × 960 pixels and saves them as .png files with embedded HTML and URL metadata. We also developed PhishBoring, a GUI tool for visual inspection and quality control (Figure 1).
Legitimate samples were collected by seeding from Alexa’s top 100 pages, accessing over 50,000 web pages. Phishing samples were collected weekly between 2020 and 2024 from PhishTank and OpenPhish. After manual inspection and duplicate removal using PhishBoring and Duplicate Image Finder, the final dataset contains 10,748 unique samples (6416 legitimate, 4332 phishing), covering the largest time span among the benchmarking datasets.
As shown in Figure 2, the dataset is organized into TrainVal (80%) and Test (20%) folders. Each sample is stored separately following the naming convention Pxxxx_brand for phishing and Lxxxx_legitimate for legitimate samples, including URL, HTML, screenshot, and label.txt files. The Phish360 dataset and codebase presented in this study are available for academic use at https://web.cs.hacettepe.edu.tr/~selman/phish360-dataset/.

4.2. Benchmarking Datasets

To assess the effectiveness of our proposed methodology, we evaluate it using four publicly available multimodal datasets: PWD2016 [47], PhishIntention [39], PILWD134K [38], and VanNL126K [37]. These datasets span different time periods and consist of URL, HTML content, and screenshots. We summarize the data collection information in Table 2.
PWD2016: The Phishing Website Dataset (PWD2016) was collected in 2016 [47], containing 30,000 website samples in total, equally divided into phishing and legitimate classes. The phishing samples were collected from PhishTank, and to ensure the inclusion of less popular legitimate websites, the authors collected samples from DMOZ, BOTW, and Alexa.
PhishIntention: The PhishIntention dataset consists of 29,496 phishing samples, 25,400 legitimate samples, and 3049 misleading legitimate samples [39]. The legitimate samples were collected from Alexa, while the phishing cases were collected from OpenPhish between 2019 and 2020. Misleading legitimate samples are sign-in/login pages of legitimate websites, including but not limited to popular social media platforms (e.g., LinkedIn, Facebook, or Google). In this work, we merge the legitimate and misleading legitimate samples into a single legitimate class to create a more challenging setting, since many of the phishing samples involve similar login pages.
PILWD-134K: The Phishing Index Login Website Dataset (PILWD-134K) consists of 133,928 samples, constituting the largest benchmarking data in our experiment, divided equally into phishing and legitimate samples [38]. The legitimate samples were collected from Quantcast Top Sites and Majestic List, while the phishing samples were collected from PhishTank between 2019 and 2020.
VanNL126K: This dataset has an imbalanced class distribution, with 125,938 phishing and 25,938 legitimate samples [37]. According to the authors, the legitimate samples were collected from the DMOZ directory, whereas the phishing cases were collected from PhishTank, OpenPhish, and PhishStats between September and December 2019.

4.3. Evaluation of Datasets

We perform an in-depth analysis of the datasets, starting with evaluating URL samples to assess diversity, then examining other data modalities using preprocessed Parquet files for comprehensive analysis.

4.3.1. URL Analysis

This section examines the distinctions between legitimate and phishing URLs across various datasets to identify trends and concealed patterns. Thus, we begin by obtaining histograms of URL lengths to examine the overall distribution and frequency of values. As illustrated in Figure 3, there are discernible discrepancies in URL lengths between phishing and legitimate samples across a range of datasets. For instance, in the PWD2016 dataset, legitimate URLs are significantly shorter than those in other datasets. Upon further examination, it was discovered that all legitimate URL samples in PWD2016 are restricted to just the domain and top-level domain (e.g., facebook.com). This could introduce bias in URL-based detection methods. Similarly, datasets such as PhishIntention and VanNL126k demonstrate a higher prevalence of shorter legitimate URLs, whereas phishing URLs tend to be longer due to the incorporation of additional path and query parameters employed by attackers to direct victims to malicious landing pages.
To evaluate diversity, we calculated the proportion of unique URL components (domains, TLDs, FLDs, subdomains). Table 3 and Table 4 show phishing and legitimate metrics. Legitimate samples demonstrate greater diversity than phishing samples. Phish360 exhibits the highest uniqueness (73.63% unique domains and 28.69% subdomains) by spacing collection over extended periods to minimize near-duplicate URLs from the same domain.
Table 3 reveals a relatively low percentage of unique phishing URLs in PWD2016, with a considerable number of duplicate samples. The high number of exact duplicate URLs presents a significant challenge that can lead to data leakage in machine learning. Data leakage occurs when there is overlap between the training and testing data sets, resulting in models that are overly optimistic and fail to generalize. Reducing the number of duplicate URLs is crucial to prevent biased results and enhance model performance on new data. This issue is worthy of attention because a unimodal URL-based approach would be susceptible to data leakage if duplicate URL samples were not minimized.
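The uniqueness proportions discussed above can be approximated with a short script. The following is an illustrative sketch using only the standard library's `urlparse` (our own tooling choice, not one stated in the paper); a real analysis of TLDs and FLDs would require a public-suffix list, and `sample` is a toy URL list.

```python
from urllib.parse import urlparse

def uniqueness_ratios(urls):
    """Proportion of unique full URLs and unique hostnames in a collection.

    A rough proxy for the domain/subdomain diversity metrics: duplicate
    URLs from the same phishing campaign lower both ratios.
    """
    hosts = [urlparse(u).hostname or "" for u in urls]
    return {
        "unique_urls": len(set(urls)) / len(urls),
        "unique_hosts": len(set(hosts)) / len(urls),
    }

sample = [
    "https://login.example.com/verify?id=1",
    "https://login.example.com/verify?id=1",   # exact duplicate
    "https://pay.example-secure.net/update",
]
print(uniqueness_ratios(sample))
```

Low `unique_urls` values, as observed for PWD2016 in Table 3, are exactly what makes naive train/test splitting prone to leakage.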

4.3.2. Content and Screenshot Analysis

We analyzed HTML code and extracted text using Trafilatura (TF) and BeautifulSoup (BS), which yield superior quality compared to html2text and lxml.
  • Content
We compare the percentages of unique HTML and BS text for phishing and legitimate samples. As shown in Figure 4, phishing samples exhibit reduced uniqueness due to HTML code reuse across disparate URLs, with benchmark datasets ranging from 16% to 61% unique phishing HTML. In contrast, Phish360 exhibits a markedly higher degree of uniqueness in its phishing HTML samples, reaching 96.63%. This can be attributed to three key factors: manual sample verification, source diversification, and sample collection over extended periods, which collectively enhanced both the uniqueness and diversity of the phishing samples.
A potential issue inherent to single-modality anti-phishing datasets, particularly those comprising screenshots or HTML content, is the risk of data leakage due to duplicate samples. Multimodal datasets that integrate diverse information sources can alleviate this issue by enhancing overall variability. To evaluate the degree of uniqueness, we consider each sample as a triplet comprising the URL, the extracted textual content (BS or TF), and a hash of the screenshot image, and employ the SHA-256 hash function to generate a unique identifier for each web page. As illustrated in Figure 5, this triplet representation notably increases the percentage of unique samples in the benchmark datasets, from approximately 20% to 80–90%, even when individual components such as URLs or HTML are duplicated. The percentages in subfigures (a) and (b) of Figure 5 are calculated by dividing the number of unique triplets by the number of valid samples (those having a valid URL, HTML, and image). Furthermore, during Phish360 curation, samples underwent multiple rounds of human visual inspection supported by Duplicate Image Finder to eliminate even highly similar screenshots.
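The triplet-based uniqueness check can be sketched as follows. This is a minimal illustration with `hashlib`, assuming each sample is a `(url, html, screenshot_bytes)` tuple; the field layout is ours for illustration, not the released dataset's on-disk format.

```python
import hashlib

def triplet_id(url: str, html: str, screenshot: bytes) -> str:
    """SHA-256 identifier over the (URL, HTML, screenshot-hash) triplet."""
    h = hashlib.sha256()
    h.update(url.encode("utf-8"))
    h.update(html.encode("utf-8"))
    h.update(hashlib.sha256(screenshot).digest())  # hash of the image
    return h.hexdigest()

def unique_triplet_ratio(samples):
    ids = {triplet_id(u, m, s) for u, m, s in samples}
    return len(ids) / len(samples)

samples = [
    ("http://a.test", "<html>A</html>", b"img-bytes-1"),
    ("http://a.test", "<html>A</html>", b"img-bytes-1"),  # full duplicate
    ("http://a.test", "<html>A</html>", b"img-bytes-2"),  # new screenshot only
]
print(unique_triplet_ratio(samples))
```

Two samples collapse to the same identifier only when all three components match, which is why the triplet view reports far higher uniqueness than any single modality.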
An important diversity measure often overlooked in anti-phishing datasets is linguistic composition. Language diversity is essential given the multilingual nature of phishing websites, which target far more than English-speaking users. Recent APWG phishing activity trend reports have shown significant increases in attacks across various countries, underscoring the necessity for datasets to cover a comprehensive range of languages. As shown in Figure 6, Phish360 incorporates several widely spoken world languages, including English, German, French, Spanish, and Portuguese; the collection process ensured diversity by including 30 languages for phishing samples and 27 for legitimate ones. Using the langdetect 1.0.9 Python package for language detection on the extracted plain text, Figure 6 presents the language distribution of Phish360 samples for both classes compared to the benchmark datasets, illustrating the superior linguistic diversity of our dataset.
  • Screenshot
Upon examining screenshot images within the four benchmark datasets, we noticed discrepancies in resolution dimensions. Contrary to expectations of consistent image dimensions across all screenshots, we noted variations:
  • PhishIntention: A total of 38% of screenshot image sizes are 1920 × 1080 pixels, 19% are 1366 × 768 pixels, and the remaining images vary in dimensions.
  • PWD2016: The image sizes are distributed as follows: 6% are 510 × 1330 pixels, 2% are 18 × 18 pixels, and the remaining images vary in size.
  • PILWD-134K: The majority of screenshot image sizes, constituting 68%, are 1906 × 922. A smaller proportion, 22%, are sized at 1853 × 922. The remaining images are distributed across 26 different dimensions.
  • VanNL126k: All screenshot images are 1280 × 768.
In contrast to benchmark datasets, 99.8% of screenshots in the Phish360 dataset have a consistent resolution of 1280 × 960 pixels. It is worth noting that we also identified screenshots of inactive websites, as well as entirely white or black images, within the benchmark datasets.

5. Methodology

This section details the CrossPhire framework for robust phishing detection. We first provide details of data preprocessing and feature encoding for each modality, followed by an introduction to the neural architecture. Figure 7 overviews the end-to-end workflow. Next, we explain the evaluation strategy.

5.1. Data Preprocessing

In this subsection, we outline the preprocessing steps for our datasets, explaining how we extract raw data and obtain the required representations. We process the multimodal data in a column-oriented format using Pandas data frames for efficient storage and retrieval. Since handling all datasets in a single large data frame would be memory-intensive, we store each dataset in a separate data frame and add derived columns for subsequent analysis. The rationale behind using data frames is to avoid reprocessing the data from scratch, enabling fast experimentation.
Using the Parquet format, we can load only the required columns into memory, optimizing performance when working with large datasets. This step is crucial for handling multimodal datasets with extensive textual observations.
We initially extracted plain text from the HTML source code using five Python parser libraries: Trafilatura [48], BeautifulSoup [49], html2text [50], lxml [51], and html_text [52]. Each library produced slightly different outputs due to variations in its text extraction process. In preliminary experiments (not reported here), we found that the most successful parsing (i.e., yielding the most discriminative information) was achieved by BeautifulSoup and Trafilatura; hence, the experiments presented in the remainder of this paper utilize these two parsers. Table 5 provides a detailed breakdown of each data frame column, showcasing a sample from the Phish360 dataset and illustrating the differences across the obtained text contents.
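As a minimal stand-in for the parser comparison above, the following standard-library sketch strips markup down to plain text. The actual pipeline uses BeautifulSoup and Trafilatura, which handle boilerplate removal and malformed HTML far more robustly; this toy extractor only illustrates the idea of markup-free text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def markup_free_text(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.parts)

html = ("<html><head><style>body{}</style></head><body>"
        "<h1>Verify your account</h1><script>x=1;</script>"
        "<p>Enter your password now.</p></body></html>")
print(markup_free_text(html))  # → "Verify your account Enter your password now."
```

Note how the phishing-typical wording (“Verify your account”, “Enter your password”) survives while all structural markup, scripts, and styles are discarded, which is precisely the signal the ST encoders consume.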
Our dataset includes three key data modalities: URLs, HTML source code, and web page screenshots. We ensured that each sample contains all three modalities and removed those with missing or corrupted files. Table 6 summarizes the missing or invalid files in each dataset. The table includes details on missing files, files with naming issues, and files with invalid extensions or HTML encoding errors.
We processed all multimodal datasets in the same way and saved the resulting data frames into separate Parquet files. Notably, we stored each dataset’s data frame into two Parquet files, separating the samples classwise: (1) phishing and (2) legitimate.

5.2. Encoding the Modalities

In this subsection, we present the details of our feature encoders devoted to each modality.

5.2.1. URL

URLs are a commonly employed source of information in the phishing detection literature, since they involve rich wording patterns that can be extracted through several NLP methods, ranging from conventional TF-IDF to recent transformer-based approaches. In this study, we employ GramBeddings [3] to obtain URL-based representations. For this algorithmic selection, we first benchmarked GramBeddings against other well-known URL-based models, namely URLNet [53] and URLTran [54]. We further assessed the models’ generalization capability on unseen data through cross-dataset experiments, training each model on one dataset and testing it on another.
The GramBeddings model combines character-level (unigram) and specific n-gram features (4, 5, and 6) to create embeddings that capture contextual nuances in the URL. In the GramBeddings pipeline, URLs are tokenized, with each character encoded numerically. Character-level embeddings are padded or truncated to a fixed length of 128. For n-gram analysis, chi-square-based feature selection is applied to address the curse of dimensionality problem, and identical sub-networks are used to produce independent feature vectors. Each sub-network includes convolutional and BiLSTM layers, enhanced by an attention mechanism to highlight crucial features [3]. The resultant 1024-dimensional embedding vector is further condensed to 256, ready for integration with CrossPhire’s multimodal architecture.
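To illustrate the n-gram view of a URL underlying GramBeddings (character unigrams plus 4-, 5-, and 6-grams), here is a small sketch. The real model additionally applies chi-square feature selection and learns embeddings through CNN–BiLSTM–attention channels, all of which are omitted here; the URL below is a fabricated example.

```python
def char_ngrams(url: str, n: int):
    """Sliding-window character n-grams of a URL string."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]

url = "http://paypa1-secure.com/login"
views = {n: char_ngrams(url, n) for n in (1, 4, 5, 6)}  # the four channels
print(views[4][:3])  # → ['http', 'ttp:', 'tp:/']
```

Overlapping n-grams let the model notice suspicious substrings (e.g., the digit substitution in “paypa1”) that whole-token features would miss.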

5.2.2. Screenshot

In CrossPhire, we utilize web page screenshots to capture visual cues that help distinguish phishing web pages from legitimate ones, as attackers often design phishing materials to mimic legitimate websites visually. For image classification and dense embedding generation, we employ ResNet [55] and DenseNet [56], two CNN architectures that have achieved remarkable results. ResNet is more oriented to detecting local features, whereas DenseNet considers both local and global relationships through dense skip connections. We selected ResNet50 and DenseNet121 for their balance of effectiveness and runtime complexity.
Residual Networks (ResNet): ResNet utilizes residual blocks with skip connections to address the vanishing gradient problem, allowing gradients to flow more effectively across layers [55]. The fundamental building block is defined as:
$y = F(x, \{W_i\}) + x$
where $x$ and $y$ denote the input and output, and $F(\cdot)$ represents the residual mapping.
Densely Connected Networks (DenseNet): DenseNet enhances feature reuse and gradient flow by connecting each layer to every other layer within a dense block [56]. The output of layer $\ell$ is given by:
$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$
where $[x_0, \ldots, x_{\ell-1}]$ denotes the concatenation of the feature maps from all preceding layers [56].
In our experiments, we evaluated both architectures to select the model that best enhances CrossPhire’s visual analysis capabilities.

5.2.3. Text

While previous studies [16,57] have utilized HTML source code, they often overlooked the potential of using markup-free text, which leads to the loss of semantic relationships. We address this gap by extracting markup-free text from HTML using BeautifulSoup [49] and Trafilatura [48] to capture the malicious and misleading context that characterizes phishing attempts, as such content maintains contextual connections with documented attack patterns [58].
To represent this text, we employ sentence transformer models that convert the extracted content into numerical embeddings capturing context and semantic meaning. Unlike traditional word embeddings, sentence transformers leverage transformer architectures and self-attention mechanisms to produce embeddings that preserve word order, context, and meaning within the entire paragraph [59]. To the best of our knowledge, this is the first application of sentence transformers on markup-free text for phishing web page detection.
In this study, we leveraged the sentence transformer ‘all-mpnet-base-v2’ based on MPNet [60], which combines masked language modeling (MLM) from BERT and permuted language modeling (PLM) from XLNet to inherit the advantages of both models. The training objective optimizes:
$\mathbb{E}_{z \in \mathcal{Z}_n} \sum_{t=c+1}^{n} \log P\!\left(x_{z_t} \mid x_{z_{<t}}, M_{z_{>c}}; \theta\right)$
where MPNet leverages the mask tokens $M_{z_{>c}}$ as inputs, conditioning on the preceding tokens $x_{z_{<t}}$ to model token interdependencies and positional information effectively [60]. The sentence embedding is obtained through mean-pooling:
$\text{Sentence Embedding} = \frac{1}{n} \sum_{i=1}^{n} h_i$
where $h_i$ represents the $i$-th token embedding after the transformer layers, producing a 768-dimensional vector that captures context and semantic meaning.
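The mean-pooling step can be sketched with NumPy. Here, `tokens` is a hypothetical matrix of per-token hidden states (random placeholders, not actual MPNet activations); the pooled vector has the model's 768-dimensional embedding size.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average the token vectors h_i into a single sentence embedding."""
    return token_embeddings.mean(axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 768))  # 12 tokens, 768-d hidden states
sentence = mean_pool(tokens)
print(sentence.shape)  # → (768,)
```

Because pooling averages over however many tokens the page yields (up to the 512-token limit), every page is mapped to a fixed-size vector regardless of its text length.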
Given that many legitimate and phishing web pages are non-English, we ensured that our approach is language-independent and supports non-English text processing. We apply two methods: first, a multilingual ST model trained on 100 languages [61] to generate comparable embeddings across languages; and second, translating non-English text to English before applying a monolingual ST model. This dual approach allows us to compare the representation quality of monolingual and multilingual language models, using MPNet [60] for English-only text and XLM-RoBERTa [61] for multilingual content, respectively. Unlike our vision compartment, which contains a trainable image classifier, we use the ST models as fixed encoders without any supervised fine-tuning. Both MPNet and XLM-R produce fixed-size 768-dimensional vectors for input text of up to 512 tokens.

5.3. Proposed Approach: CrossPhire

CrossPhire is a multimodal deep learning model designed for phishing detection, leveraging web page data in three forms: the raw URL, screenshot image, and extracted text content. Unlike traditional phishing detection methods, which often involve extensive manual feature engineering, CrossPhire performs automatic feature extraction on these raw data modalities, minimizing preprocessing efforts. Specifically, CrossPhire is unique in extracting markup-free textual content from HTML.
As previously stated, we hypothesize that carving and vectorizing the web page’s information-rich textual content can reveal distinguishing characteristics that effectively differentiate phishing behavior from legitimacy. In this regard, we employed Trafilatura and BeautifulSoup packages.
CrossPhire integrates three specialized sub-neural networks: (i) GramBeddings for URL analysis [3], (ii) fine-tuned image models (ResNet and DenseNet) for visual feature extraction, and (iii) Sentence Transformer models for carved textual content encoding. Together, these modules fuse insights from each modality to capture both textual and visual patterns, enabling a binary classification into “phishing” (1) or “legitimate” (0). A detailed illustration of CrossPhire’s components is shown in Figure 7.
The GramBeddings approach embeds the URL string into a 256-dimensional vector through Equations (5)–(9), where $C_i$ represents each selected n-gram channel and $H_i$ denotes the BiLSTM module outputs.
$E_i = \mathrm{Embedding}(X_i) \in \mathbb{R}^{L \times d_i}$
$C_i^{(1)} = \mathrm{ReLU}(\mathrm{BatchNorm}(\mathrm{Conv1D}_{64}(E_i)))$
$C_i^{(2)} = \mathrm{SpatialDropout}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv1D}_{128}(C_i^{(1)}))))$
$(\overrightarrow{h_t}, \overleftarrow{h_t}) = \mathrm{BiLSTM}(C_i^{(2)})$
$H_i = [\overrightarrow{h_1}, \overleftarrow{h_1}, \overrightarrow{h_2}, \overleftarrow{h_2}, \ldots, \overrightarrow{h_L}, \overleftarrow{h_L}] \in \mathbb{R}^{L \times 2d_{rnn}}$
It should also be noted that each of the four n-gram encoding channels is equipped with a ZhangAttention layer, as given in Equations (10)–(12). Each attention layer has its own parameters $W_1^{(i)}$, $W_2^{(i)}$, and $V^{(i)}$, producing a context vector $c_i$.
$e_{it}^{(i)} = \tanh\!\left(W_1^{(i)} H_{it}^{(i)} + W_2^{(i)} s_i^{(i)}\right)$
$\alpha_{it}^{(i)} = \dfrac{\exp\!\left((V^{(i)})^{T} e_{it}^{(i)}\right)}{\sum_{j=1}^{L} \exp\!\left((V^{(i)})^{T} e_{ij}^{(i)}\right)}$
$c_i = \sum_{t=1}^{L} \alpha_{it}^{(i)} H_{it}^{(i)}$
Finally, the overall URL embedding is generated through the concatenation given in Equation (13), followed by a dense layer with a non-linear activation as presented in Equation (14).
$F_{url} = [c_{char}; c_{4gram}; c_{5gram}; c_{6gram}] \in \mathbb{R}^{4 \times 2d_{rnn}}$
$F_{url}^{final} = \mathrm{ReLU}(\mathrm{Dense}_{2d_{rnn}}(F_{url})) \in \mathbb{R}^{2d_{rnn}}$
Following the extraction of deep features from the URL, image, and textual content, each encoder produces an embedding vector for its respective data modality. We then apply several dense hidden layers to reduce the dimensions of these vectors and merge the reduced feature vectors (URL: 256, image: 512, text: 16) by concatenating them into a single 784-dimensional vector, as given in Equation (15). Finally, we add fully connected (FC) and classification layers. Using the Adam optimizer, we jointly train the URL- and screenshot-related sub-networks while keeping the weights of the ST compartment frozen.
The ST models are maintained in a frozen state (gradients not updated during training) following established best practice in sentence transformer applications [59]. This design choice is theoretically motivated by the substantial scale mismatch between pre-training (1B+ sentence pairs) and our fine-tuning datasets (max 107K samples), which is a four-order-of-magnitude difference that typically leads to catastrophic forgetting where task-specific optimization degrades the general semantic capabilities that make pre-trained transformers valuable [60,62]. Moreover, phishing detection requires recognizing generalizable linguistic deception patterns (urgency cues, authority exploitation, credential solicitation) rather than dataset-specific lexical artifacts, making the preservation of broad semantic knowledge preferable to task-specific adaptation. As shown in the results, this architectural decision enables our model to achieve strong cross-dataset generalization while maintaining computational efficiency, as the frozen ST serves as a fixed semantic feature extractor while the URL encoder (GramBeddings) and vision encoder (ResNet/DenseNet) components are jointly trained to learn task-specific multimodal fusion patterns.
$F_{fused} = [F_{text}^{frozen}; F_{img}^{finetuned}; F_{url}^{trained}] \in \mathbb{R}^{d_{total}}$
where $d_{total} = 2d_{rnn} + 512 + 16 = 256 + 512 + 16 = 784$. In addition, the final classification is done through Equations (16)–(18):
$F_{dropout} = \mathrm{Dropout}_{0.2}(F_{fused})$
$F_{final} = \mathrm{ReLU}(\mathrm{Dense}_{128}(F_{dropout}))$
$P(y = 1 \mid x) = \sigma(\mathrm{Dense}_{1}(F_{final}))$
We use the binary cross-entropy loss, given in Equation (19), to optimize the whole network in the joint training procedure:
$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$
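A NumPy sketch of the fusion head and loss described above (Equations (15)–(18) plus binary cross-entropy) is given below. The modality vectors and weight matrices are random placeholders, not trained parameters, and dropout is omitted as it is disabled at inference time.

```python
import numpy as np

rng = np.random.default_rng(42)
# Placeholder modality embeddings: text (16-d), image (512-d), URL (256-d)
f_text, f_img, f_url = rng.normal(size=16), rng.normal(size=512), rng.normal(size=256)

# Concatenate the three modality embeddings: 16 + 512 + 256 = 784 dims
f_fused = np.concatenate([f_text, f_img, f_url])

# Dense(128) + ReLU (the 0.2 dropout is only active during training)
W1, b1 = rng.normal(size=(128, 784)) * 0.05, np.zeros(128)
f_final = np.maximum(W1 @ f_fused + b1, 0.0)

# Dense(1) + sigmoid yields P(phishing | x)
w2, b2 = rng.normal(size=128) * 0.05, 0.0
p = float(1.0 / (1.0 + np.exp(-(w2 @ f_final + b2))))

def bce(y: float, q: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for a single prediction."""
    q = min(max(q, eps), 1 - eps)
    return float(-(y * np.log(q) + (1 - y) * np.log(1 - q)))

print(f_fused.shape, 0.0 < p < 1.0)
```

Only the image and URL branches feed gradients back through this head during joint training; the frozen ST branch contributes a fixed 16-dimensional projection.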

5.4. Evaluation Strategy

As previously stated, our scheme was evaluated on four benchmark datasets in addition to Phish360 and Phish360-Zeroday, both of which we collected ourselves. We first evaluate our neural network on each dataset individually (same-dataset assessment) to understand its effectiveness on each dataset.
To further assess CrossPhire’s generalization capabilities on previously unseen data, a series of cross-dataset evaluations were conducted as well. In this context, CrossPhire is trained on the training portion of one dataset and subsequently evaluated on the testing portion of another. This technique enables an evaluation of CrossPhire’s resilience to a range of phishing techniques.
As CrossPhire is constituted of discrete components, single-modality experiments are also conducted to evaluate the performance of each component in isolation. The objective of these experiments is to identify the optimal model for each modality, ensuring their optimal integration into CrossPhire. Furthermore, cross-dataset experiments are conducted for the single-modality models to assess their generalizability across different datasets. Once the optimal unimodal models have been identified, they are integrated into CrossPhire, and its performance results across all datasets are presented, including both same-dataset and cross-dataset evaluations.
Ultimately, after identifying the optimal configuration for CrossPhire, we evaluate its performance in comparison to established baseline methods that utilize the same benchmark datasets. Given that our solution is multimodal, we also assess its efficacy in comparison to a prominent state-of-the-art text-image model, CLIP [63].

6. Experimental Setup and Results

In this section, we begin by introducing our experimental setup. We then provide an extensive set of assessments devoted to the sole use of each modality, followed by the assessment of CrossPhire utilizing all modalities in a joint training scheme. Afterwards, we compare our proposed scheme with other models in the literature. Finally, we assess our model on zero-day attacks and compare it with another multimodal approach, CLIP.

6.1. Experimental Setup

Our methodology was implemented in Python 3 on the Keras platform. To guarantee computational reproducibility and facilitate the reuse of CrossPhire, we made our codebase and datasets publicly accessible to the research community. During the training phase, CrossPhire was configured with a batch size of 32, and the Adam optimizer was employed with an initial learning rate of 0.001. A cosine annealing schedule dynamically adjusted the learning rate throughout training, while binary cross-entropy served as the loss function.
The number of training epochs varied depending on the specific experimental conditions. In instances where the training and testing data were derived from a single source (same-dataset evaluations), CrossPhire was trained for 20 epochs. In the case of cross-dataset evaluations, where the training and testing data were derived from disparate sources, the CrossPhire model was trained for 30 epochs. Another crucial element is the ratio of training to testing data. While there is no definitive rule for splitting datasets, the majority of anti-phishing research employs a train-test ratio of 80%–20% or 70%–30%, as mentioned by Adane and Beyene [64]. During our evaluation experiments, we adopted an 80%–20% training-to-testing split, setting the random_state to 42 to ensure consistent data splits. Throughout the experiments, we employed standard metrics, including accuracy, precision, recall, and F1 score.
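The cosine annealing schedule mentioned above can be sketched as follows. This is a standard cosine-decay formulation; the paper does not specify its exact variant (e.g., warm restarts or a minimum learning rate), so the restart-free form below is an assumption.

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       lr_max: float = 1e-3, lr_min: float = 0.0) -> float:
    """Cosine decay of the learning rate from lr_max down to lr_min."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

# Full rate at the start of training, fully annealed at the end.
print(cosine_annealed_lr(0, 100), cosine_annealed_lr(100, 100))
```

With the paper's initial rate of 0.001, the schedule starts at that value and smoothly decays to zero over the 20 (same-dataset) or 30 (cross-dataset) training epochs.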

6.2. Assessment and Selection of Modality Encoders

In this phase, a series of experiments is conducted to select the best encoder for each compartment of CrossPhire through a comprehensive assessment. To validate the single-modal encoders, we employ the benchmarking datasets and evaluate each candidate model’s generalization capability. Based on our literature review, we identified several models that align well with our goal, considering (i) effectiveness and (ii) efficiency simultaneously so as to yield a mobile-compatible model. We therefore evaluate (1) GramBeddings [3], (2) URLNet [53], and (3) URLTran [54] as candidate architectures for the URL-based classification component. For visual classification, we chose the pre-trained ResNet50 and DenseNet121 architectures. Similarly, we selected the pre-trained Sentence Transformers MPNet [60] (English-only) and XLM-RoBERTa [61] (multilingual) for content embedding generation. This setup also allows us to compare the efficacy of embeddings derived from the monolingual MPNet applied to English translations produced by a proprietary translator (i.e., Google Translate) against embeddings produced directly by the multilingual XLM-RoBERTa encoder across multiple languages.

6.2.1. URL Only-Based Assessment

As previously stated, phishing attackers continuously refine their tactics and develop new methods to bypass detection systems. Consequently, phishing URL samples from different years reflect a range of evolving tactics. To address this, we evaluated three different candidate URL models not only through the same-dataset mode but also by assessing their generalization performance in cross-dataset regime.
Table 7 presents a detailed comparison of the selected models, showcasing both same- and cross-dataset results. The same 80%–20% train-test split ratio was applied to ensure consistency across all experiments, with the seed value set to 42. The optimal performance for each experiment is highlighted in bold. The results demonstrate that GramBeddings exhibits superior performance compared to other URL models, achieving the highest test accuracy in eight out of 16 experiments. In light of these findings, the GramBeddings model has been integrated into CrossPhire for URL processing.
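Since GramBeddings operates on character n-grams of the raw URL, the unit of input it embeds can be illustrated with a minimal tokenizer (the n-gram sizes and example URL here are illustrative; the full model feeds such n-grams into learned embedding, convolutional, and attention layers):

```python
def char_ngrams(url: str, n_values=(3, 4, 5)):
    """Extract overlapping character n-grams from a URL -- the kind of
    unit GramBeddings embeds. This tokenizer itself is a sketch, not
    the original implementation."""
    grams = []
    for n in n_values:
        grams.extend(url[i:i + n] for i in range(len(url) - n + 1))
    return grams

# A look-alike character substitution ("paypa1") surfaces as distinct
# n-grams that never occur in the legitimate brand's URLs.
grams = char_ngrams("http://paypa1-login.example.com", n_values=(4,))
assert "ypa1" in grams
assert all(len(g) == 4 for g in grams)
```

Character-level units make such models resilient to out-of-vocabulary tokens, which is one reason URL-only encoders transfer reasonably well across datasets.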

6.2.2. Screenshot Only-Based Assessment

As with the URL-based assessment, a series of experiments was conducted to select either the ResNet50 or the DenseNet121 model. In these experiments, the models were fine-tuned on screenshot images from the benchmark datasets for phishing classification. The objective was to enhance the phishing detection capabilities of the models by leveraging pre-trained weights obtained from approximately 14.2 million ImageNet images. To ensure a fair comparison, the train-test splits and seed values were kept consistent across all experiments. The models were fine-tuned on the training sets of the benchmark datasets and then evaluated on the respective test sets. However, cross-dataset experimentation was not feasible due to the substantial GPU hours required and the limitation of a single GPU.
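When fine-tuning ImageNet-pre-trained backbones such as ResNet50 or DenseNet121, screenshots must be normalized with the same channel statistics the backbones were trained with. The constants below are the standard ImageNet mean/std; the resizing and augmentation details of the paper's pipeline are not specified, so this is only a preprocessing sketch:

```python
import numpy as np

# Standard ImageNet channel statistics (RGB), used by most
# ImageNet-pre-trained CNNs.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess_screenshot(img_uint8):
    """Scale an HxWx3 uint8 screenshot to [0, 1] and normalize
    per channel with ImageNet statistics."""
    img = img_uint8.astype(np.float32) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD

# A dummy mid-gray 224x224 screenshot crop
img = np.full((224, 224, 3), 128, dtype=np.uint8)
x = preprocess_screenshot(img)
assert x.shape == (224, 224, 3)
assert np.isfinite(x).all()
```

Reusing the pre-training normalization keeps the input distribution aligned with the frozen early layers, which is what makes transfer from ImageNet effective.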
Table 8 presents the results for both models across all datasets. In addition to test accuracy, the F1 scores are reported, integrating precision and recall into a single, more comprehensive metric. While the performance of both models is nearly identical on certain datasets, DenseNet121 exhibits a slight advantage over ResNet50. Consequently, both models were included in CrossPhire’s evaluation.

6.2.3. Text Only Content Assessment

This section presents the results of unimodal experiments using the extracted textual content. Two distinct categories of extracted texts were subjected to experimentation: one derived from Trafilatura (TF) and another from BeautifulSoup (BS). In contrast with the preceding unimodal experiments that involved the fine-tuning of neural networks, we utilize conventional machine learning (ML) models for text classification. This is because we employ sentence transformers (pre-trained language models) by maintaining their pre-trained weights in a frozen state and obtaining the sentence or paragraph embedding of the given textual input. Therefore, we have selected Support Vector Machine (SVM), XGBoost, and CatBoost classifiers, as they are widely used and effective in text classification tasks. The models were run using their default hyperparameter values.
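The frozen-encoder pipeline described above can be sketched as follows. Real embeddings would come from a frozen sentence transformer (e.g., `SentenceTransformer(...).encode(texts)`; the exact checkpoint name is not stated in the text), and the paper uses SVM, XGBoost, and CatBoost; here random vectors stand in for embeddings and a dependency-free nearest-centroid classifier stands in for those models:

```python
import numpy as np

# Stand-in embeddings: in the real pipeline these would be produced by
# a frozen sentence transformer, e.g.
#   emb = SentenceTransformer(<mpnet checkpoint>).encode(texts)
rng = np.random.default_rng(42)
phish_emb = rng.normal(loc=1.0, size=(100, 768))   # synthetic "phishing"
legit_emb = rng.normal(loc=-1.0, size=(100, 768))  # synthetic "legitimate"

def fit_centroids(pos, neg):
    """Train step of a nearest-centroid classifier (SVM stand-in)."""
    return pos.mean(axis=0), neg.mean(axis=0)

def predict(emb, c_pos, c_neg):
    """1 = phishing if the embedding is closer to the phishing centroid."""
    d_pos = np.linalg.norm(emb - c_pos, axis=-1)
    d_neg = np.linalg.norm(emb - c_neg, axis=-1)
    return (d_pos < d_neg).astype(int)

c_pos, c_neg = fit_centroids(phish_emb, legit_emb)
preds = predict(phish_emb, c_pos, c_neg)
assert preds.mean() > 0.9  # well-separated synthetic classes
```

The key design point is that only the lightweight downstream classifier is trained; the language model's weights stay frozen, keeping the text branch cheap to retrain as phishing language drifts.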
Given the existence of two distinct categories of extracted texts, four principal experiments were conducted: two for each text type (BS and TF) and two for the purpose of comparing the performance of monolingual (MPNet) and multilingual (XLM-R) sentence transformers. First, the original multilingual BS and TF texts were evaluated. As illustrated in Figure 8, the BS text exhibited greater discriminative features, resulting in higher classification accuracy scores across all the ML models tested.
In light of the enhanced performance demonstrated by the use of BS text, we conducted a comparative analysis of the efficacy of monolingual and multilingual sentence transformers, employing both the original and English-translated versions of the BS text. Table 9 compares the performance of multilingual and monolingual transformers using both the original and translated text. As evidenced in Table 9, the results demonstrate that when the translated BS text is used, MPNet exhibits enhanced accuracy compared to the original multilingual text processed by the XLM-R model.
These findings underscore the significance of selecting an appropriate input text parser, language, and sentence transformer, and their influence on overall performance. However, from our perspective, the most significant outcome of these findings is the considerable potential of token-limited (e.g., 512) texts in phishing classification, which we had previously hypothesized to be effective. The results clearly indicate that markup-free phishing and legitimate web page contents exhibit different semantics that can be mapped to separable manifolds in a high-dimensional embedding space via sentence transformers. The second most significant finding is that BeautifulSoup extracts text fragments with greater discriminatory power than Trafilatura, which challenges our initial assumptions (see Figure 8). Thirdly, the strong contribution of the monolingual MPNet prompts us to question the effectiveness of multilingual sentence transformers. The results presented in Table 9 indicate that a model trained on a more resource-rich language yields better discriminatory performance.

6.3. Assessment of Multimodal CrossPhire

Following the initial experiments to justify the design considerations for each compartment, we continue to evaluate the multimodal performance of our approach using all datasets. We first present the results of the same-dataset experiments, where the model is trained and validated on the same dataset. Next, we examine the cross-dataset experiments, where CrossPhire is trained on one dataset and validated on the others, to assess its generalization capability and robustness. We also compare two modality-integration techniques: (1) vector concatenation and (2) mixture of experts.

6.3.1. Evaluation in the Same-Dataset Regime

In this phase, we introduce CrossPhire’s performance assessment in the same-dataset regime. Throughout the experiments, CrossPhire integrates GramBeddings for URL analysis, ResNet and DenseNet for screenshot interpretation, and the monolingual MPNet sentence transformer for encoding translated BeautifulSoup content. As demonstrated in Table 10, CrossPhire exhibited remarkable performance, with accuracy scores ranging from 97.71% on the Phish360 dataset to a perfect 100% on the PWD2016 dataset. These findings underscore the advantage of the multimodal approach over reliance on any single modality within the same-dataset regime. Excluding the Phish360 dataset, the choice between ResNet and DenseNet does not have a substantial impact on the overall performance metrics. It is also noteworthy that using the XLM-R model with multilingual content yielded slightly lower accuracy scores (mean difference: 0.38 percentage points, s.d.: 0.21). Due to space constraints, we do not report the results of multimodal modeling through XLM-R.

6.3.2. Evaluation in the Cross-Dataset Regime

In this stage, the robustness of CrossPhire is assessed in a more challenging environment through cross-dataset evaluations. In this approach, the model is trained on a portion of one dataset and tested on a separate portion of a different dataset. The objective of this approach is to subject the neural architecture to an evaluation in which it is exposed to data that it has not previously encountered, with a view to assessing its capacity for generalization. This evaluation scheme is a further contribution to the study, as none of the related works covered conducted experiments of a similar nature. It is hypothesized that this evaluation scheme signifies a substantial contribution to the anti-phishing domain. During these evaluations, GramBeddings were utilized for URL embeddings, ResNet and DenseNet for screenshot-based feature extraction, and the MPNet sentence transformer to generate embeddings from translated BeautifulSoup text.
As demonstrated in Table 11, CrossPhire frequently exhibits robustness across a range of experiments, underscoring its efficacy in detecting phishing web pages from a multitude of sources. It is noteworthy that training on Phish360, the smallest multimodal dataset comprising approximately 8600 training samples, yields impressive results. When trained on Phish360, CrossPhire achieves an accuracy of 94.66% on the PhishIntention dataset, surpassing CrossPhire instances trained on larger datasets such as PILWD-134K and VanNL126K. Furthermore, CrossPhire trained on Phish360 achieved over 90% accuracy on all datasets except PILWD-134K, whose test set of approximately 26,800 samples is nearly three times the size of Phish360’s entire training set.
Conversely, models trained on PWD2016 demonstrated a conspicuous inability to generalize, exhibiting accuracy scores as low as 40% when evaluated on other datasets. We attribute this outcome to three primary factors:
  • The incompatibility of its screenshots;
  • The relative simplicity of the URLs in the PWD2016 dataset;
  • The distribution differences related to its collection period.
The model trained on PhishIntention achieved an accuracy of 99.69% when evaluated on PWD2016. In addition, when evaluated on PILWD-134K, the model trained on Phish360 outperformed the model trained on PhishIntention, achieving an accuracy of 85.74%, the highest among models trained on disparate datasets. Models trained on the largest dataset, PILWD-134K, demonstrated robust cross-dataset performance, with accuracy scores of 93.17% on VanNL126K, 96.27% on PhishIntention, 93.7% on PWD2016, and 93.38% on Phish360.

6.3.3. Fusion of Modalities via Mixture of Experts

Apart from the joint learning-based final feature vector concatenation, we tested the Mixture of Experts (MoE) model, which represents a breakthrough architecture that enables massive scaling of neural networks while maintaining computational efficiency through sparse activation patterns. First introduced by [65], MoE architectures consist of multiple specialized sub-networks called “experts” coordinated by a learned gating mechanism that routes inputs to the most relevant experts for each example. The gating mechanism serves as the architectural cornerstone, using softmax functions to compute probabilistic expert assignments based on input characteristics while expert specialization emerges naturally through competitive learning, where different experts develop domain-specific knowledge for distinct input regions or task types. Moreover, the expert choice routing [66] revolutionized the field by inverting the traditional paradigm—experts select top-k tokens rather than tokens selecting experts, achieving perfect load balancing and 2x improvement in training convergence. Recent open-source models like Mixtral 8x7B [67] democratized MoE technology, delivering 6x faster inference than comparable dense models while matching GPT-3.5 performance with only 12.9 billion active parameters from 46.7 billion total. MoE models are also capable of handling missing modalities. In our problem domain, it may not be possible to capture a screenshot or extract meaningful textual content. The MoE approach, thus, provides an inherent solution to the missing modality problem.
For this reason, in this study, we tested two fusion models: (1) vector concatenation and (2) MoE. In our MoE implementation, the model uses projection layers to integrate three modalities (URL, images, and content) into a common expert dimension space. It also incorporates learnable default embeddings to handle missing modalities, which directly addresses robustness issues. Before expert routing occurs, cross-modal attention layers allow each modality to consider the others, creating enhanced feature representations that capture intermodal dependencies. This architecture replaces conventional MLP experts with TransformerExpert networks, which contain multi-head self-attention and feed-forward blocks. This allows for more sophisticated feature processing within each expert. Instead of using simple linear gating, our model uses an attention-based routing mechanism that computes expert selection through multi-head attention between a global representation and learnable expert embeddings. We created a hierarchical attention design with multiple attention layers throughout the pipeline: cross-modal attention for feature enhancement, gating attention for expert routing, and final output attention for result refinement. We also included an optional load balancing loss, and comprehensive dropout and layer normalization ensure stable training while encouraging uniform expert utilization. During MoE model training, we found that using four MoE heads produced the best results. In summary, our MoE scheme is theoretically sound. It combines the advantages of sparse expert computation, attention-driven multimodal fusion, and resilience to missing modalities.
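The gating and default-embedding ideas above can be sketched as follows. This is a deliberately simplified numpy version: the paper's full model uses cross-modal attention, transformer experts, and attention-based routing, none of which are reproduced here; dimensions, weights, and the four-expert count are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_fuse(url_f, img_f, txt_f, gate_W, experts, defaults):
    """Simplified MoE fusion over three modality features: all-zero
    (missing) modalities are replaced by learnable default embeddings,
    a softmax gate computes expert weights, and expert outputs are
    mixed. Only the gating/default-embedding ideas are sketched."""
    feats = [fb if not np.any(f) else f
             for f, fb in zip((url_f, img_f, txt_f), defaults)]
    x = np.concatenate(feats)
    gates = softmax(gate_W @ x)            # probabilistic expert assignment
    out = sum(g * expert(x) for g, expert in zip(gates, experts))
    return out, gates

rng = np.random.default_rng(0)
dim = 8
defaults = [rng.normal(size=dim) for _ in range(3)]   # learnable in practice
gate_W = rng.normal(size=(4, 3 * dim))                # four experts
experts = [lambda x, W=rng.normal(size=(dim, 3 * dim)): W @ x
           for _ in range(4)]
# URL modality missing (all zeros) -> its default embedding is used instead
out, gates = moe_fuse(np.zeros(dim), rng.normal(size=dim),
                      rng.normal(size=dim), gate_W, experts, defaults)
assert np.isclose(gates.sum(), 1.0) and out.shape == (dim,)
```

Even in this toy form, the contrast with concatenation is visible: the gate reweights (and can effectively suppress) expert contributions, whereas concatenation hands the full joint vector to every downstream weight.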
We measured the performance of the MoE in both the same- and cross-dataset configurations. However, as the results given in Table 12 and Table 13 show, the MoE approach underperforms compared to the simple joint learning of concatenated embeddings. This outcome may be related to the fact that the concatenation operation provides full access to information, whereas the MoE creates information bottlenecks through its gated routing mechanisms. Further, the mathematical foundation explains this phenomenon: concatenation operates in the complete joint feature space f(x) = W[φ_a(x_a); φ_v(x_v)] + b, where all modalities contribute simultaneously to every prediction, while MoE’s gated approach f(x) = Σ_{i=1}^{n} g_i(x) · E_i(x_i) only allows the selected experts to contribute, potentially missing crucial cross-modal dependencies essential for the detection task. The “unimodal bias” problem identified by [68] shows that complex fusion architectures can rely too heavily on dominant modalities while ignoring the rest. This supports the advantage of concatenation in our problem domain, where all modality information must be preserved. Similarly, biomedical multimodal research [69] confirms that early fusion strategies (concatenation) learn joint representations directly, without relying on marginal representations. This enables the model to capture complex interactions among modalities more effectively than late fusion approaches. Regarding our anti-phishing detection scheme that combines URL, image, and text modalities, concatenation preserves all three types of information simultaneously, whereas MoE architectures risk losing critical cross-modal security indicators through selective expert activation, which explains our empirical findings.

6.3.4. Benefits of Multimodality

This sub-section addresses a critical question: does integrating multiple information sources improve phishing detection performance? To investigate this, we compare the performance of CrossPhire with that of its unimodal components, as shown in Figure 9. To ensure a fair comparison, CrossPhire is compared against its unimodal compartments: (a) GramBeddings for URLs, (b) ResNet50 for screenshots, and (c) MPNet for translated BS text contents.
At first glance, the unimodal models, the URL-only models in particular, seem to demonstrate satisfactory performance for most datasets. However, it is imperative to acknowledge that, except for Phish360, the other datasets encompass a limited collection period. The exceptional performance of URL-only detection on these datasets (94.1–99.5%) suggests these collections perhaps contain traditional or period-specific phishing attacks that exhibit distinguishable URL patterns. On the other hand, the dramatic performance drop of URL-based detection on Phish360 (82.7% vs. 99.0% on PhishIntention) illuminates a critical shift in phishing attack sophistication. The longer period for sample collection likely captures modern and newer phishing campaigns over time that may include URLs containing (but not limited to):
  • Legitimate hosting services (e.g., GitHub pages, cloud platforms) with clean URL structures;
  • URL shortening services that obfuscate the actual destination;
  • Domain-spoofing techniques that closely mimic legitimate URLs;
  • Compromised legitimate websites hosting phishing content.
This phenomenon—dataset temporal bias—can also be observed in Table 7 when the cross-dataset performance scores are investigated. Nonetheless, when Figure 9 is reinvestigated, it can be posited that enhanced accuracy can be achieved when other modalities, such as content, assume a more prominent role. From a machine learning perspective, this could indicate a more gradual shift in the language utilized by attackers. This is an expected outcome, since phishing web pages often mimic their legitimate counterparts, making the content less controllable by attackers compared to the URL. Nevertheless, neither the content nor the screenshot is immune to the perpetual evolution of phishing attacks.
From this point of view, though it brings a greater computational burden, we can conclude that multimodal anti-phishing mechanisms, when correctly implemented, are advantageous for the following reasons:
  • Reducing single-point-of-failure risk: When URL-based features become less discriminative (as in Phish360), visual and textual modalities maintain detection capability;
  • Capturing complementary attack vectors: Modern phishing often combines legitimate URLs with deceptive visual design and persuasive content;
  • Providing detection resilience: As attackers adapt to evade one modality, the system maintains effectiveness through alternative information channels.
Our experiments demonstrate that incorporating multiple data modalities enhances the accuracy of the detection process. This enhancement is most pronounced on Phish360, where the challenge of dataset temporal bias is most significant among the evaluated datasets. The extant research provides substantial support for this finding, indicating that combining distinct data modalities (hybridization) not only enhances overall detection accuracy but also yields more resilient anti-phishing solutions [13,27,30,70].

6.4. Comparative Study

In this subsection, we compare CrossPhire’s performance with that of studies presenting the aforementioned datasets. Although we initially intended to conduct a method-based comparison with these studies, we deemed it unfeasible due to the absence of a codebase for the respective works. Therefore, we decided to proceed with a dataset-based benchmarking strategy. Thus, CrossPhire was trained with the original form of the benchmarking datasets from the related studies, and the findings were reported accordingly.
As can be seen from Table 14, CrossPhire outperforms the presented approaches in terms of accuracy, except for the method proposed by Van Dooremaal et al. [37]. While their approach achieved a slightly higher accuracy of 99.66% compared to our 99.42%, it should be noted that they took 2000 samples from the dataset without disclosing their data selection criteria. As a result, their approach was validated on only 600 samples, whereas our method was trained and validated on the entire dataset, which consists of approximately 126,000 samples, as shown in Table 14.
To enrich the comparative study, we also investigated the feasibility and effectiveness of the Contrastive Language-Image Pretraining (CLIP) model by fine-tuning it for phishing detection on all our benchmarking datasets. Originally developed by Radford et al. [63], CLIP is a multimodal model that associates images with textual descriptions through joint training of a text and image encoder in a shared embedding space. Although not specifically trained on ImageNet, CLIP has demonstrated impressive zero-shot performance, contributing to its popularity among researchers.
We fine-tuned CLIP for phishing detection on the screenshots and translated BS texts from our datasets, augmenting it with a two-layer MLP of 512 neurons per layer and optimizing it with the Adam optimizer at a learning rate of 0.0001 over 15 epochs. As illustrated in Table 15, CrossPhire demonstrated better performance than CLIP across all employed datasets, with the most pronounced accuracy difference observed on the Phish360 dataset, where CLIP achieved 96.22% while CrossPhire reached 99.21%. These results demonstrate the robustness of CrossPhire, which consistently outperformed CLIP across datasets, underscoring its effectiveness in phishing web page detection.
To assess whether performance differences represent statistically significant improvements rather than random variation, we conducted two-proportion Z-tests. Importantly, our evaluation used the exact same test samples as the baseline methods, enabling direct comparison without confounding factors from different data splits. Table 16 presents the results with 95% confidence intervals.
Results reveal that among published multimodal baselines (Table 14), only PWD2016 shows statistically significant improvement (p < 0.001, 2.40% gain). The 0.14% improvement on PILWD-134K is not statistically significant (p = 0.301), while PhishIntention and VanNL126k show no significant differences from baselines (p > 0.05), with CrossPhire performing slightly worse. In contrast, comparisons with CLIP demonstrate statistically significant improvements across all five datasets (p < 0.05), with particularly strong results on PILWD-134K (p < 0.001, 1.64% gain) and VanNL126k (p < 0.001, 1.44% gain).
These results indicate that, while our domain-specific architectural design provides meaningful advantages over general-purpose multimodal models such as CLIP, it does not consistently outperform specialised prior phishing detection methods when evaluated on the same dataset. Overall, our scheme achieves results that are competitive with, and in some cases superior to, the benchmarked approaches.
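The two-proportion Z-test used above can be sketched with the standard pooled-variance formulation (the accuracies and sample sizes below are illustrative, not the paper's figures):

```python
import math

def two_proportion_z(acc1, acc2, n1, n2):
    """Two-proportion Z-test comparing accuracies acc1 and acc2
    measured on n1 and n2 test samples. Returns the z statistic and
    a 95% confidence interval for the accuracy difference."""
    x1, x2 = acc1 * n1, acc2 * n2
    p_pool = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (acc1 - acc2) / se_pool
    # Unpooled standard error for the CI of the difference
    se_diff = math.sqrt(acc1 * (1 - acc1) / n1 + acc2 * (1 - acc2) / n2)
    diff = acc1 - acc2
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return z, ci

# Illustrative comparison: 98% vs. 95% accuracy on 2000-sample test sets
z, ci = two_proportion_z(0.98, 0.95, 2000, 2000)
assert z > 1.96          # significant at the 5% level
assert ci[0] > 0         # the CI excludes zero
```

Because both systems are scored on the same test samples, the n values coincide and the test reduces to comparing the two accuracy proportions directly.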

6.5. Robustness Against Zero-Day Attacks

The identification of phishing sites remains a critical cybersecurity challenge, particularly in the context of zero-day attacks that exploit previously unknown vulnerabilities. Conventional identification schemes that rely on blacklists or static features have proven ineffective against sophisticated phishing campaigns. These campaigns employ various techniques to evade detection, including (a) dynamic HTML content loading, (b) URL generation algorithms, and (c) compromised domains. For instance, attackers frequently register slight modifications of legitimate domain names, a practice known as typosquatting, whereby similar-looking characters are substituted (e.g., replacing “o” with “0”). Another prevalent technique is the implementation of redirect chains through compromised legitimate web pages. Attacker trends also change continually over time. These evolving tactics underscore the necessity for robust, adaptive detection systems that remain effective over time.
To assess the robustness of our approach against zero-day attacks, we initiated a comprehensive data collection effort in February 2025. In this context, ‘zero-day’ refers to the temporal gap between the training data (2020–2024 for Phish360) and the testing data (February 2025). This simulates real-world scenarios in which models encounter phishing campaigns with novel tactics that were not seen during training. The dataset was created by randomly sampling 3012 web pages in February 2025. Phishing samples were sourced from PhishTank’s verified submissions, while legitimate samples were randomly selected from active websites (including authentication pages to increase the difficulty level). To ensure uniqueness and novelty, we implemented a rigorous two-stage filtering process. First, we performed intra-dataset deduplication using the Duplicate Image Finder tool to identify exact screenshot matches, as well as conducting a domain-level analysis to eliminate URL duplicates within the 3012 samples. Second, we conducted cross-dataset novelty verification by comparing the remaining samples against the entire Phish360 dataset. This involved URL similarity matching, screenshot comparison and domain overlap analysis, with the aim of excluding any samples that were present in, or structurally similar to, the training data. This process reduced the collection to 1080 unique samples (540 legitimate and 540 phishing), which were then processed using our PhishBoring application to capture all three modalities (URL, HTML and a screenshot at a resolution of 1280 × 960).
According to the results, CrossPhire trained on PILWD-134K performs on par with CLIP, with an accuracy of 88%. Similarly, our model trained on PhishIntention achieves an accuracy of 85.44%, with CLIP outperforming it in almost all epochs. Phish360, the most recent dataset, makes our model the most robust against the zero-day attack dataset, with a maximum accuracy of 94.51%. In contrast, our model trained on VanNL126K surprisingly lags behind both the CLIP and ResNet50 models, achieving a maximum accuracy of 78.1%.
Inspection of the results led to a number of observations. First, as expected, we observe that the similarity between the distributions of training and test samples plays a crucial role in prediction performance. The collection period of the Phish360 dataset ended in 2024, whereas PILWD-134K and PhishIntention cover the period from 2019 to 2020. Furthermore, the samples in VanNL126K cover a very short period, from September to December 2019. Because of this, we observe the dramatic effect of the historical difference between data samples—dataset temporal bias. Second, CrossPhire utilizes three sources of information, while CLIP benefits from two, and ResNet50 uses only screenshots. It can be concluded that the predictive capability obtained by concatenating embeddings from three different modalities requires the continual introduction of new samples to keep the models robust. The well-known domain-shift problem manifests in this context, as attackers identify novel evasion techniques and exhibit emerging trends. Consequently, the enhanced predictive capability stemming from modality richness requires novel trends to be introduced into the model regularly, resulting in a trade-off between modality richness and update interval.

6.6. Handling Missing Modality Problem

The inherent vulnerability of concatenation-based multimodal architectures to missing or corrupted input modalities requires regularization techniques that promote robust feature learning across individual modality pathways. Recent literature has demonstrated that modality dropout serves as a regularization mechanism and a technique that enhances robustness. It prevents models from overreliance on specific modality combinations that may not generalize to real-world deployment scenarios [71,72]. Additionally, missing modality training has been shown to create ensemble-like internal effects, whereby models learn multiple expert pathways that can make informed decisions with different modality combinations [73]. Given these theoretical foundations and the practical necessity of handling incomplete multimodal data in phishing detection scenarios, we implemented a structured modality dropout approach operating during both the training and validation phases to enhance model robustness and generalization capability.
As detailed in Algorithm 1, our modality dropout algorithm operates on three modalities: content, URL, and visual. To ensure independence between training and validation patterns, we used different random seeds (the base seed for training, the base seed + 1000 for validation) while maintaining reproducibility across experimental runs. Using a seeded random number generator (RNG), the algorithm pre-generates reproducible missing-modality patterns, selecting a specified percentage of training and/or testing samples, which we call the Missing Modality Sampling Ratio (MMSR), ranging from 0% to 50%. For instance, if the MMSR is set to 50% for the test data, half of the test samples are affected. For each affected sample, the algorithm randomly selects one of the following six combinations: (1) keep screenshot + URL (drop content), (2) keep content + URL (drop screenshot), (3) keep content + screenshot (drop URL), (4) keep URL only (drop content + screenshot), (5) keep screenshot only (drop content + URL), and (6) keep content only (drop screenshot + URL). The masking is implemented by zeroing out the corresponding feature vectors at the input level, effectively simulating real-world scenarios where certain data sources may be corrupted, unavailable, or unreliable. Available modalities retain their original representations. This approach ensures consistent missing patterns across training epochs while maintaining the architectural integrity of the multimodal fusion layer.
Algorithm 1 Modality Dropout Training and Validation
 1: Input: Dataset D = {(x_i^sshot, x_i^content, x_i^URL, y_i)}_{i=1}^N
 2: Input: Missing rate ρ ∈ [0, 0.5], seed s
 3: Input: Dataset type type ∈ {train, validation}
 4: seed_eff ← s + 1000 · 𝟙[type = validation]
 5: Initialize RNG with seed_eff
 6: n_missing ← ⌊ρ · N⌋
 7: I_missing ← RandomChoice({1, 2, …, N}, n_missing, replace = False)
 8: Define modality combinations:
 9: C = {sshot_content, URL_content, URL_sshot, content_only, sshot_only, URL_only}
10: for i = 1 to N do
11:     if i ∈ I_missing then
12:         c ← RandomChoice(C)
13:         Apply masking based on combination c:
14:         if c drops URL then
15:             x_i^URL ← 0
16:         end if
17:         if c drops screenshot then
18:             x_i^sshot ← 0
19:         end if
20:         if c drops content then
21:             x_i^content ← 0
22:         end if
23:     end if
24: end for
25: Return: Modified dataset with modality dropout applied
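Algorithm 1 can be sketched in Python as follows (feature vectors are plain lists for brevity; the real pipeline operates on the modality encoders' embedding tensors):

```python
import random

# Modalities to KEEP for an affected sample -- the six combinations
# enumerated in Algorithm 1 (everything not listed is zeroed out).
COMBINATIONS = [
    ("sshot", "content"), ("url", "content"), ("url", "sshot"),
    ("content",), ("sshot",), ("url",),
]

def modality_dropout(dataset, rho, seed, split="train"):
    """Zero out modalities for a rho fraction of samples. The
    validation split uses an offset seed (+1000) so that its missing
    pattern is independent of, but as reproducible as, training's."""
    rng = random.Random(seed + (1000 if split == "validation" else 0))
    n_missing = int(rho * len(dataset))
    missing_idx = set(rng.sample(range(len(dataset)), n_missing))
    for i, sample in enumerate(dataset):
        if i in missing_idx:
            keep = rng.choice(COMBINATIONS)
            for mod in ("url", "sshot", "content"):
                if mod not in keep:
                    sample[mod] = [0.0] * len(sample[mod])  # mask vector
    return dataset

data = [{"url": [1.0] * 4, "sshot": [1.0] * 4, "content": [1.0] * 4}
        for _ in range(100)]
out = modality_dropout(data, rho=0.25, seed=42)  # MMSR = 25%
masked = sum(any(all(v == 0.0 for v in s[m]) for m in s) for s in out)
assert masked == 25  # exactly a rho fraction of samples is affected
```

Because the affected indices and combinations are drawn from a seeded RNG, the same missing pattern is reproduced across epochs, as the text requires.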
Table 17 introduces experiments with three different target datasets. Note that the first row of each group in Table 17 shows the performance drop when the test data has missing values. As can be seen from Table 17, our empirical validation reveals significant benefits from applying modality dropout in two different configurations. First, modality dropout applied to the training data demonstrates remarkable recovery capabilities under missing-modality conditions. These models maintain substantially higher performance levels than their baseline counterparts (first rows) across the three tested MMSR values (10%: light, 25%: moderate, and 50%: severe); the higher the MMSR, the better the generalization obtained. Second, models trained with an MMSR of 10% or 25% consistently outperformed their original baselines even when all modalities were available, achieving an accuracy improvement of 0.36% (e.g., from 97.71% to 98.32%). This phenomenon aligns with the theoretical expectation that modality dropout serves as an effective regularization mechanism, preventing overfitting to spurious intermodal correlations while encouraging more generalizable feature representations within each modality pathway. This improvement suggests that forcing individual modalities to become more discriminative through periodic isolation results in stronger collaborative decision-making than when all modalities are always present. In effect, this creates an ensemble-like internal architecture in which multiple expert pathways contribute to the final prediction.

6.7. Explainability

In response to the urgent demand for transparent decision-making in cybersecurity applications, we developed a comprehensive Local Interpretable Model-Agnostic Explanations (LIME) framework [74] tailored to our multimodal anti-phishing neural network, which fuses different information sources via concatenation. Our approach employs a hierarchical explanation method operating at two levels. First, we quantify the contribution of each input modality (URL, visual screenshot, and HTML content) by replacing modalities with neutral counterparts and measuring the resulting prediction variance. Second, we generate fine-grained explanations within modalities using tailored perturbation strategies. For URL analysis, we apply character-level perturbations through random alphanumeric substitution at 1–5 positions per URL to identify suspicious character patterns and domain components. For visual analysis, we use grid-based segmentation (16 × 16 spatial partitioning with 256 segments) to highlight image regions that contribute to phishing classification, such as deceptive login forms or fraudulent logos. For textual content, we use token-level masking to identify semantically important phrases that trigger phishing detection. Each explanation method generates modality-specific perturbation sets: 120 perturbations for URL and content analysis, and 40 perturbations for image analysis, balancing computational efficiency with explanation quality. Each method then fits a local linear surrogate model to approximate the complex neural network's decision boundary and extracts feature importance coefficients.
Technically speaking, our LIME implementation employs a hierarchical perturbation-based explanation framework that operates through local linear approximation around individual instances. Given an input instance x = (x_url, x_img, x_content) and a complex multimodal classifier f(x) ∈ [0, 1], we first quantify modality-level contributions by systematic ablation: I_m = |f(x) − f(x^(m→neutral))|, where x^(m→neutral) represents the instance with modality m replaced by a domain-neutral counterpart (neutral URL: "https://www.example.com/index.html"; neutral content: generic business text; neutral image: synthetic white background with generic web elements). For intra-modality explanations, we generate N perturbations z_i ∈ Z around x using modality-specific strategies: (1) URL perturbations via character-level random substitution, where random.randint(1, min(5, |url|)) positions are modified with alphanumeric replacements; (2) image perturbations through grid-based segmentation using 16 × 16 spatial partitioning (grid size of 14 pixels, yielding ∼256 segments) with binary occlusion masks m_i ∈ {0, 1}^256; (3) content perturbations via LIME's default token-level masking strategy. For each modality, we fit a local linear model g(z) = w^T z + b using scikit-learn's LinearRegression, minimizing the locality-weighted loss L(z, f, g) = Σ_{i=1}^{N} π(x, z_i) (f(z_i) − g(z_i))^2, where π(x, z_i) denotes LIME's exponential locality kernel. The image prediction function returns class probability vectors p_i = [p_legitimate, p_phishing] ∈ R^2 to satisfy LIME's binary classification requirements, while URL and content explanations utilize N = 120 perturbations each with feature selection limited to the k = 8 most influential components, and image explanations use N = 40 perturbations for computational efficiency.
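The two levels described above, the modality-level ablation score I_m and the locality-weighted surrogate fit, can be sketched as follows (a minimal numpy sketch: the toy classifiers, kernel width, and binary mask encoding are illustrative stand-ins; the paper's implementation uses scikit-learn's LinearRegression and LIME's own kernel):

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_importance(f, x, neutrals):
    """I_m = |f(x) - f(x with modality m swapped for its neutral counterpart)|."""
    base = f(x)
    return {m: abs(base - f({**x, m: neutral})) for m, neutral in neutrals.items()}

def lime_surrogate(f, x_vec, n_pert=120, kernel_width=0.75):
    """Fit a locality-weighted linear surrogate g(z) = w^T z + b around x_vec."""
    d = len(x_vec)
    Z = rng.integers(0, 2, size=(n_pert, d)).astype(float)   # 1 = keep, 0 = mask
    preds = np.array([f(x_vec * z) for z in Z])
    dists = np.linalg.norm(1.0 - Z, axis=1)                  # distance to original
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)      # exponential kernel pi
    # weighted least squares: minimize sum_i pi_i * (f(z_i) - w^T z_i - b)^2
    A = np.hstack([Z, np.ones((n_pert, 1))]) * np.sqrt(weights)[:, None]
    y = preds * np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]   # (w, b); positive w_j supports "phishing"

# toy classifier linear in its masked features: surrogate should recover the signs
true_w = np.array([2.0, -1.0, 0.5, 0.0])
w, b = lime_surrogate(lambda v: float(v @ true_w), np.ones(4), n_pert=120)
```

Because the toy classifier is exactly linear in the mask, the surrogate recovers its coefficients; against the real network, the fit is only a local approximation around the explained instance.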
The final explanation coefficient vector w provides feature importance rankings, where positive weights indicate phishing-supportive features and negative weights indicate legitimacy-supportive features, enabling practitioners to identify specific visual regions (via grid segment masks), URL character positions (via perturbation vectors), and content phrases (via token importance scores) that drive classification decisions.
Based on the techniques described above, we provide a GUI-based Python application, shown in Figure 10. This tool enables cybersecurity practitioners to understand the specific evidence within each input modality that drove the final classification, facilitating trust, validation, and actionable threat intelligence in real-world phishing detection scenarios.

6.8. Runtime Analysis

The comprehensive experiments have shown the effectiveness of this approach. Moreover, excluding the duration of data preparation (i.e., content extraction via BeautifulSoup and screenshot capture), CrossPhire runs in real time at 0.08 s per inference on a computer equipped with an NVIDIA RTX 3080 Ti mobile GPU with 16 GB of memory, a 12th-generation Intel i9 CPU, and 32 GB of system memory. Data preparation for the mentioned processes takes around 1.5 s in a Google Chrome browser engine.
To assess the engineering feasibility of CrossPhire for real-world deployment, we conducted comprehensive runtime profiling on an NVIDIA RTX 3080 Ti mobile GPU (16 GB VRAM) with Intel i9-12900H CPU and 32 GB system RAM. Table 18 presents the component-wise latency analysis for single-sample inference. The results reveal that data preparation (1.5 s, 95% of total time) dominates the computational pipeline, while neural network inference requires only 80 ms (5%). Specifically, screenshot capture via Selenium WebDriver accounts for 75.9% of total latency, followed by HTML retrieval and BeautifulSoup parsing (19.0%). Among the neural components, ResNet50 vision encoding (45 ms) represents the most computationally intensive operation, whereas GramBeddings URL analysis (12 ms) and MPNet content encoding (18 ms) impose negligible overhead. This bottleneck is inherent to any multimodal phishing detection system requiring web page rendering and cannot be eliminated through model optimization alone. However, the neural inference component can be substantially accelerated through batch processing and quantization techniques.
For server-side deployment scenarios where URLs are processed asynchronously, batch processing significantly improves throughput. Table 19 demonstrates the scalability characteristics of CrossPhire's neural inference pipeline across varying batch sizes. With a batch size of 32, the system achieves 55.2 samples per second on a single GPU, translating to a theoretical capacity of 4.76 million classifications per day. GPU memory consumption scales linearly from 2.1 GB (batch = 1) to 12.4 GB (batch = 32), remaining well within the constraints of modern consumer-grade GPUs. The model's memory footprint is 144 MB in FP32 format.
Although our current implementation achieves real-time performance at 0.08 s per inference, this is a non-optimized baseline that could be significantly improved with modern acceleration techniques. TensorRT optimization with INT8 post-training quantization would reduce inference time to an estimated 0.02–0.05 s while maintaining 98–99% of the original model's accuracy. Quantization would also shrink our 36M-parameter model from 144 MB to approximately 36 MB, making deployment on modern smartphones feasible.
A critical dependency in our current implementation is the Google Translate API for processing non-English content with the monolingual MPNet encoder. Translation requests introduce an additional 150–300 ms network latency and incur costs of approximately $20 per million characters ($0.50 per 1000 translated web pages). To mitigate this dependency, we offer two alternatives: (1) the multilingual XLM-RoBERTa encoder eliminates translation requirements at the cost of 2.7% average accuracy reduction and (2) language detection with selective translation processes only non-English content, reducing translation volume by approximately 62% based on our dataset language distribution.
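The selective-translation alternative in (2) amounts to a simple gate in front of the translation call. The sketch below is illustrative: `looks_english` is a crude ASCII-ratio placeholder for a real language detector (e.g., langdetect), and `translate` is a stub standing in for the Google Translate API call; neither is part of the paper's implementation.

```python
def looks_english(text, ascii_threshold=0.95):
    """Crude ASCII-ratio stand-in for a real language detector (e.g., langdetect)."""
    if not text:
        return True
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= ascii_threshold

def prepare_content(text, translate):
    """Call the (paid) translation backend only for non-English-looking content."""
    return text if looks_english(text) else translate(text)

# usage: the stub below records how often translation would be billed
calls = []
def fake_translate(t):
    calls.append(t)
    return "<translated>"

english = prepare_content("Sign in to your account", fake_translate)
spanish = prepare_content("Código de verificación enviado", fake_translate)
```

With such a gate, only content flagged as non-English incurs translation latency and cost, which is how the reported ~62% reduction in translation volume would be realized on a mostly English corpus.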

7. Discussion

While we have taken extensive measures to ensure sample quality and diversity (Section 4.3), several methodological choices introduce potential biases that users of this dataset should consider. Our phishing samples were collected from PhishTank and OpenPhish, which are user-reported platforms that may overrepresent easily detected phishing campaigns. For legitimate samples, we used Alexa's top 100 websites as seed URLs for random crawling, accessing over 50,000 web pages by following links from these initial seeds. While this seeding approach reduces direct bias toward only popular websites, it may still underrepresent certain categories of legitimate sites, particularly small business websites, regional services, and non-English content that phishing attacks often target. Additionally, our screenshot rendering configuration at 1280 × 960 resolution represents desktop-oriented phishing and may not fully capture mobile phishing attacks, which constitute an increasing proportion of real-world threats. The weekly collection schedule over 2020–2024 provides substantial temporal diversity but may introduce seasonal biases or miss short-lived phishing campaigns. These biases reflect inherent trade-offs in dataset construction between quality, diversity, and resource constraints. We argue that transparent documentation of these limitations, combined with our extensive comparative analysis demonstrating Phish360's superior uniqueness and diversity metrics (Section 4.3), enables informed usage while still representing a substantial improvement over existing benchmarks.
Our empirical evaluation reveals that simple embedding concatenation consistently outperforms mixture-of-experts (MoE) approaches across all experimental configurations (see Table 12 and Table 13). This finding can be attributed to the fundamental difference in information processing: concatenation preserves complete multimodal information in the joint feature space, while MoE's gated routing creates information bottlenecks through selective expert activation. In anti-phishing detection, where subtle cross-modal dependencies between URL, visual, and textual features are critical, the complete information preservation offered by concatenation proves more effective than the specialized but potentially incomplete representations generated by expert routing mechanisms.
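The contrast between the two fusion strategies is easy to make concrete. In the numpy sketch below (dimensions and weights are illustrative, not the paper's), concatenation hands every modality dimension to the classifier head, whereas an MoE-style softmax gate collapses the three expert outputs into one convex combination:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative per-modality embeddings (the real encoders produce larger vectors)
z_url, z_img, z_txt = (rng.normal(size=4) for _ in range(3))

# Concatenation: the head sees all 12 dimensions directly, so cross-modal
# interactions can be learned over the full joint feature space.
z_concat = np.concatenate([z_url, z_img, z_txt])

# MoE-style gating: a softmax gate mixes the experts, so the head sees only
# a 4-dimensional convex combination -- a potential information bottleneck.
gate_logits = rng.normal(size=3)
gate = np.exp(gate_logits) / np.exp(gate_logits).sum()   # softmax over experts
z_moe = gate[0] * z_url + gate[1] * z_img + gate[2] * z_txt
```

The dimensionality alone shows the asymmetry: the gated representation can never expose more information to the head than a single expert's output space, while concatenation exposes all of them at once.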
One might argue that there is a lack of cross-dataset performance evaluations in pure vision models. While we acknowledge this limitation, we would argue that the absence of such evaluations does not undermine our core conclusions, for several reasons. Firstly, both ResNet50 and DenseNet121 are pre-trained on ImageNet and have well-documented, well-established cross-domain generalization properties [55,56]. Our fine-tuning approach inherits these robust transfer learning capabilities, particularly with regard to extracting universal visual features (such as logos, form layouts, and UI components), which remain consistent across phishing datasets collected from different periods and sources. Secondly, as shown in Table 8, both architectures achieve remarkably similar performance across all five temporally and geographically diverse datasets.
The standard training scheme we applied has some shortcomings, such as an inability to handle missing modalities. Owing to attackers' obfuscation tactics, some phishing web pages cannot be parsed to obtain useful semantic content; this often occurs when attackers replace text regions with images. We therefore applied the modality dropout technique to mitigate this problem and increase robustness. This training scheme yields empirical improvements of up to 7.6% in model robustness and provides some resilience against missing modalities. Nevertheless, the technique has limitations too. Modality dropout assumes random missingness patterns during training, which may not reflect the systematic or adversarial nature of missing modalities in actual phishing attacks, where attackers deliberately obscure specific information types. In addition, the stochastic nature of modality dropout during training may lead to inconsistent convergence and requires longer training periods to achieve stable performance across all possible modality combinations.
The proposed system does not rely on third-party features such as domain checking. Although such additional and possibly useful features are available, we avoided them to achieve a real-time, self-contained, end-to-end neural architecture. Further, our initial idea was to rely purely on a multilingual sentence transformer such as XLM-RoBERTa for the sake of simplicity and speed. However, the resulting loss of accuracy led us to integrate an online translation API such as Google Translate, making the use of a third-party service indispensable. This decline highlights the main problem of multilingual sentence transformers: the scarcity of training data in non-English languages. Although much of today's software operates within a Software-as-a-Service (SaaS) ecosystem, relying on a third-party system inevitably causes short delays and makes our solution vulnerable to API failures. We therefore believe that CrossPhire can be equipped with a better offline multilingual sentence transformer in the future. At this point, one may question the use of current open- or closed-source state-of-the-art large language models (LLMs) such as Anthropic's Claude, OpenAI's GPT, or Meta's Llama models. However, these models require high-end GPUs or incur token-based costs, which in turn result in additional expense when analyzing millions of suspicious web pages. Conversely, our objective is to provide a cost-effective (and potentially cost-free) approach that is replicable and can be utilized by researchers, students, and industry.
In this study, the superior performance of vector concatenation over the mixture of experts can be attributed to the unique characteristics of the anti-phishing problem domain. The three modalities employed in this work exhibit high complementarity rather than redundancy. Each modality provides distinct, non-overlapping information that is crucial for accurate classification. URL features reveal technical deception indicators, such as suspicious domains and redirects; visual features capture design mimicry and visual social engineering tactics; and textual content exposes linguistic manipulation and semantic deception strategies. Effective phishing detection requires simultaneous access to all three modalities to identify subtle cross-modal dependencies. For example, legitimate-looking visual designs might be paired with suspicious URL patterns, or trustworthy textual content might mask underlying technical vulnerabilities. Concatenation-based joint learning preserves complete information flow across all three modalities. This enables the model to learn critical cross-modal interactions. In contrast, the expert routing mechanism of MoE may inadvertently create information bottlenecks by specializing in individual modalities. This could cause it to miss essential inter-modal security indicators. These findings suggest that architectural simplicity through complete information preservation may be more effective than complex routing mechanisms designed for scenarios with redundant or competing modalities in security domains where multiple information sources are highly complementary and all contribute unique evidence for threat detection.
Although CrossPhire achieves high accuracy on the same dataset (97.96–100%), we caution against interpreting these results as evidence that the task is trivial. Our cross-dataset experiments (see Table 11) reveal a 30–51% drop in accuracy under distribution shift, and our zero-day evaluations (Section 6.5) demonstrate a 10–16% degradation, showing that a genuine challenge remains in realistic deployment scenarios. The high same-dataset scores partially reflect the dataset-specific biases documented in Section 4.3, such as URL length artifacts in PWD2016, temporal homogeneity in datasets with a short collection window, and HTML duplication enabling memorization. We emphasize that cross-dataset and zero-day performance metrics are more reliable indicators of real-world effectiveness than same-dataset scores. The strength of the multimodal architecture lies not in solving an easy problem, but in maintaining robustness when individual modalities become unreliable, a critical requirement for adversarial domains such as phishing detection, where attackers continuously adapt.
It is well known that phishers evolve their techniques over time, resulting in zero-day attacks. There are many ML-based anti-phishing studies in the literature, using various features. As reported by Ariyadasa et al. [75], the main shortcoming of these methods is their inability to remain robust against these new techniques. In this study, we also aimed to solve this problem by cross-firing with three modalities. An evaluation of the zero-day results reveals that CrossPhire demonstrates good to moderate performance on the zero-day attack dataset. Nevertheless, we must acknowledge that continual data updates remain imperative. The performance scores obtained with the latest Phish360 and the largest PILWD-134K datasets clearly demonstrate this phenomenon. As a solution, Ariyadasa et al. [75] proposed the integration of deep learning with reinforcement learning to keep the system on the safe side. However, their model still requires constant updates from known resources such as PhishTank. Ideally, the ultimate goal of the anti-phishing community is to invent a mechanism that remains robust even after several years without retraining. In this regard, we posit that self-supervised learning has the potential to contribute to the problem domain in the future. For instance, vision transformers based on Masked Siamese Networks acquire useful representations through contrastive learning, applying controllable levels of perturbation to images and thereby obviating the need for reconstruction errors. We contend that, if implemented properly in the context of URLs, textual content, or images, this concept can prove beneficial in emulating adversarial attacks and future trends, reducing the need for the most recent data to some extent.

8. Conclusions

In this study, a new neural network was designed and implemented to classify phishing and legitimate web pages efficiently and effectively. The method was developed using three modalities: text, URL, and web page screenshots. In our evaluations, the proposed model demonstrated superior performance compared with other studies. Extensive experimentation on multiple datasets has demonstrated the efficacy of the multimodal perspective in identifying phishing websites. Additionally, the semantics of the primary content of web pages were found to be discriminative in this problem domain. The experiments conducted on zero-day attack datasets underscore the importance of diversity and the incorporation of contemporary trends when training machine learning models, thereby ensuring their resilience. Although the proposed model produces advantageous prediction results in zero-day benchmarks, we hypothesize that incorporating self-supervised methods involving contrastive learning, in conjunction with minor perturbations in both text and image space, has the potential to further enhance the model's generalizability.

Author Contributions

Conceptualization, A.S.B.; methodology, A.H.A.A. and A.S.B.; software, A.H.A.A. and A.S.B.; validation, A.S.B.; formal analysis, A.S.B.; investigation, A.S.B.; resources, A.H.A.A. and A.S.B.; data curation, A.S.B.; writing—original draft preparation, A.H.A.A. and A.S.B.; writing—review and editing, A.S.B.; visualization, A.H.A.A. and A.S.B.; supervision, A.S.B.; project administration, A.S.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data and implementation presented in the study are publicly available at https://web.cs.hacettepe.edu.tr/~selman/phish360-dataset/.

Acknowledgments

During the preparation of this manuscript/study, the author(s) used DeepL Write and Claude 4.5 for language editing and code debugging, respectively. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BS	BeautifulSoup
CLIP	Contrastive Language-Image Pre-Training
GUI	Graphical User Interface
HTML	Hypertext Markup Language
MoE	Mixture of Experts
URL	Uniform Resource Locator
TF	Trafilatura

References

  1. APWG. Phishing Activity Trends Report, 3rd Quarter 2024. 2024. Available online: https://apwg.org/trendsreports/ (accessed on 5 December 2024).
  2. Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357. [Google Scholar] [CrossRef]
  3. Bozkir, A.S.; Dalgic, F.C.; Aydos, M. GramBeddings: A new neural network for URL based identification of phishing web pages through n-gram embeddings. Comput. Secur. 2023, 124, 102964. [Google Scholar] [CrossRef]
  4. Bozkir, A.S.; Aydos, M. Local image descriptor based phishing web page recognition as an open-set problem. Avrupa Bilim ve Teknoloji Dergisi 2019, 444–451. [Google Scholar] [CrossRef]
  5. Hou, Y.T.; Chang, Y.; Chen, T.; Laih, C.S.; Chen, C.M. Malicious web content detection by machine learning. Expert Syst. Appl. 2010, 37, 55–60. [Google Scholar] [CrossRef]
  6. Xiang, G.; Hong, J.; Rose, C.P.; Cranor, L. CANTINA+: A feature-rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. (TISSEC) 2011, 14, 1–28. [Google Scholar] [CrossRef]
  7. Lee, J.; Ye, P.; Liu, R.; Divakaran, D.M.; Chan, M.C. Building robust phishing detection system: An empirical analysis. In Proceedings of the Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb), San Diego, CA, USA, 23 February 2020; pp. 1–12. [Google Scholar] [CrossRef]
  8. Parcalabescu, L.; Trost, N.; Frank, A. What is multimodality? arXiv 2021, arXiv:2103.06304. [Google Scholar] [CrossRef]
  9. Lee, J.; Lim, P.; Hooi, B.; Divakaran, D.M. Multimodal Large Language Models for Phishing Webpage Detection and Identification. arXiv 2024, arXiv:2408.05941. [Google Scholar] [CrossRef]
  10. Wang, Y.; Ma, W.; Xu, H.; Liu, Y.; Yin, P. A lightweight multi-view learning approach for phishing attack detection using transformer with mixture of experts. Appl. Sci. 2023, 13, 7429. [Google Scholar] [CrossRef]
  11. Rao, R.S.; Pais, A.R. Jail-Phish: An improved search engine based phishing detection system. Comput. Secur. 2019, 83, 246–267. [Google Scholar] [CrossRef]
  12. Lin, S.C.; Wl, P.C.; Chen, H.Y.; Morikawa, T.; Takahashi, T.; Lin, T.N. Senseinput: An image-based sensitive input detection scheme for phishing website detection. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 4180–4186. [Google Scholar] [CrossRef]
  13. Yu, S.; An, C.; Yu, T.; Zhao, Z.; Li, T.; Wang, J. Phishing Detection Based on Multi-Feature Neural Network. In Proceedings of the 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC), Austin, TX, USA, 11–13 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 73–79. [Google Scholar] [CrossRef]
  14. Tan, C.C.L.; Chiew, K.L.; Yong, K.S.; Sebastian, Y.; Than, J.C.M.; Tiong, W.K. Hybrid phishing detection using joint visual and textual identity. Expert Syst. Appl. 2023, 220, 119723. [Google Scholar] [CrossRef]
  15. Zhou, S.; Ruan, L.; Xu, Q.; Chen, M. Multimodal fraudulent website identification method based on heterogeneous model ensemble. China Commun. 2023, 20, 263–274. [Google Scholar] [CrossRef]
  16. Opara, C.; Wei, B.; Chen, Y. HTMLPhish: Enabling phishing web page detection by applying deep learning techniques on HTML analysis. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–8. [Google Scholar] [CrossRef]
  17. Ouyang, L.; Zhang, Y. Phishing web page detection with html-level graph neural network. In Proceedings of the 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Shenyang, China, 20–22 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 952–958. [Google Scholar] [CrossRef]
  18. Çolhak, F.; Ecevit, M.İ.; Uçar, B.E.; Creutzburg, R.; Dağ, H. Phishing Website Detection through Multi-Model Analysis of HTML Content. arXiv 2024, arXiv:2401.04820. [Google Scholar] [CrossRef]
  19. Zhang, Y.; Hong, J.I.; Cranor, L.F. Cantina: A content-based approach to detecting phishing web sites. In Proceedings of the 16th international conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 639–648. [Google Scholar] [CrossRef]
  20. Korkmaz, M.; Sahingoz, O.K.; Diri, B. Detection of phishing websites by using machine learning-based URL analysis. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar] [CrossRef]
  21. Butnaru, A.; Mylonas, A.; Pitropakis, N. Towards lightweight url-based phishing detection. Future Internet 2021, 13, 154. [Google Scholar] [CrossRef]
  22. Rao, R.S.; Vaishnavi, T.; Pais, A.R. CatchPhish: Detection of phishing websites by inspecting URLs. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 813–825. [Google Scholar] [CrossRef]
  23. Sánchez-Paniagua, M.; Fernández, E.F.; Alegre, E.; Al-Nabki, W.; Gonzalez-Castro, V. Phishing URL detection: A real-case scenario through login URLs. IEEE Access 2022, 10, 42949–42960. [Google Scholar] [CrossRef]
  24. Jalil, S.; Usman, M.; Fong, A. Highly accurate phishing URL detection based on machine learning. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 9233–9251. [Google Scholar] [CrossRef]
  25. Jishnu, K.; Arthi, B. Enhanced Phishing URL Detection Using Leveraging BERT with Additional URL Feature Extraction. In Proceedings of the 2023 5th International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 31 July–2 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1745–1750. [Google Scholar] [CrossRef]
  26. Jishnu, K.; Arthi, B. Phishing URL detection by leveraging RoBERTa for feature extraction and LSTM for classification. In Proceedings of the 2023 Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), Trichy, India, 23–25 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 972–977. [Google Scholar] [CrossRef]
  27. Haynes, K.; Shirazi, H.; Ray, I. Lightweight URL-based phishing detection using natural language processing transformers for mobile devices. Procedia Comput. Sci. 2021, 191, 127–134. [Google Scholar] [CrossRef]
  28. Shirazi, H.; Hayne, K. Towards performance of NLP transformers on URL-based phishing detection for mobile devices. Int. J. Ubiquitous Syst. Pervasive Netw. 2022, 17, 34–42. [Google Scholar] [CrossRef]
  29. Asif, A.U.Z.; Shirazi, H.; Ray, I. Machine Learning-Based Phishing Detection Using URL Features: A Comprehensive Review. In Proceedings of the International Symposium on Stabilizing, Safety, and Security of Distributed Systems, Jersey City, NJ, USA, 2–4 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 481–497. [Google Scholar] [CrossRef]
  30. Opara, C.; Chen, Y.; Wei, B. Look before You leap: Detecting phishing web pages by exploiting raw URL And HTML characteristics. Expert Syst. Appl. 2024, 236, 121183. [Google Scholar] [CrossRef]
  31. Benavides-Astudillo, E.; Fuertes, W.; Sanchez-Gordon, S.; Nuñez-Agurto, D.; Rodríguez-Galán, G. A Phishing-Attack-Detection Model Using Natural Language Processing and Deep Learning. Appl. Sci. 2023, 13, 5275. [Google Scholar] [CrossRef]
  32. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  33. Phoka, T.; Suthaphan, P. Image based phishing detection using transfer learning. In Proceedings of the 2019 11th International Conference on Knowledge and Smart Technology (KST), Phuket, Thailand, 23–26 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 232–237. [Google Scholar] [CrossRef]
  34. Bozkir, A.S.; Aydos, M. LogoSENSE: A companion HOG based logo detection scheme for phishing web page and E-mail brand recognition. Comput. Secur. 2020, 95, 101855. [Google Scholar] [CrossRef]
  35. Lin, Y.; Liu, R.; Divakaran, D.M.; Ng, J.Y.; Chan, Q.Z.; Lu, Y.; Si, Y.; Zhang, F.; Dong, J.S. Phishpedia: A hybrid deep learning based approach to visually identify phishing webpages. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Virtual, 11–13 August 2021; pp. 3793–3810. [Google Scholar]
  36. Wang, M.; Song, L.; Li, L.; Zhu, Y.; Li, J. Phishing webpage detection based on global and local visual similarity. Expert Syst. Appl. 2024, 252, 124120. [Google Scholar] [CrossRef]
  37. Van Dooremaal, B.; Burda, P.; Allodi, L.; Zannone, N. Combining text and visual features to improve the identification of cloned webpages for early phishing detection. In Proceedings of the 16th International Conference on Availability, Reliability and Security, Vienna, Austria, 17–20 August 2021; pp. 1–10. [Google Scholar] [CrossRef]
  38. Sánchez-Paniagua, M.; Fidalgo, E.; Alegre, E.; Alaiz-Rodríguez, R. Phishing websites detection using a novel multipurpose dataset and web technologies features. Expert Syst. Appl. 2022, 207, 118010. [Google Scholar] [CrossRef]
  39. Liu, R.; Lin, Y.; Yang, X.; Ng, S.H.; Divakaran, D.M.; Dong, J.S. Inferring phishing intention via webpage appearance and dynamics: A deep vision based approach. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; pp. 1633–1650. [Google Scholar]
  40. Vo Quang, M.; Bui Tan Hai, D.; Tran Kim Ngoc, N.; Ngo Duc Hoang, S.; Nguyen Huu, Q.; Phan The, D.; Pham, V.H. Shark-Eyes: A multimodal fusion framework for multi-view-based phishing website detection. In Proceedings of the 12th International Symposium on Information and Communication Technology, Ho Chi Minh City, Vietnam, 7–8 December 2023; pp. 793–800. [Google Scholar] [CrossRef]
  41. Tong, X.; Jin, B.; Wang, J.; Yang, Y.; Suo, Q.; Wu, Y. MM-ConvBERT-LMS: Detecting malicious Web pages via multi-modal learning and pre-trained model. Appl. Sci. 2023, 13, 3327. [Google Scholar] [CrossRef]
  42. Tan, C.L.; Chiew, K.L.; Wong, K.; Sze, S.N. PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder. Decis. Support Syst. 2016, 88, 18–27. [Google Scholar] [CrossRef]
  43. Li, Y.; Huang, C.; Deng, S.; Lock, M.L.; Cao, T.; Oo, N.; Lim, H.W.; Hooi, B. {KnowPhish}: Large language models meet multimodal knowledge graphs for enhancing {Reference-Based} phishing detection. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 793–810. [Google Scholar]
  44. Cao, T.; Huang, C.; Li, Y.; Huilin, W.; He, A.; Oo, N.; Hooi, B. Phishagent: A robust multimodal agent for phishing webpage detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25 February–4 March 2025; Volume 39, pp. 27869–27877. [Google Scholar]
  45. Michailidou, E.; Harper, S.; Bechhofer, S. Visual complexity and aesthetic perception of web pages. In Proceedings of the 26th Annual ACM International Conference on Design of Communication, Lisbon, Portugal, 22–24 September 2008; pp. 215–224. [Google Scholar]
  46. Deng, L.; Poole, M.S. Aesthetic design of e-commerce web pages–Webpage Complexity, Order and preference. Electron. Commer. Res. Appl. 2012, 11, 420–440. [Google Scholar] [CrossRef]
  47. Chiew, K.L.; Chang, E.H.; Tan, C.L.; Abdullah, J.; Yong, K.S.C. Building standard offline anti-phishing dataset for benchmarking. Int. J. Eng. Technol. 2018, 7, 7–14. [Google Scholar] [CrossRef]
  48. Barbaresi, A. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 1–6 August 2021; pp. 122–131. [Google Scholar] [CrossRef]
  49. BS4. BeautifulSoup Library. Available online: https://pypi.org/project/beautifulsoup4/ (accessed on 5 December 2025).
  50. Swartz, A. html2text Documentation—pypi.org. 2020. Available online: https://pypi.org/project/html2text/ (accessed on 5 October 2023).
  51. lxml dev team. lxml Documentation—pypi.org. 2023. Available online: https://pypi.org/project/lxml/ (accessed on 5 October 2023).
  52. Lopukhin, K. html-text Documentation—pypi.org. 2020. Available online: https://pypi.org/project/html-text/ (accessed on 5 October 2023).
  53. Le, H.; Pham, Q.; Sahoo, D.; Hoi, S.C. URLNet: Learning a URL representation with deep learning for malicious URL detection. arXiv 2018, arXiv:1802.03162. [Google Scholar] [CrossRef]
  54. Maneriker, P.; Stokes, J.W.; Lazo, E.G.; Carutasu, D.; Tajaddodianfar, F.; Gururajan, A. URLTran: Improving phishing URL detection using transformers. In Proceedings of the MILCOM 2021-2021 IEEE Military Communications Conference (MILCOM), San Diego, CA, USA, 29 November–2 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 197–204. [Google Scholar] [CrossRef]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  56. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
  57. Ariyadasa, S.; Fernando, S.; Fernando, S. Combining long-term recurrent convolutional and graph convolutional networks to detect phishing sites using URL and HTML. IEEE Access 2022, 10, 82355–82375. [Google Scholar] [CrossRef]
  58. Aleroud, A.; Zhou, L. Phishing environments, techniques, and countermeasures: A survey. Comput. Secur. 2017, 68, 160–196. [Google Scholar] [CrossRef]
  59. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  60. Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MPNet: Masked and permuted pre-training for language understanding. Adv. Neural Inf. Process. Syst. 2020, 33, 16857–16867. [Google Scholar]
  61. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar] [CrossRef]
  62. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
  63. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  64. Adane, K.; Beyene, B. Machine learning and deep learning based phishing websites detection: The current gaps and next directions. Rev. Comput. Eng. Res. 2022, 9, 13–29. [Google Scholar] [CrossRef]
  65. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef]
  66. Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.; Chen, Z.; Le, Q.; Laudon, J. Mixture-of-Experts with Expert Choice Routing. arXiv 2022, arXiv:2202.09368. [Google Scholar] [CrossRef]
  67. Lo, K.M.; Huang, Z.; Qiu, Z.; Wang, Z.; Fu, J. A Closer Look into Mixture-of-Experts in Large Language Models. arXiv 2025, arXiv:2406.18219. [Google Scholar] [CrossRef]
  68. Peng, X.; Wei, Y.; Deng, A.; Wang, D.; Hu, D. Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8238–8247. [Google Scholar] [CrossRef]
  69. Huang, S.C.; Pareek, A.; Seyyedi, S.; Banerjee, I.; Lungren, M.P. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines. npj Digit. Med. 2020, 3, 136. [Google Scholar] [CrossRef]
  70. van Geest, R.; Cascavilla, G.; Hulstijn, J.; Zannone, N. The applicability of a hybrid framework for automated phishing detection. Comput. Secur. 2024, 139, 103736. [Google Scholar] [CrossRef]
  71. Huang, Y.; Lin, J.; Zhou, C.; Yang, H.; Huang, L. Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably). arXiv 2022, arXiv:2203.12221. [Google Scholar] [CrossRef]
  72. Hussen Abdelaziz, A.; Theobald, B.J.; Dixon, P.; Knothe, R.; Apostoloff, N.; Kajareker, S. Modality dropout for improved performance-driven talking faces. In Proceedings of the 2020 International Conference on Multimodal Interaction, Virtual, 25–29 October 2020; pp. 378–386. [Google Scholar] [CrossRef]
  73. Alfasly, S.; Lu, J.; Xu, C.; Zou, Y. Learnable irrelevant modality dropout for multimodal action recognition on modality-specific annotated videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20208–20217. [Google Scholar] [CrossRef]
  74. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  75. Ariyadasa, S.; Fernando, S.; Fernando, S. SmartiPhish: A reinforcement learning-based intelligent anti-phishing solution to detect spoofed website attacks. Int. J. Inf. Secur. 2024, 23, 1055–1076. [Google Scholar] [CrossRef]
Figure 1. The screenshot of the PhishBoring tool, specially designed and implemented for this study to visually inspect previously crawled and packed .png samples. PhishBoring also extracts the multi-part content to the selected target dataset folder with a unique identifier and brand name.
Figure 2. The folder hierarchy, folder-naming convention, and the file types in the Phish360 dataset.
Figure 3. Distribution of URL character lengths for multimodal datasets (a) PhishIntention; (b) PILWD-134K; (c) PWD2016; (d) VanNL126K; (e) Phish360. The maximum values for the x and y axes were set to 200 and 5000.
Figure 4. Comparison of the percentages of unique HTML and BeautifulSoup text (BS) for phishing (P) and legitimate (L) classes across multimodal datasets.
Figure 5. Comparison of the percentage of unique triplet representation of samples (a) (URL, BS_text, image_hash) (b) (URL, HTML, image_hash) across multimodal datasets.
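The triplet-uniqueness percentages in Figure 5 amount to counting distinct (URL, text, image-hash) combinations relative to the sample count. A minimal sketch of that computation, using toy records rather than the actual Phish360 schema:

```python
import hashlib

# Toy records standing in for dataset samples; the real pipeline reads
# the URL, BeautifulSoup text, and screenshot bytes from disk.
samples = [
    {"url": "http://a.example/login", "bs_text": "Sign in", "image": b"png-bytes-1"},
    {"url": "http://a.example/login", "bs_text": "Sign in", "image": b"png-bytes-1"},  # exact duplicate
    {"url": "http://b.example/", "bs_text": "Welcome", "image": b"png-bytes-2"},
]

def triplet_uniqueness(samples):
    # Hash the screenshot bytes so the triplet key stays compact.
    triplets = {
        (s["url"], s["bs_text"], hashlib.sha256(s["image"]).hexdigest())
        for s in samples
    }
    return 100.0 * len(triplets) / len(samples)

pct = triplet_uniqueness(samples)  # 2 unique triplets over 3 samples ≈ 66.67%
```

Applying the same counting to (URL, HTML, image_hash) keys yields the second panel of the figure.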
Figure 6. The number of languages within phishing and legitimate samples across the datasets we employed.
Figure 7. The modular overview of CrossPhire’s workflow merging three modalities.
Figure 8. Comparison of phishing detection performance using original multilingual text extracted by BeautifulSoup (BS) and Trafilatura (TF) across various classifiers.
Figure 9. An accuracy comparison of CrossPhire against its individual unimodal components across multiple benchmark datasets.
Figure 10. The CrossPhire LIME-based explainer tool shows which modality influences the final decision and highlights the important tokens in separate tabs.
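The explainer in Figure 10 builds on LIME [74], which estimates token importance by perturbing the input and observing how the prediction changes. A toy sketch of that perturbation idea, with a hypothetical keyword-weight scorer standing in for the trained multimodal model (the real tool calls the LIME library against CrossPhire itself):

```python
import re

# Hypothetical token weights; the real model's scores come from the
# trained network, not a lookup table.
suspicious_words = {"verify": 0.6, "password": 0.8, "urgent": 0.5}

def score(text):
    # Toy phishing score: sum of weights of suspicious tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(suspicious_words.get(t, 0.0) for t in tokens)

def token_importance(text):
    # LIME-style idea: importance of a token = drop in the score
    # when that token is removed from the input.
    tokens = re.findall(r"[a-z]+", text.lower())
    base = score(text)
    return {t: base - score(" ".join(x for x in tokens if x != t))
            for t in set(tokens)}

imp = token_importance("Urgent: verify your password")
```

Here "password" receives the largest importance and the filler word "your" receives none, mirroring how the tool highlights influential tokens per modality.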
Table 1. Systematic comparison of multimodal phishing detection approaches in terms of architectural design and fusion strategies. Abbreviations: ST = Sentence Transformers (MPNet/XLM-R); CNN = ResNet50/DenseNet121. Fusion strategies: Early fusion = feature-level concatenation before classification; Late fusion = decision-level combination; Model stacking = ensemble of separate models. Training: End-to-end joint = all modality encoders trained simultaneously; Multi-stage = separate training then combination. Note: CrossPhire uses Google Translate API for non-English content preprocessing during training (offline alternatives available).
Method | Modalities | Feature Extraction | Fusion Strategy | Training | Dependencies
Jail-Phish [11] | URL, CSS, JS, Image | Handcrafted (Jaccard similarity) | Feature-level | Multi-stage | Google Search
Yu et al. [13] | URL, HTML, Image | LSTM + CNN + CBAM attention | Late fusion (concatenation) | Multi-stage | None
SenseInput [12] | URL, HTML, Image | Handcrafted (9 features) | Feature-level | LightGBM | None
Tan et al. [14] | Logo, Text | Logo detection + OCR | Decision-level | Multi-stage | Domain verification
Zhou et al. [15] | URL, Text, Image | CNN + TF-IDF | Model stacking | Multi-stage | None
Lee et al. [9] | HTML, Screenshot | LLM prompts (GPT-4) | Prompt-based | Two-stage | GPT-4 API
WebPhish [30] | URL, HTML | CNN (character-level) | Early fusion (concatenation) | Joint training | None
Shark-Eyes [40] | URL, HTML DOM | ConvBERT + positional encoding | Attention-based | Joint training | None
CrossPhire (Ours) | URL, Content, Image | GramBeddings + ST + CNN | Late fusion (concatenation) | End-to-end joint | None
Table 2. Various information for the datasets employed in this study.
Dataset Name | Collection Year | Sample Size | Legitimate Data Source | Phish Data Source
PWD2016 | Mar–Apr 2016 | Phish: 15,000; Legitimate: 15,000 | Alexa, DMOZ, BOTW | PhishTank
VanNL126K | Sept–Dec 2019 | Phish: 100,000; Legitimate: 25,938 | DMOZ | PhishTank, OpenPhish, PhishStats
PhishIntention | Oct 2019–Aug 2020 | Phish: 29,496; Legitimate: 25,400; Misleading Legitimate: 3049 | Alexa | OpenPhish
PILWD-134k | Aug 2019–Sep 2020 | Phish: 66,964; Legitimate: 66,964 | Majestic Million, Quantcast | PhishTank
Phish360 (Ours) | Jan 2020–Feb 2024 | Phish: 4332; Legitimate: 6416 | Custom crawl, Random Subsampling | PhishTank, OpenPhish
Table 3. A comparative analysis of phishing domain statistics in various datasets. Numbers denote percentage (%) values. Best scores are highlighted in bold.
Dataset | URLs | Unique Domains | TLD | FLD | Subdomains
PWD2016 | 38.25 | 17.71 | 1.40 | 17.85 | 2.74
PhishIntention | 87.21 | 42.62 | 1.56 | 43.04 | 24.46
PILWD-134K | 86.68 | 45.23 | 1.08 | 46.41 | 21.35
VanNL126K | 100.0 | 25.91 | 0.67 | 26.85 | 13.65
Phish360 | 98.26 | 73.63 | 6.69 | 73.86 | 28.69
Table 4. A comparative analysis of legitimate domain statistics in various datasets. Numbers denote percentage (%) values. Best scores are highlighted in bold.
Dataset | URLs | Unique Domains | TLD | FLD | Subdomains
PWD2016 | 100.0 | 92.97 | 2.46 | - | 0.65
PhishIntention | 87.94 | 82.98 | 2.17 | 86.68 | 3.21
PILWD-134K | 99.23 | 91.37 | 0.76 | 92.57 | 3.27
VanNL126K | 100.0 | 84.90 | 2.16 | 85.68 | 9.94
Phish360 | 99.41 | 88.73 | 3.07 | 88.92 | 7.29
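The percentages in Tables 3 and 4 come from counting distinct URLs, FLDs, TLDs, and subdomains relative to the sample count. A simplified sketch on toy URLs; the last-label splitting below is a naive stand-in, and a production pipeline should use a public-suffix-aware parser so that multi-label suffixes such as "co.uk" are handled correctly:

```python
from urllib.parse import urlsplit

# Toy URL list; duplicates and shared FLDs are deliberate.
urls = [
    "http://login.bank-secure.com/session",
    "http://login.bank-secure.com/session",   # exact duplicate URL
    "https://pay.bank-secure.com/",           # same FLD, different subdomain
    "https://example.org/",
]

def uniqueness_stats(urls):
    hosts = [urlsplit(u).hostname for u in urls]
    flds = {".".join(h.split(".")[-2:]) for h in hosts}   # naive FLD split
    tlds = {h.split(".")[-1] for h in hosts}
    subs = {".".join(h.split(".")[:-2]) for h in hosts if h.count(".") >= 2}
    pct = lambda items: 100.0 * len(items) / len(urls)
    return {"urls": pct(set(urls)), "flds": pct(flds),
            "tlds": pct(tlds), "subdomains": pct(subs)}

stats = uniqueness_stats(urls)  # 75% unique URLs, 50% unique FLDs
```

High FLD uniqueness (as in Phish360's 73.63%) indicates that phishing samples are spread across many registrable domains rather than clustered on a few hosts.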
Table 5. A sample from the Phish360 dataset demonstrating the values of different columns in the data frame, highlighting the difference between extracted text via different parser tools.
Column Name | Value
dataset_name | Phish360
folder_name | P11258
class | phish
brand | adobe
URL | http://jaywatsonfiles.000webhostapp.com/ (accessed on 21 July 2023)
TLD | com
Domain | 000webhostapp
FLD | 000webhostapp.com
Subdomain | jaywatsonfiles
SSL | False
image_path | phish360/trainval/P11258_adobe/SCREEN-SHOT/screen_shoot.png
trafilatura_text | Adobe PDF Online\nAccount\nSign In\nConfirm yo…
trafilatura_text_language | English
beautifulSoup_text | \n\n\nAdobe PDF\n\n\n\n\n\n\n\nAdobe PDF Onlin…
beautifulSoup_text_language | English
html2text_text | | Adobe PDF Online || Account | Sign I…
lxml_text | \n\n\nAdobe PDF\n\n\na\n color:#454444;\n t…
html_extract_text | Adobe PDF\n\nAdobe PDF Online Account Sign In\…
full_html | <html dir="LTR" lang="en"><head>\n<meta http-…
html_text_language | English
Table 6. Statistics of missing/corrupted samples in multimodal datasets we employed.
Dataset Name# of Phish# of LegitPhish URLLegit URLPhish HTMLLegit HTMLPhish ScreenshotLegit Screenshot
PWD201615,00015,000211315363588398
PhishIntention29,49628,44903426 *4506991 *00
PILWD-134K66,96466,96400031011
VanNL126K100,00025,9380003655713659
Phish360 (Ours)43326416000000
* All missing files from benign category (no missing files in misleading samples).
Table 7. Performance comparison of three URL models across benchmark datasets in both same-dataset and cross-dataset regimes. The best score for each row is highlighted in bold.
Modality | Train Dataset | Test Dataset | GramBeddings Train Acc. | GramBeddings Test Acc. | URLNet Train Acc. | URLNet Test Acc. | URLTran Train Acc. | URLTran Test Acc.
URL | PILWD-134K | PILWD-134K | 0.9832 | 0.9691 | 0.9878 | 0.9635 | 0.9562 | 0.8984
URL | PILWD-134K | VanNL126k | 0.9834 | 0.9155 | 0.9878 | 0.9146 | 0.9551 | 0.8398
URL | PILWD-134K | PhishIntention | 0.9840 | 0.9308 | 0.9756 | 0.9533 | 0.9547 | 0.7306
URL | PILWD-134K | PWD2016 | 0.9842 | 0.9107 | 0.9922 | 0.9528 | 0.9546 | 0.5420
URL | VanNL126k | PILWD-134K | 0.9994 | 0.5741 | 0.9773 | 0.5808 | 0.9825 | 0.5283
URL | VanNL126k | VanNL126k | 0.9994 | 0.9412 | 0.9773 | 0.9852 | 0.9716 | 0.9840
URL | VanNL126k | PhishIntention | 0.9999 | 0.6015 | 0.9886 | 0.5510 | 0.9844 | 0.6227
URL | VanNL126k | PWD2016 | 0.9993 | 0.6738 | 0.9773 | 0.5420 | 0.9851 | 0.4915
URL | PhishIntention | PILWD-134K | 0.9990 | 0.6807 | 0.9765 | 0.6467 | 0.9883 | 0.6624
URL | PhishIntention | VanNL126k | 0.9981 | 0.9015 | 0.9922 | 0.8987 | 0.9871 | 0.7587
URL | PhishIntention | PhishIntention | 0.9994 | 0.9900 | 1.0000 | 0.9889 | 0.9879 | 0.9363
URL | PhishIntention | PWD2016 | 0.9996 | 0.9937 | 0.9766 | 0.9985 | 0.9867 | 0.8948
URL | PWD2016 | PILWD-134K | 0.9995 | 0.6515 | 1.0000 | 0.5000 | 0.9942 | 0.5526
URL | PWD2016 | VanNL126k | 1.0000 | 0.7940 | 1.0000 | 0.7940 | 0.9999 | 0.7359
URL | PWD2016 | PhishIntention | 1.0000 | 0.6717 | 1.0000 | 0.5410 | 1.0000 | 0.8581
URL | PWD2016 | PWD2016 | 0.9933 | 0.9948 | 1.0000 | 1.0000 | 1.0000 | 0.9983
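GramBeddings, among the strongest URL models in Table 7, operates on character n-grams of the raw URL. The decomposition step can be sketched as follows; the learned embedding lookup that follows it in the real model is omitted:

```python
# Extract overlapping character n-grams from a URL string.
def char_ngrams(url, n=4):
    return [url[i:i + n] for i in range(len(url) - n + 1)]

# Hypothetical look-alike domain used only for illustration.
grams = char_ngrams("paypa1-login.example", n=4)
# Deceptive substrings such as "ypa1" survive this tokenization, which
# is one reason character-level models catch look-alike domains that
# word-level tokenizers would miss.
```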
Table 8. Performance comparison of Resnet50 and Densenet121 across benchmarking datasets. The vision model that achieved the highest score for each dataset is highlighted in bold.
Modality | Vision Model | Dataset | Test Acc. | Test F1
Screenshot | Resnet50 | PILWD-134K | 0.9304 | 0.9248
Screenshot | Densenet121 | PILWD-134K | 0.9562 | 0.9544
Screenshot | Resnet50 | VanNL126K | 0.9392 | 0.9627
Screenshot | Densenet121 | VanNL126K | 0.9577 | 0.9733
Screenshot | Resnet50 | PhishIntention | 0.9556 | 0.9564
Screenshot | Densenet121 | PhishIntention | 0.9703 | 0.9696
Screenshot | Resnet50 | PWD2016 | 0.9321 | 0.9329
Screenshot | Densenet121 | PWD2016 | 0.9309 | 0.9294
Screenshot | Resnet50 | Phish360 | 0.9386 | 0.9200
Screenshot | Densenet121 | Phish360 | 0.9432 | 0.9263
Table 9. Comparison of accuracy scores between monolingual and multilingual sentence transformers using original and English-translated BS text. The highest scores achieved by combining the language model with the textual content of each dataset are highlighted in bold.
Dataset | Textual Source | Transformer | SVM Test Acc. | SVM Test F1 | XGB Test Acc. | XGB Test F1 | CatBoost Test Acc. | CatBoost Test F1
PhishIntention | Translated BS | MPNet | 0.9687 | 0.9734 | 0.9860 | 0.9880 | 0.9838 | 0.9862
PhishIntention | Original BS | XLM-R | 0.8658 | 0.8843 | 0.9798 | 0.9829 | 0.9772 | 0.9807
VanNL126k | Translated BS | MPNet | 0.9600 | 0.9758 | 0.9749 | 0.9848 | 0.9744 | 0.9845
VanNL126k | Original BS | XLM-R | 0.9023 | 0.9434 | 0.9754 | 0.9851 | 0.9734 | 0.9839
PILWD-134K | Translated BS | MPNet | 0.9184 | 0.9164 | 0.9550 | 0.9537 | 0.9563 | 0.9550
PILWD-134K | Original BS | XLM-R | 0.7969 | 0.8043 | 0.9529 | 0.9516 | 0.9512 | 0.9500
PWD2016 | Translated BS | MPNet | 0.9528 | 0.9486 | 0.9899 | 0.9891 | 0.9882 | 0.9873
PWD2016 | Original BS | XLM-R | 0.9614 | 0.9586 | 0.9882 | 0.9873 | 0.9862 | 0.9851
Phish360 | Translated BS | MPNet | 0.9622 | 0.9530 | 0.9716 | 0.9646 | 0.9702 | 0.9629
Phish360 | Original BS | XLM-R | 0.8531 | 0.8000 | 0.9571 | 0.9468 | 0.9590 | 0.9490
Table 10. Performance metrics of CrossPhire with feature concatenation in the same-dataset regime. The highest test accuracy score is highlighted in bold.
Vision Model | Dataset | Train Acc. | Test Acc. | Test Precision | Test Recall | Test F1
Resnet50 | PILWD-134K | 0.9980 | 0.9804 | 0.9788 | 0.9810 | 0.9783
Densenet121 | PILWD-134K | 0.9974 | 0.9807 | 0.9784 | 0.9821 | 0.9883
Resnet50 | VanNL126K | 0.9995 | 0.9942 | 0.9957 | 0.9972 | 0.9963
Densenet121 | VanNL126K | 0.9919 | 0.9926 | 0.9949 | 0.9962 | 0.9952
Resnet50 | PhishIntention | 0.9981 | 0.9957 | 0.9950 | 0.9976 | 0.9961
Densenet121 | PhishIntention | 0.9990 | 0.9963 | 0.9954 | 0.9983 | 0.9966
Resnet50 | PWD2016 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
Densenet121 | PWD2016 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
Resnet50 | Phish360 | 0.9991 | 0.9771 | 0.9799 | 0.9629 | 0.9602
Densenet121 | Phish360 | 0.9988 | 0.9796 | 0.9790 | 0.9699 | 0.9653
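The fusion behind Table 10 is plain vector concatenation of the per-modality embeddings followed by a classification head. A toy sketch with made-up dimensions and weights, not the trained CrossPhire parameters:

```python
# Toy per-modality embeddings; in the real model these come from the
# URL encoder, the sentence transformer, and the CNN's pooled features.
url_vec  = [0.2, 0.7]
text_vec = [0.9, 0.1, 0.4]
img_vec  = [0.3]

# Simple concatenation: the fused vector keeps every modality's
# dimensions intact, letting the classifier weight them freely.
fused = url_vec + text_vec + img_vec   # dimension 2 + 3 + 1 = 6

# A single linear unit standing in for the MLP classification head.
weights = [0.5, -0.2, 0.8, 0.1, -0.4, 0.6]
bias = -0.3
logit = sum(w * x for w, x in zip(weights, fused)) + bias  # ≈ 0.41
is_phish = logit > 0.0
```

Because no gate sits between the encoders and the classifier, no modality's evidence can be suppressed before the decision, which is one explanation offered for concatenation outperforming MoE fusion here.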
Table 11. Cross-Dataset benchmarking of CrossPhire (via Vector Concatenation) to assess its generalization capability among different datasets. The training and testing portions of the datasets are kept as 80% and 20%, respectively.
Vision Model | Source Dataset | Target Dataset | Test Acc. | Test Precision | Test Recall | Test F1
Resnet50 | PILWD-134K | VanNL126K | 0.9317 | 0.9644 | 0.9520 | 0.9565
Densenet121 | PILWD-134K | VanNL126K | 0.9302 | 0.9646 | 0.9499 | 0.9554
Resnet50 | PILWD-134K | PhishIntention | 0.9511 | 0.9356 | 0.9843 | 0.9573
Densenet121 | PILWD-134K | PhishIntention | 0.9627 | 0.9568 | 0.9807 | 0.9673
Resnet50 | PILWD-134K | PWD2016 | 0.9370 | 0.9471 | 0.9162 | 0.9211
Densenet121 | PILWD-134K | PWD2016 | 0.9301 | 0.9221 | 0.9285 | 0.9165
Resnet50 | PILWD-134K | Phish360 | 0.9338 | 0.9021 | 0.9376 | 0.9068
Densenet121 | PILWD-134K | Phish360 | 0.9305 | 0.8848 | 0.9514 | 0.9100
Resnet50 | VanNL126K | PILWD-134K | 0.6840 | 0.6102 | 0.9750 | 0.7406
Densenet121 | VanNL126K | PILWD-134K | 0.7042 | 0.6269 | 0.9728 | 0.7525
Resnet50 | VanNL126K | PhishIntention | 0.6388 | 0.6193 | 0.9967 | 0.7559
Densenet121 | VanNL126K | PhishIntention | 0.6854 | 0.6781 | 0.8820 | 0.7576
Resnet50 | VanNL126K | PWD2016 | 0.5168 | 0.4910 | 0.9682 | 0.6403
Densenet121 | VanNL126K | PWD2016 | 0.5050 | 0.4846 | 0.9542 | 0.6317
Resnet50 | VanNL126K | Phish360 | 0.8494 | 0.7348 | 0.9803 | 0.8243
Densenet121 | VanNL126K | Phish360 | 0.8308 | 0.7061 | 0.9942 | 0.8092
Resnet50 | PhishIntention | PILWD-134K | 0.8138 | 0.7788 | 0.8635 | 0.8081
Densenet121 | PhishIntention | PILWD-134K | 0.8061 | 0.7658 | 0.8679 | 0.8022
Resnet50 | PhishIntention | VanNL126K | 0.9069 | 0.9458 | 0.9552 | 0.9278
Densenet121 | PhishIntention | VanNL126K | 0.9155 | 0.9521 | 0.9447 | 0.9465
Resnet50 | PhishIntention | PWD2016 | 0.9936 | 0.9987 | 0.9876 | 0.9901
Densenet121 | PhishIntention | PWD2016 | 0.9969 | 0.9988 | 0.9946 | 0.9835
Resnet50 | PhishIntention | Phish360 | 0.8834 | 0.7954 | 0.9572 | 0.9603
Densenet121 | PhishIntention | Phish360 | 0.9380 | 0.8944 | 0.9595 | 0.9119
Resnet50 | PWD2016 | PILWD-134K | 0.4876 | 0.4876 | 1.000 | 0.6457
Densenet121 | PWD2016 | PILWD-134K | 0.4876 | 0.4876 | 1.000 | 0.6457
Resnet50 | PWD2016 | VanNL126K | 0.8214 | 0.8214 | 1.000 | 0.8989
Densenet121 | PWD2016 | VanNL126K | 0.8214 | 0.8214 | 1.000 | 0.8989
Resnet50 | PWD2016 | PhishIntention | 0.6084 | 0.5996 | 1.000 | 0.7414
Densenet121 | PWD2016 | PhishIntention | 0.6125 | 0.6021 | 1.000 | 0.7434
Resnet50 | PWD2016 | Phish360 | 0.4033 | 0.4033 | 1.000 | 0.5602
Densenet121 | PWD2016 | Phish360 | 0.4033 | 0.4033 | 1.000 | 0.5602
Resnet50 | Phish360 | PILWD-134K | 0.8582 | 0.8538 | 0.8556 | 0.8456
Densenet121 | Phish360 | PILWD-134K | 0.8574 | 0.8508 | 0.8582 | 0.8456
Resnet50 | Phish360 | VanNL126K | 0.9109 | 0.9786 | 0.9114 | 0.9417
Densenet121 | Phish360 | VanNL126K | 0.9309 | 0.9761 | 0.9388 | 0.9553
Resnet50 | Phish360 | PhishIntention | 0.9666 | 0.9630 | 0.9807 | 0.9701
Densenet121 | Phish360 | PhishIntention | 0.9615 | 0.9534 | 0.9824 | 0.9662
Resnet50 | Phish360 | PWD2016 | 0.9175 | 0.9789 | 0.8414 | 0.8947
Densenet121 | Phish360 | PWD2016 | 0.9333 | 0.9650 | 0.8893 | 0.9164
Table 12. Performance metrics of CrossPhire with Mixture-of-Experts fusion in the same-dataset regime. The difference column shows how MoE fusion deviates from joint-learning-based vector concatenation.
Vision Model | Dataset | Train Acc. | Test Acc. (Diff. %) | Test Precision | Test Recall | Test F1
Resnet50 | PILWD-134K | 0.9804 | 0.9741 (−0.6%) | 0.9844 | 0.9623 | 0.9704
Densenet121 | PILWD-134K | 0.9953 | 0.9751 (−0.5%) | 0.9826 | 0.9660 | 0.9715
Resnet50 | VanNL126K | 0.9957 | 0.9667 (−2.7%) | 0.9694 | 0.9908 | 0.9798
Densenet121 | VanNL126K | 0.9988 | 0.9519 (−4.0%) | 0.9465 | 0.9977 | 0.9712
Resnet50 | PhishIntention | 0.9961 | 0.9920 (−0.3%) | 0.9910 | 0.9953 | 0.9931
Densenet121 | PhishIntention | 0.9987 | 0.9930 (−0.3%) | 0.9922 | 0.9958 | 0.9939
Resnet50 | PWD2016 | 1.000 | 1.000 (0%) | 1.000 | 1.000 | 1.000
Densenet121 | PWD2016 | 1.000 | 1.000 (0%) | 1.000 | 1.000 | 1.000
Resnet50 | Phish360 | 0.9961 | 0.9528 (−2.4%) | 0.9715 | 0.9096 | 0.9361
Densenet121 | Phish360 | 0.9975 | 0.9589 (−2.0%) | 0.9448 | 0.9536 | 0.9494
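For contrast with concatenation, the MoE fusion evaluated in Table 12 routes through a softmax gate that weights per-modality experts. A toy sketch with illustrative gate logits and expert scores, not the trained gating network:

```python
import math

# Toy expert scores (per-modality phishing scores) and gate logits.
expert_scores = {"url": 0.9, "content": 0.7, "image": 0.2}
gate_logits   = {"url": 1.5, "content": 0.5, "image": -1.0}

def moe_fuse(scores, logits):
    # Softmax over gate logits yields modality weights summing to 1.
    z = sum(math.exp(v) for v in logits.values())
    gates = {k: math.exp(v) / z for k, v in logits.items()}
    # Final score is the gate-weighted sum of expert scores.
    return sum(gates[k] * scores[k] for k in scores), gates

fused, gates = moe_fuse(expert_scores, gate_logits)
```

Because the gates sum to 1, boosting one modality necessarily suppresses the others, which is one hypothesis for why gating underperforms concatenation when all modalities carry complementary signal.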
Table 13. Cross-dataset benchmarking of CrossPhire (via MoE fusion). In this experimentation, the largest and smallest datasets are employed. The training and testing portions of the datasets are kept as 80% and 20%, respectively. According to the results, all of the MoE fusion models fall behind the concatenation-based fusion in terms of test accuracy and F1-score. Investigating the differences reveals that the gap widens when a smaller dataset is used.
Vision Model | Source Dataset | Target Dataset | Test Acc. (Difference %) | Test Precision | Test Recall | Test F1
Resnet50 | PILWD-134K | VanNL126K | 0.9135 (−1.8%) | 0.9794 | 0.9139 | 0.9450
Densenet121 | PILWD-134K | VanNL126K | 0.9269 (−0.3%) | 0.9741 | 0.9359 | 0.9543
Resnet50 | PILWD-134K | PhishIntention | 0.9389 (−1.2%) | 0.9421 | 0.9545 | 0.9477
Densenet121 | PILWD-134K | PhishIntention | 0.9424 (−2.0%) | 0.9419 | 0.9610 | 0.9504
Resnet50 | PILWD-134K | PWD2016 | 0.9182 (−1.8%) | 0.9536 | 0.8670 | 0.8959
Densenet121 | PILWD-134K | PWD2016 | 0.9298 (−0.1%) | 0.9416 | 0.9058 | 0.9106
Resnet50 | PILWD-134K | Phish360 | 0.9267 (−0.7%) | 0.9104 | 0.9073 | 0.9064
Densenet121 | PILWD-134K | Phish360 | 0.9202 (−1.0%) | 0.8878 | 0.9177 | 0.9015
Resnet50 | Phish360 | PILWD-134K | 0.7656 (−9.2%) | 0.7529 | 0.7733 | 0.7581
Densenet121 | Phish360 | PILWD-134K | 0.7538 (−10%) | 0.7040 | 0.8547 | 0.7673
Resnet50 | Phish360 | VanNL126K | 0.8420 (−6.8%) | 0.9936 | 0.8129 | 0.8932
Densenet121 | Phish360 | VanNL126K | 0.8869 (−4.4%) | 0.9748 | 0.8852 | 0.9270
Resnet50 | Phish360 | PhishIntention | 0.9209 (−4.5%) | 0.9500 | 0.9131 | 0.9304
Densenet121 | Phish360 | PhishIntention | 0.8976 (−6.3%) | 0.8857 | 0.9478 | 0.9146
Resnet50 | Phish360 | PWD2016 | 0.8456 (−7.1%) | 0.9254 | 0.7278 | 0.8020
Densenet121 | Phish360 | PWD2016 | 0.8695 (−6.3%) | 0.8454 | 0.8814 | 0.8498
Table 14. Comparative analysis of CrossPhire and baseline approaches.
Study | Dataset | Approach | Test Acc. | Prec. | F1
Sánchez-Paniagua et al. [38] | PILWD-134K | LightGBM | 0.979 | 0.983 | 0.980
Ours | PILWD-134K | CrossPhire | 0.980 | 0.984 | 0.980
Chiew et al. [47] | PWD2016 | LightGBM | 0.976 | - | -
Ours | PWD2016 | CrossPhire | 1.000 | 1.000 | 1.000
Liu et al. [39] | PhishIntention | Vision-based | - | 0.998 | -
Ours | PhishIntention | CrossPhire | 0.996 | 0.995 | 0.996
Van Dooremaal et al. [37] | VanNL126k | SSIM | 0.996 | - | 0.997
Ours | VanNL126k | CrossPhire | 0.994 | 0.995 | 0.996
Table 15. Comparison between CrossPhire and fine-tuned CLIP models on several benchmark datasets.
Model | Dataset | Train Acc. | Test Acc.
CLIP | Phish360 | 0.9938 | 0.9622
CrossPhire | Phish360 | 0.9991 | 0.9796
CLIP | PhishIntention | 0.9904 | 0.9895
CrossPhire | PhishIntention | 0.9990 | 0.9963
CLIP | PILWD-134K | 0.9888 | 0.9648
CrossPhire | PILWD-134K | 0.9904 | 0.9812
CLIP | VanNL126k | 0.9842 | 0.9798
CrossPhire | VanNL126k | 0.9995 | 0.9942
CLIP | PWD2016 | 0.9907 | 0.9922
CrossPhire | PWD2016 | 1.0000 | 1.0000
CLIP | Phish360 → Phish360-ZDay | 0.9937 | 0.9213
CrossPhire | Phish360 → Phish360-ZDay | 0.9998 | 0.9346
Table 16. Statistical significance analysis of performance comparisons.
Dataset | Baseline | CrossPhire | Diff. | 95% CI | Z-Stat | p-Value
Comparison with Published Baselines (Table 14):
PILWD-134K | 97.90% | 98.04% | +0.14% | [97.89, 98.19] | 0.52 | 0.301
PWD2016 | 97.60% | 100.00% | +2.40% | [99.87, 100.00] | 6.14 | <0.001 ***
PhishIntention | 99.80% | 99.60% | −0.20% | [99.42, 99.71] | −1.47 | 0.142
VanNL126k | 99.60% | 99.42% | −0.18% | [99.30, 99.54] | −1.41 | 0.159
Comparison with CLIP (Table 15):
Phish360 | 96.22% | 97.96% | +1.74% | [97.36, 98.56] | 2.14 | 0.016 *
PhishIntention | 98.95% | 99.63% | +0.68% | [99.49, 99.77] | 3.49 | <0.001 ***
PILWD-134K | 96.48% | 98.12% | +1.64% | [97.97, 98.27] | 5.91 | <0.001 ***
VanNL126k | 97.98% | 99.42% | +1.44% | [99.30, 99.54] | 6.49 | <0.001 ***
PWD2016 | 99.22% | 100.00% | +0.78% | [99.87, 100.00] | 3.44 | <0.001 ***
* p < 0.05, *** p < 0.001; CI = Confidence Interval for CrossPhire accuracy. Test sets: PILWD-134K (26,786), VanNL126k (25,188), PhishIntention (11,589), PWD2016 (6000), Phish360 (2150).
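The exact test variant behind Table 16 is not spelled out; the CI column is consistent with a normal-approximation interval on the CrossPhire accuracy, and a pooled two-proportion z-test is one common choice for the comparison itself. A sketch on the Phish360 row (96.22% vs. 97.96%, n = 2150):

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    # Pooled two-proportion z-test; the paper's reported z-statistics
    # may come from a different variant (e.g., an unpooled SE).
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

def normal_ci(p, n, z=1.96):
    # 95% normal-approximation confidence interval for a proportion.
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

z_stat = two_proportion_z(0.9622, 0.9796, 2150, 2150)  # positive: CrossPhire higher
lo, hi = normal_ci(0.9796, 2150)                       # ≈ [97.36, 98.56] in percent
```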
Table 17. Results of modality-dropout-supported training of CrossPhire against different levels of MMSR, the Missing Modality Sample Ratio (10% light, 25% moderate, and 50% severe). All experiments were conducted with ResNet50 as the vision encoder, and the models were trained for 20 epochs. As the fourth column demonstrates, when tested against the original complete data, increasing the MMSR enhances generalization capability up to an MMSR of 25%. However, an excessive increase (i.e., 50%) of MMSR in the training data hinders this positive progress and does not improve test scores as expected. Additionally, we observe improved performance scores on perturbed test data (see columns 6–11) because perturbed samples are included in the training data at varying levels.
Source Dataset | Target Dataset | MMSR (Train) | 0% Test Acc. | 0% Test F1 | 10% Test Acc. | 10% Test F1 | 25% Test Acc. | 25% Test F1 | 50% Test Acc. | 50% Test F1
PILWD-134K | PILWD-134K | 0% | 0.9804 | 0.9783 | 0.9683 | 0.9647 | 0.9513 | 0.9446 | 0.9178 | 0.9036
PILWD-134K | PILWD-134K | 10% | 0.9878 | 0.9861 | 0.9745 | 0.9718 | 0.9580 | 0.9540 | 0.9384 | 0.9322
PILWD-134K | PILWD-134K | 25% | 0.9817 | 0.9795 | 0.9794 | 0.9768 | 0.9701 | 0.9672 | 0.9550 | 0.9506
PILWD-134K | PILWD-134K | 50% | 0.9797 | 0.9773 | 0.9784 | 0.9757 | 0.9725 | 0.9693 | 0.9602 | 0.9556
Phish360 | Phish360 | 0% | 0.9771 | 0.9602 | 0.9575 | 0.9338 | 0.9323 | 0.8947 | 0.8856 | 0.8157
Phish360 | Phish360 | 10% | 0.9799 | 0.9649 | 0.9757 | 0.9617 | 0.9626 | 0.9467 | 0.9486 | 0.9250
Phish360 | Phish360 | 25% | 0.9832 | 0.9712 | 0.9794 | 0.9663 | 0.9729 | 0.9572 | 0.9482 | 0.9223
Phish360 | Phish360 | 50% | 0.9818 | 0.9673 | 0.9804 | 0.9677 | 0.9738 | 0.9589 | 0.9547 | 0.9322
Phish360 | Phish360—ZDay | 0% | 0.9346 | 0.9318 | 0.9299 | 0.9242 | 0.8860 | 0.8819 | 0.8291 | 0.8038
Phish360 | Phish360—ZDay | 10% | 0.9383 | 0.9384 | 0.9309 | 0.9320 | 0.9271 | 0.9246 | 0.9028 | 0.8921
Phish360 | Phish360—ZDay | 25% | 0.9477 | 0.9475 | 0.9393 | 0.9378 | 0.9243 | 0.9213 | 0.9150 | 0.9124
Phish360 | Phish360—ZDay | 50% | 0.9393 | 0.9384 | 0.9365 | 0.9346 | 0.9299 | 0.8287 | 0.9057 | 0.9006
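The modality-dropout training behind Table 17 can be sketched as follows: with probability equal to the MMSR, one randomly chosen modality vector of a sample is zeroed before fusion, so the classifier learns to cope with missing inputs. Vectors and modality names below are toy values:

```python
import random

def modality_dropout(modalities, mmsr, rng):
    # With probability `mmsr`, zero out one randomly chosen modality.
    if rng.random() < mmsr:
        victim = rng.choice(list(modalities))
        modalities = {**modalities, victim: [0.0] * len(modalities[victim])}
    return modalities

sample = {"url": [0.2, 0.7], "content": [0.9, 0.1], "image": [0.3, 0.5]}
dropped = modality_dropout(sample, mmsr=1.0, rng=random.Random(0))  # always drops one
```

Setting `mmsr=0.0` leaves samples untouched, reproducing the baseline row of the table; intermediate values mimic the 10%, 25%, and 50% regimes.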
Table 18. Component-wise runtime breakdown for single-sample inference.
Component | Time (ms) | Percentage
Data Preparation:
   Screenshot capture (Selenium) | ≈1200 | 75.9%
   HTML download + parsing (BS) | ≈300 | 19.0%
Neural Network Inference:
   GramBeddings (URL encoder) | 12 | 0.8%
   ResNet50 (Vision encoder) | 45 | 2.8%
   MPNet (Content encoder) | 18 | 1.1%
   Fusion + Classification | 5 | 0.3%
Total Latency | 1580 | 100%
Table 19. Batch processing performance and memory consumption (GPU inference only, excluding data preparation).
Batch Size | Latency (ms) | Throughput (Samples/s) | GPU Memory (GB)
1 | 80 | 12.5 | 2.1
8 | 185 | 43.2 | 4.8
16 | 320 | 50.0 | 7.2
32 | 580 | 55.2 | 12.4
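The throughput column in Table 19 follows directly from batch size and per-batch latency (samples/s = batch_size / latency). Recomputing it from the table's own (batch, ms) pairs:

```python
# (batch_size, latency_ms) pairs taken from Table 19.
measurements = [(1, 80), (8, 185), (16, 320), (32, 580)]

# Throughput in samples per second: batch size divided by latency in s.
throughput = {b: round(b / (ms / 1000.0), 1) for b, ms in measurements}
# Larger batches amortize per-batch overhead, so samples/s keeps rising
# even though per-batch latency grows.
```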