Phishing Website Impersonation: Comparative Analysis of Detection and Target Recognition Methods
Abstract
1. Introduction
- To allow a fair comparison of the selected phishing detection methods, we design and develop a dedicated framework that is easily extendable to incorporate other future solutions and datasets as they emerge. The framework is publicly available on GitHub (https://github.com/Percival33/phish-target-recognition (accessed on 15 December 2025)).
- We perform a thorough comparative analysis of phishing website detection methods that rely on website screenshots and enable the recognition of their impersonation targets. The analysis is conducted on three datasets for two state-of-the-art methods (Phishpedia and VisualPhishNet) and a Baseline solution that uses perceptual-hash similarity.
- We demonstrate that the proposed Baseline method achieves superior stability and binary classification performance compared to complex deep learning models, particularly on diverse or synthetically augmented datasets.
2. Related Work
2.1. State-of-the-Art in Phishing Detection and Target Identification
2.1.1. Heuristic and Feature-Based Approaches
2.1.2. Deep Learning-Based Visual Analysis
- VisualPhishNet: Proposed by Abdelnabi et al. [11], this method represents the holistic approach. It employs a triplet Convolutional Neural Network (CNN) based on a VGG-16 backbone to learn a universal representation of website layouts. The core assumption is that different pages belonging to the same entity (e.g., a login page and a homepage) share a visual “feel” that can be grouped in an embedding space. The network is trained using a triplet loss function, which minimizes the Euclidean distance between an anchor and a positive sample while maximizing the distance to a negative sample. A key aspect of this method is its reliance on “hard negative mining” during training to refine the decision boundaries between visually similar but distinct brands.
- Phishpedia: In contrast to the layout-based approach, Phishpedia [13], developed by Lin et al., frames the problem as a high-precision object detection and recognition task. The system utilizes a two-stage pipeline: first, a Faster R-CNN model extracts potential logo regions from the screenshot; second, a Siamese network identifies the specific brand by comparing the extracted region against a reference database. Unlike VisualPhishNet, Phishpedia utilizes transfer learning on the Logo-2K+ dataset and avoids standard triplet loss in favor of a classification-based fine-tuning approach. This design choice aims to handle visual variations of specific logos (e.g., “Adobe” vs. “Adobe AIR”) more effectively than holistic layout matching.
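To make the contrast between the two designs concrete, the following minimal sketch implements a triplet loss (the training objective VisualPhishNet uses) and a detect-then-match pipeline in the spirit of Phishpedia. All function names, the margin, and the similarity threshold are illustrative assumptions, not the authors' implementations.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """VisualPhishNet-style objective: pull same-brand embeddings together,
    push other-brand embeddings apart by at least `margin` (the margin value
    here is an illustrative assumption)."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor vs. same-brand page
    d_neg = np.linalg.norm(anchor - negative)  # anchor vs. other-brand page
    return max(0.0, d_pos - d_neg + margin)

def identify_brand(screenshot, detect_logos, embed, reference_db, threshold=0.8):
    """Phishpedia-style two-stage sketch: `detect_logos` stands in for the
    Faster R-CNN region proposer and `embed` for the Siamese encoder; the
    cosine-similarity threshold is a hypothetical value."""
    best_brand, best_sim = None, threshold
    for region in detect_logos(screenshot):        # stage 1: logo regions
        v = embed(region)
        v = v / np.linalg.norm(v)
        for brand, ref in reference_db.items():    # stage 2: brand matching
            sim = float(v @ (ref / np.linalg.norm(ref)))
            if sim > best_sim:
                best_brand, best_sim = brand, sim
    return best_brand  # None => no protected brand recognized
```

Note that a triplet whose negative is already well separated yields zero loss, which is why hard negative mining (selecting triplets that still violate the margin) drives the useful gradient signal during training.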
2.2. The Challenge of Reproducibility and Evaluation Frameworks
3. Fundamentals
3.1. Visual Hashing and Descriptors
3.2. Evaluation Metrics
3.2.1. Binary Classification Metrics
- F1 Score: The F1 score is the harmonic mean of Precision and Recall. It is particularly useful when the cost of False Negatives (missing a phishing attack) and False Positives (blocking a legitimate user) requires a balanced view. It is defined as: F1 = 2 · (Precision · Recall) / (Precision + Recall).
- Matthews Correlation Coefficient (MCC): The MCC is widely regarded as one of the most robust metrics for binary classification on imbalanced datasets. Unlike the F1 score, which ignores True Negatives, MCC incorporates all four quadrants of the confusion matrix. It returns a value between −1 and +1, where +1 represents a perfect prediction, 0 indicates no better than random guessing, and −1 indicates total disagreement between prediction and observation. It is defined as: MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
- ROC AUC (Receiver Operating Characteristic Area Under the Curve): The methods evaluated in this study, particularly VisualPhishNet and the Baseline, rely on calculating a distance metric that is compared against a threshold to determine the final classification. To evaluate the performance of these classifiers across all possible classification thresholds, rather than a single, fixed cut-off point, we employ the ROC AUC metric. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings, where TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
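For concreteness, the three binary metrics above can be computed directly from predictions and scores. The sketch below is a plain-Python illustration (with ROC AUC in its Mann–Whitney pair-counting formulation), not the evaluation code used in the framework.

```python
import math

def confusion(y_true, y_pred):
    """Counts for the four confusion-matrix cells (1 = phishing, 0 = benign)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f1(y_true, y_pred):
    tp, _, fp, fn = confusion(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall

def mcc(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_auc(y_true, y_score):
    """Mann-Whitney formulation: the fraction of (phishing, benign) pairs in
    which the phishing sample receives the higher score (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For distance-based detectors such as VisualPhishNet and the Baseline, the score passed to `roc_auc` would be the negated distance, so that higher values indicate phishing.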
3.2.2. Target Recognition Metrics
- Macro-averaged F1: We utilize Macro-averaged F1, which calculates the F1 score independently for each class and then takes the unweighted mean. This metric treats all classes equally, regardless of their frequency. In the context of phishing, this is crucial for ensuring that the system protects smaller, less frequently attacked brands just as effectively as major global targets.
- Micro-averaged F1: Complementing the macro approach, we employ the Micro-averaged F1. Unlike macro-averaging, this metric aggregates the contributions of all classes by summing the total True Positives, False Positives, and False Negatives across all N classes before calculating the score.
- Identification Rate: To specifically evaluate the utility of the system in a real-world triage scenario, we utilized the Identification Rate. Proposed in Phishpedia [13], this metric measures the system’s ability not only to detect a phishing page, but also to correctly identify its target. It is calculated as the ratio of correctly recognized targets to the total number of correctly detected phishing pages, effectively acting as a recall metric for the specific sub-task of brand recognition. It is important to note that this is a conditional metric, i.e., it evaluates classification performance only on the subset of samples already correctly identified as phishing.
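The three target-recognition metrics can be sketched compactly in plain Python (the label lists are hypothetical; note that for single-label multiclass predictions, micro-averaged F1 reduces to overall accuracy):

```python
def per_class_f1(y_true, y_pred, cls):
    # one-vs-rest F1 for a single target class
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def macro_f1(y_true, y_pred):
    # unweighted mean over classes: small brands count as much as major ones
    classes = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

def micro_f1(y_true, y_pred):
    # aggregates TP/FP/FN over all classes first; for single-label multiclass
    # predictions this reduces to overall accuracy
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def identification_rate(pred_targets, true_targets, detected):
    """Share of correctly detected phishing pages whose impersonation target
    is also recognized correctly (a conditional metric)."""
    hits = [p == t for p, t, d in zip(pred_targets, true_targets, detected) if d]
    return sum(hits) / len(hits) if hits else 0.0
```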
4. Framework for Fair Comparison of Phishing Detection Methods
4.1. The Framework’s Architecture and Design
4.2. Data Management and Standardization
- The VisualPhishNet (VP) dataset: This dataset originates from the VisualPhishNet (VP) paper. The original VP dataset comprises 9363 benign images and 1195 phishing samples covering 155 brands, sourced from Alexa and PhishTank. To make the dataset usable by the other evaluated methods, we intersected its trusted list with the target list of the Phishpedia dataset, leaving 144 classes with 8835 benign screenshots and 2107 phishing samples. Note that the benign count includes 78 images added via augmentation to ensure sufficient training samples per class. To address the class imbalance, 2537 synthetic screenshots were added to the phishing subset.
- The Phishpedia (PP) dataset: This dataset originates from the Phishpedia (PP) paper. The original dataset consisted of 29,496 phishing samples from OpenPhish and 29,951 benign websites from Alexa. To adapt the dataset to our needs, we intersected its targets with those that have both logos and screenshots in the benign_sample_30k directory, obtaining 56 targets with 14,183 phishing screenshots and 311 benign samples. To satisfy the minimum of 25 samples per class, we augmented the dataset with 317 phishing and 1231 benign samples.
- The CERT Polska dataset: To evaluate performance on real-world, region-specific threats, we collaborated with CERT Polska to curate a dataset of 15,049 unique URLs and screenshots collected between June 2023 and June 2024. This dataset is of particularly high quality; the phishing examples were verified by analysts, and we filtered out low-quality screenshots (e.g., loading screens or error pages). Unlike the other two, this dataset was not augmented, but it includes only classes with at least 25 benign and 25 phishing screenshots. Overall, it covers 36 distinct impersonation targets.
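The minimum-samples-per-class rule applied to the CERT dataset can be sketched as follows; the data layout (two flat lists of target labels) is a simplifying assumption for illustration.

```python
from collections import Counter

def eligible_targets(benign_labels, phishing_labels, min_samples=25):
    """Keep only impersonation targets that have at least `min_samples`
    benign AND `min_samples` phishing screenshots, mirroring the filtering
    rule described for the CERT Polska dataset."""
    benign, phish = Counter(benign_labels), Counter(phishing_labels)
    return sorted(t for t in benign.keys() & phish.keys()
                  if benign[t] >= min_samples and phish[t] >= min_samples)
```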
4.3. Adaptation and Implementation of Detection Methods
- VisualPhishNet adaptation: The original implementation of VisualPhishNet was provided as a series of Jupyter Notebooks, which are unsuitable for automated pipelines. We refactored this code into executable Python scripts, adding command-line argument support and integrating the Weights & Biases platform for experiment tracking. This enables real-time monitoring of loss convergence and GPU resource utilization during training of the triplet-loss network, lets us verify that all models reach convergence, and triggers early stopping based on validation loss to prevent overfitting. The training phase followed the implementation details provided by the VisualPhishNet authors, including the training protocol and mini-batch sizes.
- Threshold determination for VisualPhishNet: For the VisualPhishNet model, the classification decision relies on a distance-based metric: a suspicious website is classified as a phishing attempt if the distance between its embedding and the embedding of a protected target falls below a cutoff threshold. To ensure a faithful reproduction of the method, we adopted the threshold-selection strategy described in the original VisualPhishNet paper [11], optimizing for the Equal Error Rate (EER) point, i.e., the threshold at which the difference between the False Positive Rate (FPR) and the False Negative Rate (FNR) is minimized. The optimal threshold was computed independently for each of the three datasets using their respective validation partitions, based on the distances between the embeddings of the validation samples and the training examples. To locate the EER point efficiently, the search was conducted in two stages: (i) a coarse search scanned threshold values from 0 to the maximum distance observed in the validation set with a step size of 10; (ii) a fine-grained search refined the result within the range of the mean distance ± one standard deviation with a step size of 1. This ensured that the binary classification boundary was tuned to the data distribution of each dataset prior to final testing. The resulting thresholds were 8.00 for the CERT dataset, a significantly higher 50.00 for the VP dataset, and 3.00 for the PP dataset.
We further analyzed the efficacy of these thresholds by examining the distribution of embedding distances between the phishing and legitimate validation sets during training, using fitted Gaussian probability density functions. The distributions are presented in Figure 2.
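The two-stage EER threshold search described above can be sketched as follows. This is a NumPy illustration of the procedure, not the authors' code; ties between equally good thresholds resolve to the first candidate scanned.

```python
import numpy as np

def eer_threshold(phish_dists, benign_dists, coarse_step=10, fine_step=1):
    """Two-stage search for the Equal Error Rate threshold: a sample is
    flagged as phishing when its embedding distance to the nearest
    protected target falls below the threshold."""
    def gap(th):
        fpr = np.mean(benign_dists < th)   # benign wrongly flagged as phishing
        fnr = np.mean(phish_dists >= th)   # phishing pages missed
        return abs(fpr - fnr)

    dists = np.concatenate([phish_dists, benign_dists])
    # stage 1: coarse scan from 0 up to the maximum observed distance
    coarse = np.arange(0, dists.max() + coarse_step, coarse_step)
    # stage 2: fine scan within mean distance +/- one standard deviation
    lo = max(dists.mean() - dists.std(), 0.0)
    fine = np.arange(lo, dists.mean() + dists.std() + fine_step, fine_step)
    # return the candidate that best equalizes FPR and FNR
    return float(min(np.concatenate([coarse, fine]), key=gap))
```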
- Phishpedia integration: Phishpedia combines object detection (Faster R-CNN) with a Siamese network for logo matching. The original codebase lacked training scripts and suffered from data inconsistencies, including missing mappings between protected brands and their corresponding domains. We extended the implementation to include data normalization scripts that repair these mappings and handle text encoding issues, ensuring the model could be evaluated on external datasets without runtime errors.
- Baseline method (perceptual hashing): To establish a performance baseline, we implemented a lightweight visual similarity method. This approach utilizes the Discrete Cosine Transform (DCT) to generate perceptual hashes (pHash) and real-valued vector hashes (pHashF) of the website screenshots. These vectors are indexed using the FAISS library [30] for efficient similarity search. During inference, the system calculates the Euclidean or Hamming distance to the nearest neighbor in the training set; if the distance falls below a learned threshold, the site is classified as phishing, inheriting the target label of its neighbor. The optimal threshold for this method was selected to maximize the F1 Macro score, prioritizing balanced performance across all classes.
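A minimal version of the Baseline's pHash-and-nearest-neighbour idea is sketched below. For self-containment it uses a NumPy-built DCT and brute-force Hamming search in place of a pHash library and the FAISS index, and it assumes the screenshot is already a square grayscale array; the Hamming threshold is an illustrative value, not the learned one.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2D DCT-II built from the cosine basis matrix
    (a NumPy-only stand-in for a library DCT routine)."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M @ block @ M.T

def phash(gray, hash_size=8):
    """Perceptual hash: keep the low-frequency corner of the DCT spectrum
    and threshold it at its median. `gray` is a square grayscale array
    (e.g. a 32x32 downscaled screenshot)."""
    low = dct2(gray.astype(float))[:hash_size, :hash_size]
    return (low > np.median(low)).flatten()

def classify(query_hash, index_hashes, labels, threshold=10):
    """Nearest neighbour by Hamming distance: below the (illustrative)
    threshold, the query inherits its neighbour's target label."""
    dists = [int((query_hash != h).sum()) for h in index_hashes]
    i = int(np.argmin(dists))
    return (labels[i], dists[i]) if dists[i] <= threshold else (None, dists[i])
```

In the actual framework, the real-valued variant (pHashF) would instead be indexed with FAISS (e.g. an L2 flat index) for efficient retrieval over large reference sets.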
5. Analysis Results
5.1. Performance of the Methods
5.1.1. VisualPhishNet
5.1.2. Phishpedia
5.1.3. Baseline
5.2. Comparison of Methods
- The Baseline method shows the lowest variation in multiclass classification results across datasets among the three methods. It achieves the best binary classification results and relatively good multiclass results.
- The Phishpedia method achieves the highest Identification Rate values, which, unlike those of VisualPhishNet, do not change significantly across datasets. It also achieves moderate multiclass classification results.
- The VisualPhishNet method exhibits significant variation in detection results across datasets in both binary and multiclass scenarios, and in many cases it achieves the worst results of all three methods.
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Haan, K. Top Website Statistics Today. 2024. Available online: https://www.forbes.com/advisor/business/software/website-statistics/ (accessed on 5 December 2025).
- ENISA—European Union Agency for Cybersecurity. ENISA Threat Landscape 2025; ENISA: Athens, Greece, 2025; Available online: https://www.enisa.europa.eu/sites/default/files/2025-12/ENISA%20Threat%20Landscape%202025_v1.1.pdf (accessed on 29 December 2025).
- Anti-Phishing Working Group (APWG). Phishing Activity Trends Report. 2nd Quarter 2025. 2025. Available online: https://docs.apwg.org/reports/apwg_trends_report_q2_2025.pdf (accessed on 20 November 2025).
- Khonji, M.; Iraqi, Y.; Jones, A. Phishing detection: A literature survey. IEEE Commun. Surv. Tutor. 2013, 15, 2091–2121. [Google Scholar] [CrossRef]
- Zieni, R.; Massari, L.; Calzarossa, M.C. Phishing or not phishing? A survey on the detection of phishing websites. IEEE Access 2023, 11, 18499–18519. [Google Scholar] [CrossRef]
- Almomani, A.; Gupta, B.B.; Atawneh, S.; Meulenberg, A.; Almomani, E. A survey of phishing email filtering techniques. IEEE Commun. Surv. Tutor. 2013, 15, 2070–2090. [Google Scholar] [CrossRef]
- Graziano, G.; Ucci, D.; Bisio, F.; Oneto, L. PhishVision: A Deep Learning Based Visual Brand Impersonation Detector for Identifying Phishing Attacks. In Optimization, Learning Algorithms and Applications, Proceedings of the Third International Conference, Ponta Delgada, Portugal, 27–29 September 2023; Pereira, A.I., Mendes, A., Fernandes, F.P., Pacheco, M.F., Coelho, J.P., Lima, J., Eds.; Springer: Cham, Switzerland, 2024; pp. 123–134. [Google Scholar]
- Mishra, R.; Varshney, G. A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025. In Applied Cryptography and Network Security Workshops, Proceedings of the ACNS 2025 Satellite Workshops: AIHWS, AIoTS, QSHC, SCI, PrivCrypt, SPIQE, SiMLA, and CIMSS 2025, Munich, Germany, 23–26 June 2025; Manulis, M., Ed.; Springer: Cham, Switzerland, 2026; pp. 89–108. [Google Scholar]
- Bozkir, A.S.; Aydos, M. LogoSENSE: A companion HOG based logo detection scheme for phishing web page and E-mail brand recognition. Comput. Secur. 2020, 95, 101855. [Google Scholar] [CrossRef]
- Ren, K.; Qiang, W.; Wu, Y.; Zhou, Y.; Zou, D.; Jin, H. An Empirical Study on the Effects of Obfuscation on Static Machine Learning-Based Malicious JavaScript Detectors. In Proceedings of the ISSTA 2023: 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, Seattle, WA, USA, 17–21 July 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1420–1432. [Google Scholar] [CrossRef]
- Abdelnabi, S.; Krombholz, K.; Fritz, M. VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), Virtual, 9–13 November 2020. [Google Scholar]
- Liu, R.; Lin, Y.; Yang, X.; Ng, S.H.; Divakaran, D.M.; Dong, J.S. Inferring Phishing Intention via Webpage Appearance and Dynamics: A Deep Vision Based Approach. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; pp. 1633–1650. [Google Scholar]
- Lin, Y.; Liu, R.; Divakaran, D.M.; Ng, J.Y.; Chan, Q.Z.; Lu, Y.; Si, Y.; Zhang, F.; Dong, J.S. Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Virtual, 11–13 August 2021; USENIX Association: Berkeley, CA, USA, 2021; pp. 3793–3810. [Google Scholar]
- Fu, A.Y.; Wenyin, L.; Deng, X. Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD). IEEE Trans. Dependable Secur. Comput. 2006, 3, 301–311. [Google Scholar] [CrossRef]
- Afroz, S.; Greenstadt, R. PhishZoo: Detecting Phishing Websites by Looking at Them. In Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing, Palo Alto, CA, USA, 18–21 September 2011; pp. 368–375. [Google Scholar] [CrossRef]
- Seifert, C.; Stokes, J.W.; Colcernian, C.; Platt, J.C.; Lu, L. Robust scareware image detection. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 2920–2924. [Google Scholar] [CrossRef]
- Saheed, Y.K.; Kehinde, T.O.; Ayobami Raji, M.; Baba, U.A. Feature selection in intrusion detection systems: A new hybrid fusion of Bat algorithm and Residue Number System. J. Inf. Telecommun. 2024, 8, 189–207. [Google Scholar] [CrossRef]
- Chen, S.; Lu, Y.; Liu, D.J. Phishing Target Identification Based on Neural Networks Using Category Features and Images. Secur. Commun. Netw. 2022, 2022, 5653270. [Google Scholar] [CrossRef]
- Bhurtel, M.; Siwakoti, Y.R.; Rawat, D.B. Phishing Attack Detection with ML-Based Siamese Empowered ORB Logo Recognition and IP Mapper. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), New York, NY, USA, 2–5 May 2022; pp. 1–6. [Google Scholar] [CrossRef]
- van den Hout, T.; Wabeke, T.; Moura, G.C.M.; Hesselman, C. LogoMotive: Detecting Logos on Websites to Identify Online Scams—A TLD Case Study. In Passive and Active Measurement; Springer International Publishing: Berlin/Heidelberg, Germany, 2022; pp. 3–29. [Google Scholar] [CrossRef]
- Zeng, V.; Zhou, X.; Baki, S.; Verma, R.M. PhishBench 2.0: A Versatile and Extendable Benchmarking Framework for Phishing. In Proceedings of the CCS ’20: 2020 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 9–13 November 2020; ACM: New York, NY, USA, 2020; pp. 2077–2079. [Google Scholar] [CrossRef]
- Hannousse, A.; Yahiouche, S. Towards benchmark datasets for machine learning based website phishing detection: An experimental study. Eng. Appl. Artif. Intell. 2021, 104, 104347. [Google Scholar] [CrossRef]
- Dalton, T.; Gowda, H.; Rao, G.; Pargi, S.; Khodabakhshi, A.H.; Rombs, J.; Jou, S.; Marwah, M. PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark. arXiv 2025, arXiv:2507.10854. [Google Scholar] [CrossRef]
- Ji, F.; Lee, K.; Koo, H.; You, W.; Choo, E.; Kim, H.; Kim, D. Evaluating the effectiveness and robustness of visual similarity-based phishing detection models. In Proceedings of the SEC ’25: 34th USENIX Conference on Security Symposium, Seattle, WA, USA, 13–15 August 2025. [Google Scholar]
- Zauner, C. Implementation and Benchmarking of Perceptual Image Hash Functions. 2010. Available online: http://phash.org/docs/pubs/thesis_zauner.pdf (accessed on 15 December 2025).
- Ramírez, S. FastAPI. Available online: https://github.com/fastapi/fastapi (accessed on 15 December 2025).
- Colvin, S. Pydantic. Data Validation Using Python Type Hints, Version v2.11.7. 2025. Available online: https://docs.pydantic.dev/latest/ (accessed on 14 August 2025).
- Bayer, M. SQLAlchemy. In The Architecture of Open Source Applications Volume II: Structure, Scale, and a Few More Fearless Hacks; Brown, A., Wilson, G., Eds.; Lulu.com: Morrisville, NC, USA, 2012. [Google Scholar]
- TorchVision Maintainers and Contributors. TorchVision: PyTorch’s Computer Vision Library. 2016. Available online: https://github.com/pytorch/vision (accessed on 15 December 2025).
- Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss Library. arXiv 2025, arXiv:2401.08281. [Google Scholar] [CrossRef]





Dataset partitioning (phishing/benign counts per split):

| Dataset | Train Phish | Train Benign | Val Phish | Val Benign | Test Phish | Test Benign | ∑ Phish | ∑ Benign | No. of Targets |
|---|---|---|---|---|---|---|---|---|---|
| CERT | 4326 | 4703 | 1442 | 1568 | 1442 | 1568 | 7210 | 7839 | 36 |
| PP | 8700 | 924 | 2900 | 309 | 2900 | 309 | 14,500 | 1542 | 56 |
| VP | 2786 | 5332 | 929 | 1751 | 929 | 1752 | 4644 | 8835 | 144 |

VisualPhishNet, binary classification results per dataset:

| Dataset | F1 | ROC AUC | MCC |
|---|---|---|---|
| CERT | 0.3577 | 0.4101 | −0.1812 |
| VP | 0.4954 | 0.5887 | 0.1693 |
| PP | 0.1673 | 0.1018 | −0.6160 |

VisualPhishNet, target recognition results per dataset:

| Dataset | F1 Micro | F1 Macro | MCC | Identification Rate |
|---|---|---|---|---|
| CERT | 0.3654 | 0.1481 | 0.0733 | 0.7093 |
| VP | 0.3924 | 0.0047 | 0.0694 | 0.0037 |
| PP | 0.1003 | 0.0334 | 0.0111 | 1.0000 |

Phishpedia, binary classification results per dataset:

| Dataset | F1 | ROC AUC | MCC |
|---|---|---|---|
| CERT | 0.1598 | 0.5304 | 0.1301 |
| VP | 0.4263 | 0.4544 | −0.0955 |
| PP | 0.9062 | 0.6679 | 0.2729 |

Phishpedia, target recognition results per dataset:

| Dataset | F1 Micro | F1 Macro | MCC | Identification Rate |
|---|---|---|---|---|
| CERT | 0.5482 | 0.2621 | 0.2013 | 0.9845 |
| VP | 0.3782 | 0.3073 | 0.2569 | 0.9270 |
| PP | 0.7691 | 0.2894 | 0.7384 | 0.9154 |

Baseline, binary classification results per dataset:

| Dataset | F1 | ROC AUC | MCC |
|---|---|---|---|
| CERT | 0.6229 | 0.5682 | 0.1449 |
| VP | 0.7673 | 0.8201 | 0.6294 |
| PP | 0.9539 | 0.7759 | 0.5391 |

Baseline, target recognition results per dataset:

| Dataset | F1 Micro | F1 Macro | MCC | Identification Rate |
|---|---|---|---|---|
| CERT | 0.5823 | 0.1670 | 0.3668 | 0.2554 |
| VP | 0.7009 | 0.4111 | 0.5089 | 0.5679 |
| PP | 0.5467 | 0.3591 | 0.4939 | 0.5687 |

Comparison of target recognition results on the CERT dataset:

| Method | F1 Micro | F1 Macro | MCC | Identification Rate |
|---|---|---|---|---|
| VisualPhishNet | 0.3654 | 0.1481 | 0.0733 | 0.7093 |
| Phishpedia | 0.5482 | 0.2621 | 0.2013 | 0.9845 |
| Baseline | 0.5823 | 0.1670 | 0.3668 | 0.2554 |

Comparison of target recognition results on the VP dataset:

| Method | F1 Micro | F1 Macro | MCC | Identification Rate |
|---|---|---|---|---|
| VisualPhishNet | 0.3924 | 0.0047 | 0.0694 | 0.0037 |
| Phishpedia | 0.3782 | 0.3073 | 0.2569 | 0.9270 |
| Baseline | 0.7009 | 0.4111 | 0.5089 | 0.5679 |

Comparison of target recognition results on the PP dataset:

| Method | F1 Micro | F1 Macro | MCC | Identification Rate |
|---|---|---|---|---|
| VisualPhishNet | 0.1003 | 0.0334 | 0.0111 | 1.0000 |
| Phishpedia | 0.7691 | 0.2894 | 0.7384 | 0.9154 |
| Baseline | 0.5467 | 0.3591 | 0.4939 | 0.5687 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Jarczewski, M.; Białczak, P.; Mazurczyk, W. Phishing Website Impersonation: Comparative Analysis of Detection and Target Recognition Methods. Appl. Sci. 2026, 16, 640. https://doi.org/10.3390/app16020640