PhishCluster: Real-Time, Density-Based Discovery of Malicious URL Campaigns from Semantic Embeddings

Dimitrios Karapiperis; Georgios Feretzakis; Sarandis Mitropoulos

doi:10.3390/info17010064

School of Science and Technology, Hellenic Open University, 263 31 Patras, Greece

* Author to whom correspondence should be addressed.

Information 2026, 17(1), 64; https://doi.org/10.3390/info17010064

This article belongs to the Section Information and Communications Technology

Abstract

The proliferation of algorithmically generated malicious URLs has overwhelmed traditional threat intelligence systems, necessitating a shift from reactive, single-instance analysis to proactive, automated campaign discovery. Existing systems excel at finding semantically similar URLs given a known malicious seed but fail to provide a real-time, macroscopic view of emerging and evolving attack campaigns from high-velocity data streams. This paper introduces PhishCluster, a novel framework designed to bridge this gap. PhishCluster implements a two-phase, online–offline architecture that combines large-scale Approximate Nearest Neighbor (ANN) search with density-based clustering. The online phase employs an ANN-accelerated maintenance algorithm to process a stream of URL embeddings at high throughput, summarizing the data into compact, evolving Campaign Micro-Clusters (CMCs). The offline, on-demand phase then applies a hierarchical density-based algorithm to these CMCs, enabling the discovery of arbitrarily shaped, varying-density campaigns without prior knowledge of their number. Our experimental evaluation on a synthetic billion-point dataset, designed to mimic real-world campaign dynamics, demonstrates that PhishCluster’s architecture resolves the fundamental trade-off between speed and quality in streaming data analysis. The results show that PhishCluster achieves an order-of-magnitude improvement in processing throughput over state-of-the-art streaming clustering baselines while simultaneously attaining superior clustering quality and campaign detection fidelity.
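The two-phase design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `MicroCluster`, `online_update`, and `offline_campaigns`, the `radius` and `eps` parameters, and the count-plus-linear-sum summary are all assumptions made for illustration. A brute-force nearest-centroid scan stands in for the ANN index, and a simple connected-components grouping stands in for the hierarchical density-based algorithm.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Hypothetical CMC summary: point count and per-dimension sum."""
    n: int
    linear_sum: list

    @property
    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, x):
        # Update the summary in place; the raw point is not stored.
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]


def nearest(cmcs, x):
    """Brute-force nearest centroid (assumed stand-in for an ANN query)."""
    best_i, best_d = -1, math.inf
    for i, c in enumerate(cmcs):
        d = math.dist(c.centroid, x)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def online_update(cmcs, x, radius):
    """Online phase: absorb x into the closest CMC, or open a new one."""
    i, d = nearest(cmcs, x)
    if i >= 0 and d <= radius:
        cmcs[i].absorb(x)
    else:
        cmcs.append(MicroCluster(1, list(x)))


def offline_campaigns(cmcs, eps):
    """Offline phase (simplified): link CMC centroids closer than eps
    and return one campaign label per CMC via union-find."""
    parent = list(range(len(cmcs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    cents = [c.centroid for c in cmcs]
    for i in range(len(cents)):
        for j in range(i + 1, len(cents)):
            if math.dist(cents[i], cents[j]) <= eps:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(cmcs))]


# Toy stream of 2-D "embeddings"; real URL embeddings are high-dimensional.
stream = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (10.0, 10.0)]
cmcs = []
for point in stream:
    online_update(cmcs, point, radius=0.3)
campaign_labels = offline_campaigns(cmcs, eps=1.5)
```

The point of the split is that the offline step operates on the micro-cluster summaries rather than on the raw stream, so its cost depends on the number of CMCs and not on the number of URLs seen. A production system would presumably replace the linear scan with a real ANN index and the threshold linkage with a hierarchical density-based method.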
