Abstract
The proliferation of algorithmically generated malicious URLs has overwhelmed traditional threat intelligence systems, which calls for a shift from reactive, single-instance analysis to proactive, automated campaign discovery. Existing systems excel at finding semantically similar URLs given a known malicious seed, but they do not provide a real-time, macroscopic view of emerging and evolving attack campaigns in high-velocity data streams. This paper introduces PhishCluster, a framework designed to bridge this gap. PhishCluster implements a two-phase, online–offline architecture that combines large-scale Approximate Nearest Neighbor (ANN) search with density-based clustering. The online phase uses an ANN-accelerated maintenance algorithm to process a stream of URL embeddings at high throughput, summarizing the data into compact, evolving Campaign Micro-Clusters (CMCs). The offline, on-demand phase then applies a hierarchical density-based algorithm to these CMCs, which enables the discovery of arbitrarily shaped, varying-density campaigns without prior knowledge of their number. Our experimental evaluation on a synthetic billion-point dataset, designed to mimic real-world campaign dynamics, shows that PhishCluster’s architecture resolves the trade-off between speed and quality in streaming data analysis: it achieves an order-of-magnitude improvement in processing throughput over state-of-the-art streaming clustering baselines while also attaining superior clustering quality and campaign detection fidelity.
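To make the online phase concrete, the following is a minimal sketch of micro-cluster maintenance over a stream of embeddings. It is an illustration under stated assumptions, not the PhishCluster implementation: the names MicroCluster, nearest, update, and radius are hypothetical, the summary stores only a count and a linear sum, and an exact linear scan stands in for the ANN index. The offline hierarchical density-based phase is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Additive summary of absorbed embeddings (hypothetical layout)."""
    n: int             # number of points absorbed so far
    linear_sum: list   # per-dimension sum of absorbed points

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, x):
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]


def nearest(clusters, x):
    """Exact linear scan over centroids; assumed to be replaced by an
    ANN index query in a large-scale system."""
    best_i, best_d = -1, math.inf
    for i, c in enumerate(clusters):
        d = math.dist(c.centroid(), x)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def update(clusters, x, radius):
    """Absorb x into the nearest micro-cluster if it lies within
    `radius` (an assumed threshold); otherwise open a new one."""
    i, d = nearest(clusters, x)
    if i >= 0 and d <= radius:
        clusters[i].absorb(x)
    else:
        clusters.append(MicroCluster(n=1, linear_sum=list(x)))
    return clusters


# Toy 2-D "embeddings" forming two well-separated groups.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.1)]
cmcs = []
for point in stream:
    update(cmcs, point, radius=0.5)
```

On this toy stream, cmcs ends with two summaries holding 3 and 2 points respectively. Because each summary is additive, an update touches only one micro-cluster, so the per-point cost is dominated by the nearest-centroid lookup, which is the step an ANN index would accelerate.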