Review

Scaling Entity Resolution with K-Means: A Review of Partitioning Techniques

by Dimitrios Karapiperis 1,* and Vassilios S. Verykios 2
1 School of Science and Technology, International Hellenic University, 57001 Thermi, Greece
2 School of Science and Technology, Hellenic Open University, 26335 Patras, Greece
* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3605; https://doi.org/10.3390/electronics14183605
Submission received: 18 August 2025 / Revised: 2 September 2025 / Accepted: 4 September 2025 / Published: 11 September 2025
(This article belongs to the Special Issue Advanced Research in Technology and Information Systems, 2nd Edition)

Abstract

Entity resolution (ER) is a fundamental data integration process hindered by its quadratic computational complexity, making naive comparisons infeasible for large datasets. Blocking (or partitioning) is the foundational strategy to overcome this, traditionally using methods like K-Means clustering to group similar records. However, with the rise of deep learning and high-dimensional vector embeddings, the ER task has evolved into a vector similarity search problem. This review traces the evolution of K-Means from a direct, standalone blocking algorithm into a core partitioning engine within modern Approximate Nearest Neighbor (ANN) indexes. We analyze how its role has been adapted and optimized in partition-based systems like the Inverted File (IVF) index and Google’s SCANN, which are now central to scalable, embedding-based ER. By examining the architectural principles and trade-offs of these methods and contrasting them with non-partitioning alternatives like HNSW, this paper provides a coherent narrative on the journey of K-Means from a simple clustering tool to a critical component for scaling modern ER workflows.

1. Introduction

Entity resolution (ER) is the critical process of identifying and linking records across different datasets that refer to the same real-world entity [1,2]. Its application is a cornerstone of modern data systems, enabling tasks such as consolidating product catalogs in e-commerce, creating unified patient records in healthcare, and detecting sophisticated fraud in finance [3]. The core challenge of ER is its computational complexity. A naive approach, which compares every record against every other, scales quadratically (O(n²)). For even a modest dataset of one million records, this would require nearly 500 billion comparisons, rendering the method unusable for large-scale applications [4,5]. To surmount this scalability barrier, the foundational strategy of blocking (or partitioning) is used. This technique divides the dataset into smaller, manageable subsets (blocks), restricting the expensive comparisons to only the record pairs within the same block [6,7,8].

A key distinction in modern ER is the nature of the dataset itself. A static dataset is fixed and unchanging, characteristic of academic benchmarks and offline batch processing. In contrast, a dynamic dataset is common in production environments and is subject to frequent insertions, deletions, and updates. Furthermore, real-world data often exhibits skewed distributions, where both the data itself (data skew, e.g., some product categories are much larger than others) and user access patterns (query skew, e.g., new items are searched for more frequently) are highly non-uniform. These dynamic and skewed characteristics pose significant challenges for traditional partitioning techniques, motivating the need for more advanced and adaptive indexing systems.
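To make the payoff of blocking concrete, the following Python sketch (illustrative only; the block sizes are hypothetical) compares the naive all-pairs comparison count with the count after partitioning one million records into 1000 equally sized blocks.

```python
def total_pairs(n: int) -> int:
    """All-pairs comparisons for n records: n(n-1)/2."""
    return n * (n - 1) // 2

def blocked_pairs(block_sizes: list[int]) -> int:
    """Comparisons restricted to record pairs that share a block."""
    return sum(total_pairs(b) for b in block_sizes)

n = 1_000_000
print(f"naive:   {total_pairs(n):,}")                # 499,999,500,000 (~500 billion)
print(f"blocked: {blocked_pairs([1000] * 1000):,}")  # 499,500,000 (about 1000x fewer)
```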
While blocking addresses the issue of scale, the effectiveness of the process hinges on the ability to group similar entities correctly. Traditional methods often relied on brittle, lexical techniques like matching on a shared key or string similarity, which struggle with synonyms, typographical errors, and variations in language. The recent advent of deep learning has catalyzed a significant paradigm shift. By using models like BERT, entities can now be transformed into rich, high-dimensional numerical vectors known as embeddings [9,10]. The key advantage of this approach is that embeddings capture semantic similarity—understanding that “laptop” and “notebook computer” refer to the same concept—thus making the matching process far more robust.

This innovation reframes the blocking task as a high-dimensional vector similarity search problem. Since an exact search is still too slow, the field has overwhelmingly adopted Approximate Nearest Neighbor (ANN) indexes, which trade a small, often negligible, amount of accuracy (recall) for massive improvements in speed [11,12]. These indexes often rely on a K-Means-based partitioning scheme to implement a scalable “divide and conquer” search strategy. The rapid proliferation of language models has made selecting an appropriate one for a given ER task a non-trivial challenge, motivating comprehensive experimental analyses of their relative performance in blocking and matching contexts [13]. An efficient ANN system for Maximum Inner Product Search (MIPS)—a common goal in this context—must both reduce the number of candidate vectors to score and accelerate the scoring process itself [14]. The first objective is typically met with space partitioning (e.g., via trees or hashing), while the second is achieved through quantization [15].

Locality-Sensitive Hashing (LSH) provides a crucial conceptual bridge from traditional blocking to modern ANN techniques. LSH uses hash functions designed so that similar items are more likely to collide in the same hash bucket, effectively creating blocks [16]. Modern variants like NLSHBlock extend this by using deep neural networks to learn optimal hash functions directly from the data [17,18]. A recent approach, LSBlock, proposes a hybrid system that combines MinHash-based lexical blocking with a semantic refinement step using dense embeddings to improve recall while maintaining high precision [19].

Despite their power, a critical challenge remains: the majority of existing ANN methods were designed and benchmarked assuming static data. This assumption breaks down in real-world production environments, which are characterized by dynamic and skewed workloads [20,21,22]. Data distributions constantly evolve as new information is added and old data is deleted, a phenomenon known as concept drift. Furthermore, user access patterns are rarely uniform; queries often concentrate on a small but shifting set of popular items, from trending products to breaking news stories [23,24,25]. This mismatch between static design and dynamic reality leads to significant performance degradation, motivating the development of a new class of adaptive indexing systems [26]. Alongside this, the batch-processing nature of traditional ER is ill-suited for applications with tight time or computational constraints, necessitating a move towards progressive approaches that can deliver results incrementally [27].
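Returning to the LSH bridge above, the following minimal Python sketch implements the classic random-hyperplane LSH family for cosine similarity; it is a simplified stand-in for learned variants such as NLSHBlock, and all function names, parameters, and data are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_blocks(embeddings: np.ndarray, num_bits: int = 16) -> dict:
    """Group vectors whose random-hyperplane signatures collide.

    Each bit records which side of a random hyperplane a vector falls on;
    cosine-similar vectors are likely to share the full signature, so each
    signature acts as a blocking key.
    """
    hyperplanes = rng.normal(size=(num_bits, embeddings.shape[1]))
    signs = (embeddings @ hyperplanes.T) > 0          # (n, num_bits) booleans
    blocks: dict[tuple, list[int]] = {}
    for idx, sig in enumerate(map(tuple, signs)):
        blocks.setdefault(sig, []).append(idx)
    return blocks

vecs = rng.normal(size=(10_000, 128))                 # toy "entity embeddings"
blocks = lsh_blocks(vecs)
print(len(blocks), "blocks; largest holds", max(map(len, blocks.values())), "records")
```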
This paper reviews the evolution of partition-based blocking techniques, using the K-Means algorithm as a foundational thread to connect its origins as a direct clustering method to its modern role as a core component in sophisticated ANN indexing systems. Our main contributions are as follows:
  • Historical Context: We trace the evolution of K-Means from its early use as a straightforward clustering tool for blocking to its modern role as a core partitioning engine in state-of-the-art ANN indexes like the Inverted File (IVF) system.
  • Architectural Analysis: We dissect the architecture of leading partition-based ANN systems, focusing on the K-Means-driven design of Google’s SCANN and contrasting it with the alternative graph-based paradigm exemplified by HNSW.
  • Survey of the Cutting Edge: We analyze the limitations of static indexes in dynamic environments and provide a survey of the emerging field of adaptive indexing, which represents the next frontier in scalable vector search.
  • A Framework for Practitioners: By synthesizing the literature, we clarify the trade-offs between different approaches and provide a structured framework, including a decision flowchart, to help researchers and practitioners select and optimize a partitioning strategy suited to their specific ER task.

2. Background and Related Work

2.1. The ER Workflow: From Blocking to Matching

A standard ER workflow is a multi-stage pipeline that typically includes data preprocessing, blocking, comparison (matching), and clustering [28,29]. The blocking stage generates candidate pairs of records that are then assessed by a more sophisticated and computationally intensive matching function to make the final match/non-match decision [30,31]. This two-phase process creates a fundamental trade-off: effective blocking must capture as many true matches as possible (high Recall), while remaining efficient by minimizing unnecessary comparisons, an objective measured by metrics such as Precision and Reduction Ratio.
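The blocking trade-off can be quantified with standard pair-based metrics. The helper below is our own minimal sketch (with hypothetical toy pairs): recall here is pair completeness, precision is pairs quality, and the reduction ratio measures how many of the n(n-1)/2 possible comparisons were avoided.

```python
def blocking_metrics(candidates: set, true_matches: set, n_records: int):
    """Recall (pair completeness), precision (pairs quality), reduction ratio."""
    found = candidates & true_matches
    recall = len(found) / len(true_matches)
    precision = len(found) / len(candidates)
    reduction_ratio = 1.0 - len(candidates) / (n_records * (n_records - 1) / 2)
    return recall, precision, reduction_ratio

# Toy example: 4 candidate pairs out of 45 possible, 3 true matching pairs.
cands = {(0, 1), (2, 3), (4, 5), (6, 7)}
truth = {(0, 1), (2, 3), (8, 9)}
print(blocking_metrics(cands, truth, n_records=10))   # -> (0.667, 0.5, 0.911)
```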

2.2. A Brief Taxonomy of Blocking and Matching Methods

The extensive body of research on ER can be broadly categorized. A primary distinction is between non-learned and learned methods. Non-learned, or traditional, methods rely on manually defined rules. Standard Blocking uses a single attribute as a “blocking key,” which is simple but brittle [3]. To improve recall, redundancy-positive methods like Token Blocking create a block for every word in an attribute, increasing the chance of a match at the cost of generating many superfluous comparisons [32]. The poor precision of high-recall methods led to Meta-Blocking, a sub-field dedicated to refining the output of a coarse blocking phase by pruning the candidate graph [8,33,34].

Learned methods aim to automate the ER process. This began with traditional machine learning models that required extensive manual feature engineering [3]. The major breakthrough came with the application of deep learning, which can learn features automatically. Early works like DeepER used RNNs with pre-trained GloVe embeddings to learn tuple representations [10]. This was followed by a design space exploration in DeepMatcher, which systematically evaluated different neural architectures (RNNs, Attention, Hybrids) for matching structured, textual, and dirty data [30]. The current state of the art is dominated by systems that fine-tune large, pre-trained language models (LMs) like BERT. DITTO demonstrated that casting ER as a sequence-pair classification problem and fine-tuning LMs could significantly outperform previous methods, especially with optimizations like data augmentation and domain knowledge injection [31]. Further research has explored contrastive self-supervised learning frameworks like Sudowoodo to learn similarity-aware representations without labels, making these powerful models applicable to a wider range of data integration tasks [35]. Other works have moved beyond pairwise comparisons; for example, GNEM frames ER as a one-to-set problem, using a graph neural network to collectively evaluate a candidate record against a set of other records [36].

Learned methods have also been developed for the blocking stage. AutoBlock provides a hands-off framework that learns similarity-preserving vector representations from a small set of positive labels, which are then used with LSH to find candidate pairs [37]. More recently, SC-Block combines supervised contrastive learning with nearest neighbor search to create highly precise candidate sets, leading to significant speedups in the overall ER pipeline [38].

2.3. Foundational Partitioning with K-Means

The direct application of K-Means to vectorized entity records is an intuitive approach to learned blocking [6,39]. The algorithm partitions the data into k disjoint, spherical clusters, with each cluster serving as a block [40]. However, this method has several inherent limitations for the ER task: it requires pre-specifying the number of clusters k, is sensitive to initialization, and its “hard” partitions can easily separate true matches that lie near a boundary [1,39]. An alternative, Canopy Clustering, was designed to produce overlapping clusters using two distance thresholds, making it inherently more suitable for the high-recall needs of ER [7,41].
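A minimal sketch of direct K-Means blocking follows, assuming records have already been embedded as vectors; the dataset, k, and library choice (scikit-learn) are illustrative rather than prescribed by the cited works.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def kmeans_blocks(embeddings: np.ndarray, k: int) -> list[np.ndarray]:
    """Partition record embeddings into k disjoint blocks (one per cluster).

    Note the limitation discussed above: these "hard" partitions can split
    true matches that lie near a cluster boundary.
    """
    labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(embeddings)
    return [np.flatnonzero(labels == c) for c in range(k)]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(50_000, 64))      # stand-in for vectorized entity records
for block in kmeans_blocks(vecs, k=100)[:3]:
    print(len(block), "records")          # expensive matching stays within each block
```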

3. The Evolution and Adaptation of K-Means for ER

The limitations of standard K-Means catalyzed its evolution from a monolithic solution to a valuable component in larger pipelines. A critical first step was addressing its scalability. The k-Means|| algorithm overcomes the sequential bottleneck of k-means++ initialization by parallelizing it, making K-Means viable for large-scale data [42]. More advanced applications integrated K-Means into hybrid workflows like the KLSH (K-Means LSH) model, where it serves as a refinement tool to break down larger, pre-identified groups into tighter blocks [43]. The most profound conceptual shift is exemplified by the kMkNN (k-Means for k-Nearest Neighbors) algorithm [44]. Here, K-Means is used not for blocking, but to build an index that accelerates an exact k-NN search. The K-Means partitions are used with the triangle inequality to prune the search space. This use of K-Means to create an intermediate data structure for accelerating a search is a direct architectural precursor to the partitioning stage found in modern ANN systems.
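The following numpy sketch is our simplified reconstruction of the triangle-inequality pruning behind kMkNN [44], reduced to exact 1-NN search; the data and parameter choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def exact_nn_with_pruning(query, data, labels, centroids):
    """Exact 1-NN search that skips whole clusters via the triangle inequality.

    For any x in cluster j: d(q, x) >= d(q, c_j) - radius_j, so a cluster is
    discarded once this lower bound exceeds the best distance found so far.
    """
    radius = np.array([np.linalg.norm(data[labels == j] - c, axis=1).max()
                       for j, c in enumerate(centroids)])
    d_centroid = np.linalg.norm(centroids - query, axis=1)
    best_dist, best_idx = np.inf, -1
    for j in np.argsort(d_centroid):                # visit nearest clusters first
        if d_centroid[j] - radius[j] >= best_dist:  # bound too large: prune cluster
            continue
        members = np.flatnonzero(labels == j)
        dists = np.linalg.norm(data[members] - query, axis=1)
        if dists.min() < best_dist:
            best_dist, best_idx = dists.min(), members[dists.argmin()]
    return best_idx, best_dist

rng = np.random.default_rng(1)
data = rng.normal(size=(20_000, 32))
km = KMeans(n_clusters=50, n_init=4, random_state=1).fit(data)
print(exact_nn_with_pruning(data[0] + 0.01, data, km.labels_, km.cluster_centers_))
```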

4. Vector Quantization for Efficient ANN Search

As ANN search became central to blocking, Vector Quantization (VQ) emerged as a core technology for managing the high memory footprint and computational cost of dense vectors. VQ is a data compression technique that maps a large set of vectors to a smaller, finite “codebook” of representative vectors, reducing memory and accelerating distance calculations [45,46].

4.1. The Sub-Optimality of Traditional Quantization

Most traditional VQ techniques aim to minimize the reconstruction error of the database points—that is, the Euclidean distance between an original vector and its compressed representation [47]. However, this objective is suboptimal for MIPS [48]. The key insight is that for any given query, the quantization error for database points that have a high inner product (i.e., are more relevant) is far more important than the error for irrelevant points with low inner products. Minimizing a generic reconstruction error treats all points equally, failing to prioritize the ones that actually matter for the search result.

4.2. Product Quantization (PQ)

A particularly influential VQ method is Product Quantization (PQ) [45]. Instead of quantizing a high-dimensional vector directly, PQ splits it into multiple, lower-dimensional sub-vectors. A separate, small codebook is then learned for each of these sub-spaces, typically using K-Means. The original vector is represented by a short code composed of the concatenation of the IDs of the nearest codebook entry for each sub-vector. This enables extremely fast, approximate distance calculations using a method called Asymmetric Distance Computation (ADC), which relies on pre-computed lookup tables and avoids expensive floating-point operations.
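The sketch below is a deliberately minimal PQ implementation in the spirit of [45]: one K-Means codebook per sub-space and ADC via lookup tables. The class name and parameter defaults (m = 8 sub-spaces, 256 centroids each) follow common conventions but are otherwise our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    """Minimal PQ: split vectors into m sub-vectors, one K-Means codebook each."""

    def __init__(self, m: int = 8, ksub: int = 256):
        self.m, self.ksub = m, ksub

    def fit(self, X: np.ndarray) -> "ProductQuantizer":
        self.dsub = X.shape[1] // self.m          # dimensionality of each sub-space
        self.codebooks = [
            KMeans(n_clusters=self.ksub, n_init=1, random_state=0)
            .fit(X[:, i * self.dsub:(i + 1) * self.dsub]).cluster_centers_
            for i in range(self.m)
        ]
        return self

    def encode(self, X: np.ndarray) -> np.ndarray:
        """Each vector becomes m one-byte codes (nearest codebook ID per sub-space)."""
        codes = np.empty((len(X), self.m), dtype=np.uint8)
        for i, cb in enumerate(self.codebooks):
            sub = X[:, i * self.dsub:(i + 1) * self.dsub]
            codes[:, i] = np.argmin(((sub[:, None] - cb[None]) ** 2).sum(-1), axis=1)
        return codes

    def adc(self, query: np.ndarray, codes: np.ndarray) -> np.ndarray:
        """Asymmetric Distance Computation: distances via per-sub-space lookup tables."""
        tables = np.stack([
            ((query[i * self.dsub:(i + 1) * self.dsub] - cb) ** 2).sum(-1)
            for i, cb in enumerate(self.codebooks)
        ])                                        # shape (m, ksub): query vs. codewords
        return tables[np.arange(self.m), codes].sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 64)).astype(np.float32)
pq = ProductQuantizer().fit(X)
codes = pq.encode(X)                  # 8 bytes per vector instead of 64 floats
print(pq.adc(X[0], codes).argmin())   # the query's own code should rank (near) first
```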

5. A Foundational Partitioned Index: The Inverted File System (IVF)

The Inverted File (IVF) system, notably implemented in Facebook AI’s Faiss library [49], stands as a canonical example of a partition-based ANN index. Its scalability is built on a “divide and conquer” strategy, for which K-Means is the enabling partitioning algorithm.
The core mechanism is a two-stage search process that avoids a full scan of the dataset. First, during the indexing phase, K-Means is used as a “coarse quantizer” to partition the entire vector space into k Voronoi cells, each represented by a centroid [49]. Each vector from the dataset is then assigned to the cell of its nearest centroid. The resulting index is an “inverted file” where each centroid ID maps to a list of the vectors it contains [49]. Second, during the query phase, a search is executed in two steps:
  • Coarse Search: The query vector is compared only against the k centroids to find the most promising partitions. Since k is orders of magnitude smaller than the total number of vectors, this step is extremely fast.
  • Fine-Grained Search: The system identifies the ‘nprobe’ nearest centroids, where ‘nprobe’ is a tunable parameter. An exhaustive search is then performed only on the vectors contained in the inverted lists of these few partitions [49].
This two-stage approach dramatically reduces the number of distance calculations, directly addressing the scalability challenge of searching in massive datasets. To accelerate the search within the selected lists further, the vectors are often compressed using techniques like Product Quantization (PQ) [45]. The performance of an IVF index is governed by the trade-off between its key parameters: the number of partitions (k) and the number of partitions to probe (‘nprobe’). A larger k creates more, smaller partitions, which can speed up the final exhaustive search. However, if ‘nprobe’ is too small, the true nearest neighbors might be missed if they fall into an un-probed partition, thus lowering recall. Increasing ‘nprobe’ improves recall at the direct cost of higher query latency. Tuning these parameters allows practitioners to balance speed, accuracy, and memory usage for their specific application.
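A brief usage sketch of this two-stage design with Faiss’s IndexIVFFlat [49] follows; the synthetic data and the nlist/nprobe values are illustrative only.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")   # database embeddings
xq = xb[:5] + 0.01                               # slightly perturbed queries

nlist = 1024                                     # k: number of K-Means partitions
quantizer = faiss.IndexFlatL2(d)                 # coarse search over the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                  # runs K-Means on the database vectors
index.add(xb)                                    # fills the inverted lists

index.nprobe = 8                                 # partitions scanned per query
D, I = index.search(xq, 5)                       # fine-grained search in those lists
print(I[:, 0])                                   # likely [0 1 2 3 4] at this nprobe
```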

6. State-of-the-Art Partition-Based Blocking: Google’s SCANN

Google’s SCANN (Scalable Nearest Neighbors) is a state-of-the-art library that represents the culmination of the partition-based approach to ANN search, designed to overcome the limitations of traditional quantization. Its architecture is a highly optimized, three-stage pipeline: partitioning with a K-Means-like algorithm, vector quantization within partitions, and re-ranking of top candidates. Its landmark contribution is a score-aware quantization loss function, which leads to a new technique called anisotropic vector quantization.

To understand this concept, one can use an analogy. Imagine a vector as an arrow pointing from the origin. A standard (isotropic) quantizer tries to find a compressed representation that is simply close in distance, creating a spherical “ball” of error around the arrow’s tip. SCANN’s anisotropic approach is different: it recognizes that for inner product search, the vector’s length and direction are paramount. Its quantization method creates an error shape more like a narrow ellipse, allowing more error “sideways” (perpendicular to the vector’s direction) while being extremely strict about errors “forwards or backwards” (parallel to the vector).

Instead of minimizing a generic reconstruction error, this score-aware loss function targets the final MIPS objective. It achieves this by more heavily penalizing the component of the quantization error that is parallel to the datapoint’s direction, as this component has a larger effect on the inner product score. This allows SCANN to preserve the ranking of high-scoring items much more accurately than traditional VQ, leading to state-of-the-art performance, particularly in the high-recall regime.
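To make this geometry concrete, the following numpy sketch (our own illustration, not SCANN’s implementation) decomposes the quantization residual into components parallel and orthogonal to the datapoint and weights the parallel one more heavily; the weight eta is a hypothetical constant standing in for the paper’s score-aware weighting.

```python
import numpy as np

def anisotropic_loss(x: np.ndarray, x_quantized: np.ndarray, eta: float = 4.0) -> float:
    """Score-aware loss sketch: residual error parallel to x shifts inner
    products <q, x> the most for relevant queries, so it is penalized more."""
    r = x - x_quantized                           # quantization residual
    x_dir = x / np.linalg.norm(x)
    r_par = (r @ x_dir) * x_dir                   # component parallel to x
    r_orth = r - r_par                            # component orthogonal to x
    return eta * (r_par @ r_par) + (r_orth @ r_orth)

x = np.array([3.0, 4.0])                          # ||x|| = 5, direction (0.6, 0.8)
xq_orth = x - 0.5 * np.array([-0.8, 0.6])         # residual orthogonal to x
xq_par = x - 0.5 * np.array([0.6, 0.8])           # residual parallel to x
# Both candidates have identical reconstruction error (0.25), but the
# anisotropic loss prefers the one whose error is orthogonal: 0.25 vs. 1.0.
print(anisotropic_loss(x, xq_orth), anisotropic_loss(x, xq_par))
```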

7. An Alternative Paradigm: Graph-Based Indexing with HNSW

While this review focuses on the evolution of partition-based methods rooted in K-Means, it is instructive to contrast this paradigm with an alternative, graph-based approach to understand the broader architectural trade-offs in ANN search. The leading example is HNSW (Hierarchical Navigable Small World) [50]. Instead of partitioning the data space, HNSW organizes vectors into a multi-layered proximity graph where nodes are vectors and edges represent nearness. The groundbreaking innovation of HNSW is its hierarchy. The graph consists of multiple layers, from a sparse top layer with long-range links for coarse navigation, down to a dense bottom layer containing every vector with short-range links for fine-grained exploration. A search proceeds by “greedily” traversing the graph in a coarse-to-fine manner, achieving logarithmic scaling of search time. While highly effective in static settings, its performance can suffer under dynamic workloads due to the high computational cost of updating the graph structure [25].

The high cost of updates stems from the complexity of maintaining the graph’s structural properties. Inserting a single new vector requires traversing the graph to find its neighbors at each layer and then rewiring numerous pointers, a non-local operation that can cause significant lock contention in concurrent environments. Deletions are even more problematic, as they can fragment the graph and degrade search performance over time if not handled properly through periodic rebuilding. Recent research aims to mitigate these issues through techniques like asynchronous graph rebuilding, tiered indexing that places recent updates in a separate dynamic structure, and new algorithms for more efficient incremental graph construction [51].
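For contrast with the partition-based examples above, here is a minimal usage sketch of the hnswlib implementation of HNSW [50]; M, ef_construction, and ef are the usual graph-degree and beam-width knobs, and the values shown are illustrative.

```python
import numpy as np
import hnswlib  # pip install hnswlib

d, n = 128, 100_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, d)).astype("float32")

index = hnswlib.Index(space="l2", dim=d)
index.init_index(max_elements=n, M=16, ef_construction=200)  # graph degree / build beam
index.add_items(data, np.arange(n))      # incrementally builds the layered graph

index.set_ef(64)                         # search beam width: the recall/latency knob
labels, dists = index.knn_query(data[:5] + 0.01, k=5)
print(labels[:, 0])                      # likely [0 1 2 3 4] at this ef
```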

8. Comparative Analysis of Indexing Paradigms

To provide a structured summary of the methods discussed, Table 1 contrasts the key architectural and performance characteristics of the foundational IVF, the advanced partition-based SCANN, and the graph-based HNSW. This comparison highlights the fundamental trade-offs a practitioner must consider when selecting an indexing strategy for an entity resolution task.

9. Shortcomings of Existing Approaches and the Next Frontier in Adaptive Indexing

9.1. Limitations of Graph-Based Indexes

Graph-based indexes, such as HNSW [50], DiskANN [24], and SVS, achieve high recall with low latency in static settings by efficiently traversing a proximity graph. However, maintaining this graph structure under frequent updates is computationally intensive. Each insertion or deletion may require rewiring multiple edges to preserve the graph’s connectivity and proximity properties, leading to very high update latency [25]. This makes them less suitable for environments with high data churn.

9.2. Limitations of Partitioned Indexes

Partitioned indexes, such as Faiss-IVF, SCANN, and SpFresh [25], are more update-friendly as insertions and deletions often translate to more sequential access patterns. However, they face their own set of challenges. Under skewed write patterns, some partitions become significantly larger than others. If these large partitions are also frequently accessed due to read skew (“hot partitions”), query latency degrades significantly. Moreover, most partitioned indexes use a fixed number of partitions to scan (‘nprobe’). As the index structure changes due to updates, a static ‘nprobe’ is no longer optimal, leading to a drop in recall or excessive data scanning.

9.3. Limitations of Early Termination Methods

To address the static ‘nprobe’ problem, several early-termination methods have been proposed to dynamically adjust the number of partitions scanned per query. SPANN [22] prunes partitions once their centroid distance exceeds a user-tuned threshold. LAET [52] uses a trained model to predict the required ‘nprobe’, but requires dataset-specific training for each recall target. Auncel [53] uses a geometric model to estimate recall, but its conservative estimation can lead to overshooting the recall target. A common weakness of these methods is that they require manual tuning or calibration and, crucially, assume a static index structure, making them brittle in dynamic environments.

9.4. The Next Frontier: Adaptive Indexing with Quake

The shortcomings of existing methods highlight the need for a new class of adaptive indexes designed explicitly for dynamic, skewed workloads. A state-of-the-art example is Quake, a system that extends the partitioned index paradigm with mechanisms for continuous adaptation [26]. It represents a class of systems that aim for self-management, alongside other recent advances that focus on learning optimal index structures on the fly [54]. Quake addresses the key technical challenges by integrating three core innovations. First, it uses a cost-model-guided maintenance scheme to continuously monitor partition sizes and access frequencies, triggering maintenance actions like splitting hot partitions or merging cold ones to minimize a query latency cost function. Second, its Adaptive Partition Scanning (APS) dynamically determines the number of partitions to scan for each query to meet a recall target, using a geometric model that adapts on-the-fly to changes in the index structure. Third, it employs NUMA-aware parallelism to maximize memory bandwidth and close the performance gap with graph-based indexes. By integrating these mechanisms, Quake represents a path forward for building robust vector search systems that can maintain high performance in real-world, dynamic environments.

10. Illustrative Performance Benchmark

The following experiments provide an illustrative benchmark of the ANN methods discussed. The goal is not to present a rigorous, novel experimental study, but to offer a practical context for the performance trade-offs (Recall vs. Queries per Second (QPS)) between the foundational IVF, the advanced partition-based SCANN, and the alternative graph-based HNSW paradigm on common ER datasets.
To contextualize the performance of modern ANN techniques, we present experiments on two widely used benchmark datasets for ER: Scholar-DBLP (65 K and 12.5 K entities) and the larger DBLP-Synthetic (3 million entities per dataset). We compare three representative ANN implementations: IVF, HNSW, and SCANN. The primary evaluation metric is the trade-off between Recall (accuracy) and QPS (speed).
As illustrated in Figure 1 and Figure 2, both HNSW and SCANN consistently define the Pareto frontier for performance. HNSW typically excels at achieving the highest recall levels with very high speed, although this can come at the cost of higher memory usage and longer index-building times. SCANN proves to be highly competitive, demonstrating that a sophisticated “divide and conquer” approach can achieve state-of-the-art results, often with lower memory overhead. The IVF index, while outperformed in terms of peak QPS at high recall, offers a different and still valuable trade-off: it is exceptionally scalable, memory-efficient, and conceptually simple, making it a robust and practical choice for extremely large datasets where resource constraints are a primary concern.

11. A Practitioner’s Framework for Method Selection

Selecting the right partitioning and indexing strategy depends on a clear understanding of the application’s constraints and goals. The flowchart in Figure 3 turns this choice into a practical decision-making framework. The initial and most critical decision point hinges on the nature of the dataset. If the dataset is dynamic, characterized by frequent updates, the framework directly recommends employing a modern adaptive indexing system. For static datasets, the selection process then depends on the specific performance priority. When the primary objective is maximizing throughput (QPS), a highly optimized system like SCANN is the ideal choice. Finally, for applications at a massive scale where memory efficiency is the main concern, the foundational IVF method remains a robust and practical option.
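The flowchart’s logic can be condensed into a small selection helper; this is our own reading of Figure 3 and of the trade-offs in Table 1, not an exhaustive policy.

```python
def choose_index(dynamic: bool, priority: str) -> str:
    """Sketch of the Figure 3 decision flow for picking an ANN indexing strategy."""
    if dynamic:                        # frequent inserts/deletes, skewed workloads
        return "adaptive partitioned index (e.g., Quake)"
    if priority == "throughput":       # static data, maximize QPS at a recall target
        return "SCANN"
    if priority == "memory":           # massive static data, tight memory budget
        return "IVF (optionally with PQ compression)"
    return "HNSW"                      # static data, highest recall at low latency

print(choose_index(dynamic=False, priority="throughput"))   # -> SCANN
```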

12. Conclusions and Future Directions

This review has traced the evolution of partition-based blocking from the foundational application of K-Means to the sophisticated and adaptive ANN architectures of today. This analysis clarifies how a classic clustering algorithm has been repurposed and refined to become an indispensable component for scaling modern, embedding-based entity resolution. The core concept of partitioning, pioneered by K-Means, endures as a vital scalability mechanism even in the most advanced systems.

The field of blocking continues to evolve rapidly. The limitations of static indexes in dynamic environments have made adaptive indexing, as exemplified by systems like Quake, the new frontier [26]. Future work will likely focus on more sophisticated, self-tuning systems that can automatically adjust their structure and search strategies in response to evolving data and query patterns, removing the need for manual parameter setting entirely. The development of learned cost models, query-aware maintenance policies, and deeper integration with heterogeneous hardware are all promising directions.

12.1. Opportunities in Hardware Acceleration

The performance of ANN indexing is intrinsically linked to the underlying hardware. Future advances will likely leverage specialized hardware to overcome computational bottlenecks. GPUs and TPUs, with their massive parallelism, are well-suited for accelerating the brute-force distance calculations required in both K-Means clustering and the final search phase of partition-based methods. Furthermore, emerging paradigms like processing-in-memory (PIM) aim to reduce the costly data movement between CPU and memory by performing computations directly where the index is stored. Such hardware integration could fundamentally alter the performance trade-offs between different indexing strategies.

12.2. Ethical Considerations and Bias

As ER systems increasingly rely on learned embeddings and models, it is crucial to consider the ethical implications. Learned models can inherit and amplify societal biases present in their training data. In sensitive domains such as healthcare, finance, or law enforcement, a biased ER system could lead to discriminatory outcomes, for instance, by incorrectly linking individuals from certain demographic groups more often than others. Future work must prioritize the development of techniques for fairness audits, bias detection in embeddings, and building transparent, interpretable ER systems to ensure they are deployed responsibly.
The choice of a blocking strategy remains a critical architectural decision influenced by the specific demands of the ER task. For educational purposes, K-Means is a valuable baseline. For massive, static datasets where throughput is key, SCANN is an exceptional choice. For applications requiring the highest recall and low latency, HNSW is often preferred. However, for real-world, dynamic workloads, practitioners should look towards the next generation of adaptive systems to ensure robust and stable performance over time.

Author Contributions

Conceptualization and writing, D.K. and V.S.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No data are available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Papadakis, G.; Tsekouras, L.; Thanos, E.; Giannopoulos, G.; Koubarakis, M.; Palpanas, T. Blocking and filtering techniques for entity resolution: A survey. ACM Comput. Surv. 2020, 53, 1–38. [Google Scholar]
  2. Elmagarmid, A.K.; Ipeirotis, P.G.; Verykios, V.S. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 2007, 19, 1–16. [Google Scholar] [CrossRef]
  3. Christen, P. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  4. Hernandez, M.A.; Stolfo, S.J. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, USA, 22–25 May 1995; pp. 127–138. [Google Scholar]
  5. Papadakis, G.; Koutras, N.; Koubarakis, M.; Palpanas, T. A comparative analysis of blocking methods for entity resolution. Inf. Syst. 2016, 57, 1–22. [Google Scholar]
  6. McCallum, A.; Nigam, K.; Ungar, L.H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2000; pp. 169–178. [Google Scholar]
  7. Papadakis, G.; Ioannou, E.; Palpanas, T.; Niederee, C.; Nejdl, W. Beyond keyword search: Discovering entities in the web of data. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 583–592. [Google Scholar]
  8. Papadakis, G.; Koubarakis, M.; Palpanas, T. A survey of blocking and filtering methods for entity resolution. arXiv 2013, arXiv:1305.1581. [Google Scholar]
  9. Thirumuruganathan, S.; Galhotra, V.; Mussmann, S.; Gummadi, A.; Das, G. DeepBlocker: A deep learning based blocker for entity resolution. In Proceedings of the 2021 International Conference on Management of Data, Virtual Event, China, 20–25 June 2021; pp. 1639–1652. [Google Scholar]
  10. Ebraheem, M.; Thirumuruganathan, S.; Joty, S.; Ouzzani, M.; Tang, N. Distributed representations of tuples for entity resolution. PVLDB 2018, 11, 1454–1467. [Google Scholar]
  11. Indyk, P.; Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Dallas, TX, USA, 24–26 May 1998; pp. 604–613. [Google Scholar]
  12. Wang, J.; Shen, H.T.; Song, J.; Ji, J. Hashing for similarity search: A survey. arXiv 2014, arXiv:1408.2927. [Google Scholar] [CrossRef]
  13. Zeakis, A.; Papadakis, G.; Skoutas, D.; Koubarakis, M. Pre-trained Embeddings for Entity Resolution: An Experimental Analysis. PVLDB 2023, 16, 2225–2238. [Google Scholar] [CrossRef]
  14. Guo, R.; Kumar, S.; Choromanski, K.; Simcha, D. Quantization based fast inner product search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 482–490. [Google Scholar]
  15. Charikar, M.S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, Montréal, QC, Canada, 19–21 May 2002; pp. 380–388. [Google Scholar]
  16. Karapiperis, D.; Verykios, V.S. An LSH-based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage. Trans. Knowl. Data Eng. 2015, 27, 909–921. [Google Scholar] [CrossRef]
  17. Shen, J.; Li, P.; Wang, Y.; Wang, Y.; Zhang, C. Neural-LSH for deep learning based blocking. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 1923–1936. [Google Scholar]
  18. Shen, J.; Li, P.; Wang, Y.; Wang, Y.; Zhang, C. Neural-LSH for deep learning based blocking: Extended version. arXiv 2023, arXiv:2303.04543. [Google Scholar]
  19. Karapiperis, D.; Tjortjis, C.; Verykios, V.S. LSBlock: A Hybrid Blocking System Combining Lexical and Semantic Similarity Search for Record Linkage. In Proceedings of the 29th European Conference on Advances in Databases and Information Systems, Advances in Databases and Information Systems (ADBIS), Tampere, Finland, 23–26 September 2025. [Google Scholar]
  20. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PmLR: New York, NY, USA, 2021. [Google Scholar]
  21. Waleffe, R.; Mohoney, J.; Rekatsinas, T.; Venkataraman, S. Mariusgnn: Resource-efficient out-of-core training of graph neural networks. In Proceedings of the ACM SIGOPS European Conference on Computer Systems (EuroSys), Rome, Italy, 8–12 May 2023. [Google Scholar]
  22. Chen, Q.; Zhao, B.; Wang, H.; Li, M.; Liu, C.; Li, Z.; Yang, M.; Wang, J. SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search. arXiv 2021, arXiv:2111.08566. [Google Scholar]
  23. Baranchuk, D.; Douze, M.; Upadhyay, Y.; Yalniz, I.Z. DeDrift: Robust Similarity Search under Content Drift. arXiv 2023, arXiv:2308.02752. [Google Scholar] [CrossRef]
  24. Subramanya, S.J.; Devvrit; Kadekodi, R.; Krishnaswamy, R.; Simhadri, H.V. DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node; Curran Associates Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
  25. Xu, Y.; Liang, H.; Li, J.; Xu, S.; Chen, Q.; Zhang, Q.; Li, C.; Yang, Z.; Yang, F.; Yang, Y.; et al. SPFresh: Incremental In-Place Update for Billion-Scale Vector Search. In Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, Germany, 23–26 October 2023; pp. 545–561. [Google Scholar]
  26. Mohoney, J.; Sarda, D.; Tang, M.; Chowdhury, S.R.; Pacaci, A.; Ilyas, I.F.; Rekatsinas, T.; Venkataraman, S. Quake: Adaptive Indexing for Vector Search. arXiv 2025, arXiv:2506.03437. [Google Scholar] [CrossRef]
  27. Maciejewski, J.; Nikoletos, K.; Papadakis, G.; Velegrakis, Y. Progressive entity matching: A design space exploration. Proc. ACM Manag. Data 2025, 3, 65. [Google Scholar] [CrossRef]
  28. Papadakis, G.; Skoutas, D.; Palpanas, T.; Koubarakis, M. A survey of entity resolution in the web of data. In The Semantic Web: Semantics and Big Data; Springer: Berlin/Heidelberg, Germany, 2013; pp. 15–30. [Google Scholar]
  29. Whang, S.E.; Benjelloun, O.; Garcia-Molina, H. Generic entity resolution with negative rules. VLDB J. 2009, 18, 1261–1277. [Google Scholar] [CrossRef]
  30. Mudgal, S.; Li, H.; Rekatsinas, T.; Doan, A.; Park, Y.; Krishnan, G.; Deep, R.; Arcaute, E.; Raghavendra, V. Deep learning for entity matching: A design space exploration. In Proceedings of the SIGMOD/PODS ’18: International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; pp. 19–34. [Google Scholar]
  31. Li, Y.; Li, J.; Suhara, Y.; Doan, A.; Tan, W. Deep entity matching with pre-trained language models. PVLDB 2021, 14, 50–60. [Google Scholar] [CrossRef]
  32. Papadakis, G.; Ioannou, E.; Palpanas, T.; Niederee, C.; Nejdl, W. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 2011, 25, 1665–1682. [Google Scholar] [CrossRef]
  33. Efthymiou, V.; Papadakis, G.; Stefanidis, K. MinoanER: A representative-based approach to progressive entity resolution. In Proceedings of the 20th International Conference on Extending Database Technology (EDBT), Venice, Italy, 21–24 March 2017; pp. 25–36. [Google Scholar]
  34. Skoutas, D.; Alexiou, G.; Papadakis, G.; Thanos, E.; Koubarakis, M. Lightweight and effective meta-blocking for entity resolution. Inf. Syst. 2022, 107, 101899. [Google Scholar]
  35. Wang, R.; Li, Y.; Wang, J. Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation. arXiv 2022, arXiv:2207.04122. [Google Scholar]
  36. Chen, R.; Shen, Y.; Zhang, D. GNEM: A Generic One-to-Set Neural Entity Matching Framework. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1686–1694. [Google Scholar]
  37. Zhang, W.; Wei, H.; Sisman, B.; Dong, X.L.; Faloutsos, C.; Page, D. AutoBlock: A hands-off blocking framework for entity matching. In Proceedings of the WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; pp. 744–752. [Google Scholar]
  38. Brinkmann, A.; Shraga, R.; Bizer, C. SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines. arXiv 2023, arXiv:2303.03132. [Google Scholar] [CrossRef]
  39. Cohen, W.W.; Ravikumar, P.; Fienberg, S.E. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), Acapulco, Mexico, 9–10 August 2003; pp. 73–78. [Google Scholar]
  40. Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. 1999, 31, 264–323. [Google Scholar] [CrossRef]
  41. Koudas, N.; Sarawagi, S.; Srivastava, D. Record linkage: Similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 27–29 June 2006; pp. 802–803. [Google Scholar]
  42. Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable k-means++. In Proceedings of the VLDB Endowment, Istanbul, Turkey, 27–31 August 2012; Volume 5, pp. 622–633. [Google Scholar]
  43. Gimenez, P.; Soru, T.; Marx, E.; Ngomo, A.C.N. Entity resolution with language models: A survey. arXiv 2023, arXiv:2305.10687. [Google Scholar]
  44. Wang, X. A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, Vancouver, BC, Canada, 11–14 December 2011; pp. 794–803. [Google Scholar]
  45. Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 117–128. [Google Scholar] [CrossRef]
  46. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  47. Gong, Y.; Lazebnik, S.; Gordo, A.; Perronnin, F. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2916–2929. [Google Scholar] [CrossRef]
  48. Babenko, A.; Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 931–938. [Google Scholar]
  49. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. arXiv 2017, arXiv:1702.08734. [Google Scholar] [CrossRef]
  50. Malkov, Y.A.; Yashunin, D.A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 824–836. [Google Scholar] [CrossRef] [PubMed]
  51. Xiao, W.; Zhan, Y.; Xi, R.; Hou, M.; Liao, J. Enhancing HNSW index for real-time updates: Addressing unreachable points and performance degradation. arXiv 2024, arXiv:2407.07871. [Google Scholar] [CrossRef]
  52. Li, C.; Zhang, M.; Andersen, D.G.; He, Y. Improving approximate nearest neighbor search through learned adaptive early termination. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 2539–2554. [Google Scholar]
  53. Zhang, Z.; Jin, C.; Tang, L.; Liu, X.; Jin, X. Fast, approximate vector queries on very large unstructured datasets. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; pp. 995–1011. [Google Scholar]
  54. Li, M.; Wang, Z.; Liu, C. A Self-Learning Framework for Partition Management in Dynamic Vector Search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Berlin, Germany, 22–27 June 2025. [Google Scholar]
Figure 1. Performance of IVF, HNSW, and SCANN on the Scholar-DBLP dataset. The graph shows that for any given recall target, HNSW and SCANN offer significantly higher throughput (QPS) than the baseline IVF method. HNSW excels at the highest recall levels, while SCANN provides competitive performance across the board.
Figure 2. Performance on the much larger DBLP-Synthetic dataset. The results reinforce the trend seen in the first experiment, with HNSW and SCANN again defining the Pareto frontier. The performance gap between these state-of-the-art methods and the IVF baseline is even more pronounced at this larger scale, highlighting the importance of advanced indexing techniques for high-volume ER tasks.
Figure 3. Decision flowchart for selecting an ANN indexing strategy.
Table 1. Comparative analysis of ANN indexing methods.
Characteristic | IVF (Faiss) | SCANN | HNSW
Paradigm | Partition-based | Partition-based | Graph-based
Index mechanism | K-Means partitions (Voronoi cells) with an inverted file structure | K-Means-like partitioning combined with score-aware anisotropic vector quantization | Multi-layered proximity graph (Navigable Small World)
Memory cost | Low to moderate; highly tunable via codebook size (PQ) | Moderate; generally lower than HNSW at similar performance points | High; the entire graph structure, including all edges, must be stored in memory
Update handling | Efficient; inserting/deleting a vector only requires updating a specific partition’s list | Efficient; similar to IVF, updates are localized to partitions | Inefficient/costly; updates may require expensive rewiring of graph edges across multiple layers
Ideal ER scenario | Very large-scale datasets where memory efficiency and reasonable throughput are critical; a strong, scalable baseline | High-throughput applications on large, static datasets where maximizing the number of queries resolved is the primary goal | Applications requiring the absolute highest recall at low latency, on static or infrequently updated datasets