1. Introduction
Entity Resolution (ER) is the identification of co-referent entities across datasets. Different communities refer to it as instance matching, record linkage, and the merge-purge problem [
1,
2]. Scalability concerns motivate a two-step solution [
1], as illustrated in
Figure 1. The first step, blocking, mitigates brute-force pairwise comparisons on all entities by clustering entities into blocks and then comparing pairs of entities only within blocks [
3]. For example, let us assume two knowledge graphs (KGs) describing customers (containing details such as names, addresses, and purchase histories) that must be linked. A blocking key, such as ‘Tokens(LastName)’, could first be applied to each node in the two KGs, as shown in the figure. In essence, this is a function that tokenizes the last name of each customer and assigns the customer to a block, indexed by the last-name token. According to the figure, this would lead to five overlapping blocks. One reason why blocks could overlap is that some customers may have multiple tokens in their last name (e.g., Michael Ann-Moss).
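To make the idea concrete, a minimal sketch of such a blocking key is given below; the customer records and field names are invented for illustration and are not part of the running example.

```python
from collections import defaultdict

def tokens_last_name(customer):
    """Hypothetical blocking key Tokens(LastName): split on whitespace and hyphens."""
    return customer["last_name"].replace("-", " ").split()

def build_blocks(customers):
    """Assign each customer to one block per last-name token; blocks may overlap."""
    blocks = defaultdict(list)
    for customer in customers:
        for token in tokens_last_name(customer):
            blocks[token].append(customer["id"])
    return dict(blocks)

# Invented records from two KGs; 'Ann-Moss' yields two tokens, hence two blocks.
customers = [{"id": "kg1:c1", "last_name": "Ann-Moss"},
             {"id": "kg1:c2", "last_name": "Beats"},
             {"id": "kg2:c7", "last_name": "Moss"}]
print(build_blocks(customers))
# {'Ann': ['kg1:c1'], 'Moss': ['kg1:c1', 'kg2:c7'], 'Beats': ['kg1:c2']}
```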
Blocking results in the selection of a small subset of pairs, called the candidate set, which is input to a second step (called the ‘similarity’ step) to determine co-reference using a sophisticated similarity function. A good blocking key can lead to significant computational savings, as shown in the figure, while leading to minimal (or even no) loss of recall in the similarity step. While we exclusively address blocking in this work, we emphasize that it is only one part of a complete ER workflow. Pointers to the literature on the similarity step (that can often be applied independently of blocking) are provided in
Section 2. We also provide background on the overall ER process, to place this work in context, in
Section 3.
Blocking methods use a blocking scheme to assign entities to blocks. In some cases, blocking scheme and key are used interchangeably, but, typically, the former builds on the latter in a manner that we detail more formally in
Section 4. Over the last two decades, Disjunctive Normal Form (DNF) Blocking Scheme Learners (BSLs) were proposed to learn DNF blocking schemes using supervised [
4,
5], semi-supervised [
6], or unsupervised machine learning techniques [
7]. DNF-BSLs have emerged as state-of-the-art, because they operate in an expressive hypothesis space and have demonstrated excellent empirical performance on real-world data [
4]. Examples of DNF-BSLs will be provided in
Section 4. However, despite their advantages, current DNF-BSLs assume that input datasets are tabular and have the same schemas. The latter assumption is often denoted as structural homogeneity [
1]. Taking the earlier example of the customer domain, both of the datasets may rely on the same schema or ontology (also called a T-Box), which contains concepts such as Name, Address, Date, Credit Card Number, and so on, as well as properties such as
name_of,
lives_in,
identified_by, and so on.
The assumption of tabular (or in the more general case of KGs, ontological) structural homogeneity restricts application of DNF-BSLs to other data models. The recent growth of heterogeneous KGs and ecosystems, such as the Linked Open Data [
8] (in the Semantic Web community), motivates the development of a DNF-BSL for an arbitrary KG represented using the Resource Description Framework (RDF) data model. These graphs are often published by independent sources and are
structurally heterogeneous [
9]. Such KGs can make important contributions to multiple downstream search and recommendation applications, especially in e-commerce [
10,
11,
12], but also in non-commercial domains, like building better systems for managing COVID-19 information overload, as evidenced by the success of the Google Knowledge Graph and the ongoing development of efforts such as the Amazon Product Graph [
13,
14,
15]. However, without well-performing, efficient ER that works on structurally heterogeneous data, limited use can be made of publicly available KGs that are inexpensive to acquire and download, but have many redundancies and entity-overlap.
In this paper, we present a generic algorithmic pipeline for learning DNF blocking schemes on pairs of RDF KGs. The formalism of DNF blocking schemes relies on the existence of a schema. KGs, including RDF datasets published on the Web, may not have accompanying schemas [
8], or in cases involving multiple KGs, may have different, independently developed schemas. Our proposed approach builds a dynamic schema using the properties in the KG. The KG dataset can then be logically represented as a property table, which may be populated at run-time. Previously, property tables were defined as physical data structures that were used in the implementation of triplestores [
16]. By using a logical (rather than physical) property table representation, our approach admits the application of a DNF-BSL to RDF datasets, including KGs.
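As a rough sketch of this logical representation (the triples and the helper below are illustrative only; the actual construction is detailed in Section 5.1), a set of RDF triples can be viewed as a table whose columns are the properties occurring in the graph:

```python
from collections import defaultdict

def to_property_table(triples):
    """Logically view a set of RDF triples as a property table: one row per subject,
    one column per property, with each cell holding a (possibly empty) set of objects."""
    schema = sorted({p for _, p, _ in triples})   # dynamic schema built from the properties
    rows = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        rows[s][p].add(o)
    table = {s: {p: row[p] for p in schema} for s, row in rows.items()}
    return schema, table

# Invented triples loosely modeled on the customer example.
triples = [("Mickey_Beats", "hasWife", "Joan_Beats"),
           ("Mickey_Beats", "livesIn", "Austin"),
           ("Joan_Beats", "livesIn", "Austin")]
schema, table = to_property_table(triples)
# schema == ['hasWife', 'livesIn']; table['Joan_Beats']['hasWife'] == set()
```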
As a further special case, the pipeline also admits structurally heterogeneous tabular inputs. That is, the pipeline can be applied to tabular datasets with different schemas. Thus, previous DNF-BSLs become special cases of the pipeline, since they learn schemes on tables with the same schemas. As a second special case, the pipeline accommodates RDF-tabular heterogeneity, with one input, RDF, and the other, tabular. The utility of RDF-tabular heterogeneity is particularly apparent when linking datasets between Linked Open Data and the relational Deep Web [
8,
17]. The proposed method allows us, at least in principle, to build efficient ER systems for doing so.
2. Related Work
Elmagarmid et al. comprehensively surveyed ER [
1], with a generic approach represented by
Swoosh [
19]. Swoosh comprises a family of ER algorithms, including G-Swoosh, R-Swoosh, and F-Swoosh, with different theoretical guarantees and levels of expressiveness. However, Swoosh does not directly address the blocking problem.
A fairly recent survey on ER was provided in [
20]. Within the ER community, blocking has separately witnessed much specific research, with Christen surveying numerous blocking methods [
3], including traditional blocking and Sorted Neighborhood. In the specific research area of learning blocking schemes, which is what this work also builds on, Bilenko et al. [
4] and Michelson and Knoblock [
5] independently proposed the first supervised DNF-BSLs in 2006. Since then, a semi-supervised adaptation of the BSL proposed by Bilenko et al. has been published [
4,
6], as well as an unsupervised system [
7]. The four systems assume structural homogeneity. We discuss their core principles in
Section 5.3, and subsequently detail how the proposed approach generalizes them.
Since the advent of large knowledge graphs (KGs) on the Web [
13,
21,
22], the problem of ER has taken on new urgency [
23,
24,
25,
26]. Recently, similarity techniques have become quite advanced, especially due to the rise of language representation models, such as BERT, GPT-3, and T5 [
27,
28,
29], as well as so-called knowledge graph embeddings [
30,
31,
32]. These models can be used to automatically ‘embed’ data items (including nodes, edges, words, or even sentences, depending on the model and input) into a continuous, low-dimensional, and real-valued vector space. Even using simple techniques, like logistic regression or cosine similarity (in the vector space), the vectors can be used to decide when two entities refer to the same underlying entity. Hence, the problem of feature engineering has largely been solved. However, because of the size of these KGs, blocking continues to be an important step, and research on blocking has lagged behind that of developing advanced similarity techniques. This paper is an attempt in that direction.
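A minimal sketch of this decision rule, assuming the embeddings have already been computed by some model (the vectors and the threshold below are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors standing in for the embeddings of two entity descriptions
# produced by some language or KG embedding model.
e1 = np.array([0.12, 0.80, 0.35])
e2 = np.array([0.10, 0.78, 0.40])
same_entity = cosine(e1, e2) > 0.9   # the threshold here is illustrative only
```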
Heterogeneous blocking may be performed without learning a DNF scheme. One example is Locality Sensitive Hashing (LSH) [
33,
34], employed by the Harra system, for instance [
35]. LSH is an algorithmic method that hashes similar inputs into the same ‘buckets’ with high probability. The efficacy of the algorithm depends on the precise definition of ‘similarity’ applied, and how high the similarity should be. With these caveats in place, an LSH ‘bucket’ could be thought of as a block. While LSH is promising, it only applies to specific distance measures, such as Jaccard and cosine distance (although recently, a measure was also proposed for the edit distance [
36]). It is not clear how one would apply LSH to more complicated similarity functions, including machine learning classifiers.
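A minimal MinHash-style sketch of this bucketing idea for the Jaccard similarity is shown below; the number of hash functions and the band size are illustrative and untuned, and the token sets are invented.

```python
import random

def minhash_signature(tokens, num_hashes=8, seed=13):
    """MinHash signature of a token set; two sets agree on a signature position
    with probability equal to their Jaccard similarity."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % prime for t in tokens) for a, b in params]

def lsh_bucket_keys(signature, band_size=2):
    """Split the signature into bands; each band becomes one bucket ('block') key."""
    return {(i, tuple(signature[i:i + band_size]))
            for i in range(0, len(signature), band_size)}

sig_a = minhash_signature({"mickey", "beats", "austin"})
sig_b = minhash_signature({"mickey", "beats", "texas"})
candidate_pair = bool(lsh_bucket_keys(sig_a) & lsh_bucket_keys(sig_b))
```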
Another good application of LSH for instance-based matching of large ontologies is [
37]. The Typifier system by Ma et al. is an example that relies on type inferencing and it was designed for Web data published without semantic types [
38]. In contrast, DNF-BSLs can be applied generally, with multiple studies showing strong empirical performance [
4,
5,
6,
7]. Other, more recent blocking methods include skyblocking [
39], on top of which some recent work has also been developed [
40]. In the skyblocking framework, each blocking scheme is mapped as a point to a multi-dimensional scheme space. The authors present efficient algorithms for learning such schemes. Finally, although related, clustering is technically treated separately from blocking in the literature [
3]. However, recent approaches involving micro-clustering may show more promise for applying clustering methodologies to ER [
41].
In the Semantic Web, ER is also known as instance matching and link discovery, and it has been surveyed by Ferrara et al. [
2]. Existing work in the Semantic Web tends to restrict inputs to RDF. Also, most techniques in the Semantic Web do not learn schemes, but instead present graph-based blocking methods, with a good example being Silk [
9]. Another recent method by Zhu et al. [
25] uses unsupervised methods on multi-type graphs. Unfortunately, this method suffers from high complexity, and would likely benefit from the blocking methods described in this paper. In our own prior work, we have separately presented blocking methods for RDF graphs and for tables [
7,
26], but this is the first work to attempt to combine both types of inputs in a unified framework and demonstrate viable empirical performance on a range of datasets. A more theoretical treatment on graph-theoretic blocking schemes can be found in [
42]. This article also significantly combines and builds on non-archival work by the author [
42,
43,
44,
45,
46].
The framework in this paper also relies on schema mapping. Schema mapping is an active research area, with a good survey provided by Bellahsene et al. [
47]. Gal notes that it is a difficult problem [
48]. Schema matchers may return 1:1 or n:m mappings (or even 1:n and n:1). An instance-based schema matcher relies on data instances to perform schema matching [
47]. A good example is Dumas [
18], which relies on an inexpensive
duplicates generator to perform unsupervised schema matching [
18]. We describe Dumas in
Section 6. An example of a related, and very recent, work that uses metadata (such as matching dependencies) to enhance the ER process is [
49]. In principle, this work is similar to ours. The work by Caruccio et al. [
50], while not directly about entity resolution, tackles the related problem of mining relaxed functional dependencies from data.
The property table representation used in this paper is a physically implemented data structure in the Jena triplestore API (https://jena.apache.org/, accessed on 18 March 2021) [
16]. In this paper, it is used as a logical data structure. We note that the concept of logically representing one data model as another has precedent. In particular, literature abounds with proposed methods on how to integrate relational databases (RDB) with the Semantic Web. Sahoo et al. extensively surveyed this topic, called RDB2RDF [
51]. A use-case is the Ultrawrap architecture, which utilizes RDB2RDF to enable real-time Ontology-based Data Access or OBDA [
52]. We effectively tackle the inverse problem by translating an RDF graph to a logical property table. To our knowledge, this is the first application to devise such an inverse translation for heterogeneous ER.
4. Preliminaries and Formalism
We present blocking-specific definitions and examples to place the remainder of the work in context. Consider a pair of datasets $R_1$ and $R_2$. Each dataset individually conforms to either the RDF or the tabular data model. An RDF dataset may be visualized as a directed graph or, equivalently, as a set of triples. A triple is a three-tuple of the form (subject, property, object). Note that a property is also called a ‘predicate’, although we will continue to use ‘property’ for the purposes of uniformity. A tabular dataset conforms to a tabular schema, which is the table name followed by a set of fields. The dataset instance is a set of tuples, with each tuple comprising field values.
Example 1. Dataset 1 (in Figure 3) is a simplified version of the running example introduced earlier in Figure 2. It is an RDF dataset visualized as a directed graph, and it can be equivalently represented as a set of triples. For example, (Mickey Beats, hasWife, Joan Beats) would be one such triple in the triples representation. Datasets 2 and 3 are tabular dataset examples, with the former having schema Emergency Contact(Name, Contact, Relation). The first tuple of Dataset 2 has field values Mickey Beats, Joan Beats, and Spouse, respectively. The keyword null is reserved. We will use the data in Figure 3 as the basis for our running examples in this section. According to the RDF specification (
http://www.w3.org/RDF/, accessed on 18 March 2021), subjects and properties must necessarily be Uniform Resource Identifiers (URIs), while an object node may either be a URI or a literal. URI elements in RDF files typically have associated names (or labels), obtained through de-reference. For ease of exposition, we henceforth refer to every URI element in an RDF file by its associated name. Additionally, note that in the most general case, RDF datasets do not have to conform to any schema. This is why they are commonly visualized as semi-structured datasets, and not as tables. In
Section 5.1, we show how to dynamically build a property schema and logically represent RDF as a tabular data structure.
An entity is defined as a semantically distinct subject node in an RDF dataset, or as a (semantically distinct) tuple in a tabular dataset. The entity referring to Mickey Beats is shown in red in all datasets in
Figure 3. In this context, ER is the process of resolving semantically equivalent (but possibly syntactically different) entities. As earlier described, the majority of ER research has traditionally assumed structural homogeneity, an example of which would be identifying that the two highlighted tuples in Dataset 3 are duplicates.
In the Semantic Web, ER is operationalized by connecting two equivalent entities with an owl:sameAs (
http://www.w3.org/TR/owl-ref/, accessed on 18 March 2021) property edge. For example, the two nodes referring to Mickey Beats in Dataset 1 should be connected using an owl:sameAs edge. The ease of operationalizing ER (and more generally, ‘link specification’ [
9]) explains in part the ongoing interest in ER in the Semantic Web [
2]. In the relational setting, ER is traditionally operationalized through joins or mediated schemas. It is less evident how to operationalize ER across RDF-tabular inputs, such as linking Datasets 1 and 2. We return to this issue in
Section 5.1.
In order to introduce the current notion of DNF blocking schemes, tabular structural homogeneity is assumed for the remainder of this section. In later sections, we generalize the concepts as a core contribution of this paper.
The most basic elements of a blocking scheme are indexing functions
[
4]. An indexing function accepts a field value from a tuple as input and returns a set
Y that contains 0 or more blocking key values (BKVs). A BKV identifies a block in which the tuple is placed. Intuitively, one may think of a block as a hash bucket, except that blocking is one-many while hashing is typically many-one [
3]. For example, if
Y contains multiple BKVs, then a tuple is placed in multiple blocks.
Definition 1. An indexing function $h_i$ takes as input a field value from some tuple t and returns a set Y that contains 0 or more Blocking Key Values (BKVs) from the set of all possible BKVs.
The domain is usually just the string datatype. The range is a set of BKVs that the tuple is assigned to. Each BKV is represented by a string identifier.
Example 2. An example of an indexing function is Tokens. When applied to the Last Name field value of the fourth tuple in Dataset 3, the output set Y is the set of last-name tokens of that tuple, which includes the token Beats.
This leads to the notion of a general blocking predicate (GBP). Intuitively, a GBP takes as input field values from two tuples, $t_1$ and $t_2$, and uses the ith indexing function $h_i$ to obtain BKV sets $Y_1$ and $Y_2$ for the two arguments. The predicate is satisfied if $Y_1$ and $Y_2$ share elements, or equivalently, if $t_1$ and $t_2$ have a block in common.
Definition 2. A general blocking predicate $p_i$ takes as input field values $v_1$ and $v_2$ from two tuples, $t_1$ and $t_2$, and returns True if $h_i(v_1) \cap h_i(v_2) \neq \emptyset$, and returns False otherwise.
Each GBP is always associated with an indexing function.
Example 3. Consider the GBP ContainsCommonToken, associated with the previously introduced Tokens. Suppose that it is given the Last Name field values from the first and fourth tuples in Dataset 3. Because these field values have a token (Beats) in common, the GBP returns True.
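In code, the Tokens indexing function and its associated GBP could be sketched as follows; the second field value is invented for illustration, since Example 3 only states that the first and fourth tuples share the token Beats.

```python
def tokens(field_value):
    """Indexing function Tokens: the BKV set is the set of whitespace/hyphen tokens."""
    return set(field_value.replace("-", " ").split())

def contains_common_token(value_1, value_2):
    """General blocking predicate: True iff the two BKV sets intersect."""
    return bool(tokens(value_1) & tokens(value_2))

# The second value is invented; only the shared token 'Beats' is given in the text.
print(contains_common_token("Beats", "Ann-Beats"))   # True
```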
A specific blocking predicate (SBP) explicitly pairs a GBP to a specific field.
Definition 3. A specific blocking predicate is a pair $(p_i, f)$, where $p_i$ is a general blocking predicate and f is a field. A specific blocking predicate takes two tuples $t_1$ and $t_2$ as arguments and applies $p_i$ to the appropriate field values $t_1.f$ and $t_2.f$ from both tuples. A tuple pair is said to be covered if the specific blocking predicate returns True for that pair.
Previous DNF research assumed that all available GBPs can be applied to
all fields of the relation [
4,
5,
6,
7]. For this reason, they were neither obviously applicable to different-schema relational databases, nor to knowledge graphs. Hence, given a relation
R with
m fields in its schema, and
s GBPs, the number of SBPs is exactly $m \times s$. Note that structural homogeneity implies exactly one input schema, even if there are multiple relational instances. Finally, a DNF blocking scheme is defined as follows:
Definition 4. A DNF blocking scheme is a positive propositional formula constructed in Disjunctive Normal Form or DNF (a disjunction of terms, where each term is a conjunction of literals), using a given set H of SBPs as the set of atoms. Additionally, if each term is constrained to comprise at most one atom, then the blocking scheme is referred to as disjunctive.
SBPs cannot be negated, since the DNF scheme is a positive formula. A tuple pair is said to be covered if the blocking scheme returns True for that pair. Intuitively, this means that the two constituent tuples share a block. In practice, both duplicate and non-duplicate tuple pairs can end up getting covered, since blocking is just a pre-processing step.
Example 4. Consider the disjunctive scheme (ContainsCommonToken, Last Name) ∨ (SameFirstDigit, Zip), applied on Dataset 3. While the two tuples referring to Mickey Beats would share a block (with the BKV Beats), the non-duplicate tuples referring to Susan and Samuel would also share a block (with the BKV 6). Additionally, note that the first and fourth tuples share more than one block, since they also have BKV 7 in common.
Given a blocking scheme, a blocking method would need to map tuples to blocks efficiently. According to the definition provided earlier, a blocking scheme takes a tuple pair as input. In practice, linear-time hash-based techniques are usually applied.
Example 5. To efficiently apply the blocking scheme in the previous example on each individual tuple, tokens from the field value corresponding to field Last Name are extracted, along with the first character from the field value of the Zip field, to obtain the tuple’s set of BKVs. For example, when applied to the first tuple of Dataset 3, the BKV set {Beats, 7} is extracted. An index is maintained, with the BKVs as keys and tuple pointers as values. With n tuples, traditional blocking computes the blocks in $O(n)$ time [3]. Let the set of generated blocks be $\Pi$. $\Pi$ contains sets of the form $B_v$, where $B_v$ is the block that is referred to by the BKV v. The candidate set of pairs $\Gamma$ is given below:
$\Gamma = \{(t_i, t_j) \mid i \neq j \;\wedge\; \exists\, B_v \in \Pi \text{ such that } t_i \in B_v \wedge t_j \in B_v\}$
$\Gamma$ is precisely the set input to the second step of ER, which classifies each pair as a duplicate, non-duplicate, or probable duplicate [55]. Blocking should produce a small $\Gamma$, but with high coverage and density of duplicates. Metrics quantifying these properties are defined in
Section 7.
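A compact sketch of this hash-based block construction and candidate-set generation, using the disjunctive scheme from Example 4, is given below; the rows are invented stand-ins for Dataset 3, whose real tuples appear in Figure 3.

```python
from collections import defaultdict
from itertools import combinations

def bkvs(row):
    """BKVs under the scheme (ContainsCommonToken, Last Name) OR (SameFirstDigit, Zip)."""
    keys = set(row["Last Name"].replace("-", " ").split())
    keys.add(row["Zip"][0])
    return keys

def candidate_set(rows):
    index = defaultdict(set)                      # BKV -> block (set of tuple indices)
    for i, row in enumerate(rows):
        for v in bkvs(row):
            index[v].add(i)
    gamma = set()
    for block in index.values():                  # all distinct pairs that share a block
        gamma.update(combinations(sorted(block), 2))
    return gamma

# Invented stand-ins for Dataset 3.
rows = [{"Last Name": "Beats", "Zip": "78701"},
        {"Last Name": "Moss", "Zip": "60601"},
        {"Last Name": "Ann-Beats", "Zip": "78705"}]
print(candidate_set(rows))   # {(0, 2)}: these tuples share the blocks 'Beats' and '7'
```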
Finally, schema mapping is utilized in the paper. The formal definition of a mapping is quite technical; the survey by Bellahsene et al. provides a full treatment [
47]. In this paper, an intuitive understanding of the mapping as a pair of field-sets suffices. For example, ({Name},{First Name, Last Name}) is a 1:n mapping between Datasets 2 and 3. More generally, mappings may be of cardinality n:m. The simplest case is a 1:1 mapping, with singleton components.
6. An Unsupervised Instantiation
A key question addressed in this work is whether the generic pipeline can be instantiated in an unsupervised fashion. As we showed earlier, existing DNF-BSLs that can be extended require some form of supervision. An unsupervised heterogeneous DNF-BSL is important because, in principle, it enables a fully unsupervised ER workflow in both the relational and Semantic Web communities. As the surveys by Elmagarmid et al. and Ferrara et al. note, unsupervised techniques for the second ER step already exist [
1,
2]. A second motivation is the observation that existing unsupervised and semi-supervised homogeneous DNF-BSLs (Systems 3–4) require considerable parameter tuning. Parameter tuning is being increasingly cited as an important algorithmic issue, in applications ranging from schema matching [
57] to generic machine learning [
58]. Variety in Big Data implies that algorithm design cannot discount parameter tuning.
We propose an unsupervised instantiation with a new DNF-BSL that only requires two parameters. In
Table 1, only the supervised System 1 requires as few as two parameters. The schematic of the unsupervised instantiation (of the generic pipeline in
Figure 4a) is shown in
Figure 4b. We use the existing schema matcher, Dumas, in the instantiated pipeline [
18]. Dumas outputs 1:1 field mappings by first using a duplicates generator to locate tuple pairs with high cosine similarity. In the second step, Dumas uses Soft-TFIDF to build a similarity matrix from each generated duplicate. If
n duplicates are input to the second step, then
n similarity matrices are built and then averaged into a single similarity matrix. The assignment problem is then solved by invoking the Hungarian Algorithm on this matrix [
59]. This results in exactly
1:1 field mappings (the set
Q) being output.
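A compressed sketch of these two Dumas steps, assuming the per-duplicate field-similarity matrices have already been computed (SciPy's linear_sum_assignment plays the role of the Hungarian algorithm; the matrices and field names are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dumas_style_mappings(similarity_matrices, fields_1, fields_2):
    """Average the per-duplicate field-similarity matrices, then solve the
    assignment problem (Hungarian algorithm) to obtain 1:1 field mappings Q."""
    avg = np.mean(similarity_matrices, axis=0)
    rows, cols = linear_sum_assignment(-avg)      # negate to maximize total similarity
    return [(fields_1[r], fields_2[c]) for r, c in zip(rows, cols)]

# Hypothetical matrices for two generated duplicates; rows index the fields of
# dataset 1 and columns index the fields of dataset 2.
mats = [np.array([[0.9, 0.1, 0.2], [0.2, 0.8, 0.3]]),
        np.array([[0.8, 0.2, 0.1], [0.1, 0.9, 0.2]])]
Q = dumas_style_mappings(mats, ["Name", "Contact"],
                         ["First Name", "Last Name", "Phone"])
# [('Name', 'First Name'), ('Contact', 'Last Name')]
```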
In addition to using
Q, we recycle the noisy duplicates of Dumas and then pipe them into Algorithm 1. Note that Dumas does not generate non-duplicates. We address this issue in a novel way, by permuting the generated duplicates set
D. Suppose that
D contains
n tuple pairs $(r_i, s_i)$, with each $r_i$ and $s_i$ drawn, respectively, from datasets $R_1$ and $R_2$. By randomly permuting the pairs in D, we heuristically obtain non-duplicate pairs of the form $(r_i, s_j)$, $i \neq j$. Note that (at most) $n!$ distinct permutations are possible. For balanced supervision, we set $|N| = |D|$, with N the permutation-generated set.
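A small sketch of this permutation heuristic (assuming at least two generated duplicate pairs; the function name is ours):

```python
import random

def permute_non_duplicates(duplicates, seed=7):
    """Derive a non-duplicates set N from duplicates D = [(r_1, s_1), ..., (r_n, s_n)]
    by re-pairing each r_i with an s_j from a different pair (i != j)."""
    rng = random.Random(seed)
    rs = [r for r, _ in duplicates]
    ss = [s for _, s in duplicates]
    perm = list(range(len(ss)))
    while any(i == p for i, p in enumerate(perm)):   # retry until no fixed points remain
        rng.shuffle(perm)
    # Under duplicate sparsity, the mismatched pairs are almost surely non-duplicates;
    # |N| = |D| gives balanced supervision.
    return [(rs[i], ss[p]) for i, p in enumerate(perm)]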
Algorithm 1 Learn Extended k-DNF Blocking Scheme.
Input: Set D of duplicate tuple pairs, set Q of mappings
Parameters: Beam search parameter k, SC-threshold
Output: Extended DNF blocking scheme
Method:
// Step 0: Construct sets N and $H_k$
1. Permute pairs in D to obtain N, with $|N| = |D|$
2. Construct set H of simple extended SBPs using the set G of GBPs and Q
3. Supplement set H to get set $H_k$ using k
// Step 1: Build multimaps $M_D^{-1}$ and $M_N^{-1}$
4. Construct $M_D$: each key X is a tuple pair in D, and its value set contains the elements in $H_k$ covering X
5. Repeat the previous step to build $M_N$ for the tuple pairs in N
6. Reverse $M_D$ and $M_N$ to respectively get $M_D^{-1}$ and $M_N^{-1}$
// Step 2: Run approximation algorithm
7. for all keys X in $M_D^{-1}$ do
8.   Score X as the fraction of pairs in D covered by X minus the fraction of pairs in N covered by X
9.   Remove X if its score is below the SC-threshold
10. end for
11. Perform W-SC on the keys in $M_D^{-1}$ using Chvatal’s heuristic, with weights set to negative scores
// Step 3: Construct and output DNF blocking scheme
12. Take the disjunction of the chosen keys
13. Output the extended DNF blocking scheme
Empirically, the permutation is expected to yield a precise N because of the observed sparsity of duplicates in ER datasets [
3,
7]. This sparsity is also a key tenet underlying the blocking procedure itself. If the datasets were dense in duplicates, blocking would not yield any savings.
Algorithm 1 shows the pseudocode of the extended DNF BSL. Inputs to the algorithm are the piped Dumas outputs,
D and
Q. To learn a blocking scheme from these inputs, two parameters,
k and the SC-threshold, need to be specified. Similar to (extended) Systems 1–3 in
Table 1,
G,
Q, and
k are used to construct the search space, $H_k$. Note that
G is considered the algorithm’s feature space, and it is not a dataset-dependent input (or parameter). Designing an expressive
G has computational and qualitative effects, as we empirically demonstrate. We describe the GBPs that are included in
G in
Section 7.
Step 0 in Algorithm 1 is the permutation step just described to generate the non-duplicates set
N.
G and
Q are then used to construct the set
H of simple extended (because Dumas only outputs 1:1 mappings) SBPs, with $|H| = |G| \cdot |Q|$. H is supplemented (using parameter k) to yield $H_k$, as earlier described in
Section 5.3.
Step 1 constructs multimaps (multimap keys reference multiple values, or a value set) on which Set Covering (SC) is eventually run. As a first logical step, multimaps $M_D$ and $M_N$ are constructed. Each tuple pair (TP) in D is a key in $M_D$, with the SBPs and terms in $H_k$ covering that TP comprising the value set. $M_D$ is then reversed to yield $M_D^{-1}$. $M_N^{-1}$ is built analogously. Figure 7 demonstrates the procedure, assuming that D contains TPs 1–5, covered as shown in Figure 6. The time complexity of building (both) $M_D^{-1}$ and $M_N^{-1}$ is $O(|H_k| \cdot (|D| + |N|))$.
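A sketch of this multimap construction follows; the names and the covers() method on SBPs/terms are assumptions of this sketch, not prescribed interfaces.

```python
from collections import defaultdict

def build_multimaps(tuple_pairs, predicates):
    """Build the forward multimap (tuple pair -> covering SBPs/terms) and its
    reverse (SBP/term -> covered tuple pairs). Each predicate is assumed to
    expose a covers(tuple_pair) method."""
    forward = {tp: {p for p in predicates if p.covers(tp)} for tp in tuple_pairs}
    reverse = defaultdict(set)
    for tp, preds in forward.items():
        for p in preds:
            reverse[p].add(tp)
    return forward, dict(reverse)

# m_d, m_d_inv = build_multimaps(D, H_k)   # and analogously m_n, m_n_inv for N
```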
In Step 2, each key is first scored by calculating the difference between the fractions of covered duplicates and non-duplicates. A threshold parameter, the SC-threshold, is used to remove the SBPs and terms that have low scores. Intuitively, the SC-threshold tries to balance the conflicting needs of the coverage parameters described previously, and reduce tuning effort. The range of the SC-threshold is [−1, 1]. An advantage of the parameter is that it has an intuitive interpretation. A value that is close to 1 would indicate that the user is confident about low noise-levels in inputs D and Q, since a high threshold implies the existence of elements in $H_k$ that cover many positives and few negatives. Because many keys in $M_D^{-1}$ are removed by a high threshold, this also leads to computational savings. However, setting the threshold too high (perhaps because of misguided user confidence) could potentially lead to excessive purging of $M_D^{-1}$, and subsequent algorithm failure. Experimentally, we show that the SC-threshold is easily tunable and that even high values are robust to noisy inputs.
Weighted Set Covering (W-SC) is then performed using Chvatal’s algorithm (we include Chvatal’s algorithm in the survey in Appendix A.3) [56], with each key in $M_D^{-1}$ acting as a set and the tuple pairs covered by all keys as elements of the universe set U. For example, assuming that all SBPs and terms in the keyset of $M_D^{-1}$ in Figure 7 have scores above the SC-threshold, U would comprise all five TPs. Note that only $M_D^{-1}$ is pruned (using the SC-threshold) and, also, W-SC is performed only on $M_D^{-1}$. $M_N^{-1}$ only aids in the score calculation (and subsequent pruning process) and may be safely purged from memory before W-SC commences.
W-SC needs to find a subset of the keyset that covers all of U with minimum total weight. For this reason, the weight of each set is the negative of its calculated score. Given that the sets chosen by W-SC actually represent SBPs or terms, their disjunction is the k-DNF blocking scheme.
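Putting Steps 2 and 3 together, a compact sketch is given below; m_d_inv and m_n_inv map each SBP/term to the duplicate and non-duplicate pairs it covers (following the multimap sketch above), and the tie-breaking details of Chvatal's heuristic are simplified.

```python
def learn_scheme(m_d_inv, m_n_inv, num_dups, num_nondups, sc_threshold):
    """Score and prune the keys of the reversed duplicate multimap, then run a
    Chvatal-style greedy weighted set cover with weights equal to negative scores."""
    # Score = fraction of duplicates covered minus fraction of non-duplicates covered.
    scores = {p: len(dups) / num_dups - len(m_n_inv.get(p, ())) / num_nondups
              for p, dups in m_d_inv.items()}
    surviving = {p: set(dups) for p, dups in m_d_inv.items()
                 if scores[p] >= sc_threshold}

    universe = set().union(*surviving.values()) if surviving else set()
    covered, chosen = set(), []
    while covered != universe:
        gains = {p: dups - covered for p, dups in surviving.items() if dups - covered}
        if not gains:
            break
        # Greedy choice: minimize weight (-score) per newly covered duplicate pair.
        best = min(gains, key=lambda p: -scores[p] / len(gains[p]))
        chosen.append(best)
        covered |= gains[best]
    return chosen   # their disjunction is the learned k-DNF blocking scheme
```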
Under plausible (essentially, assuming that $P \neq NP$) complexity assumptions, Chvatal’s algorithm is essentially the best-known polynomial-time approximation for W-SC [
60]. For example, Bilenko et al. used Peleg’s approximation to Red-Blue SC [
61,
62], which is known to have worse bounds [
62]. The proposed DNF-BSL has the strongest theoretical approximation guarantee of all systems in
Table 1.
9. Discussion
Earlier, when discussing the preliminary experimental results evaluating Dumas (
Section 7.2), we noted that an extended DNF-BSL can only integrate well into the pipeline if it is robust to noise from previous steps. Previous research has noted the overall robustness of DNF-BSLs. This led to the recent emergence of a homogeneous unsupervised system [
7], which was adapted here as a semi-supervised baseline. Experiment 1 results showed that this robustness also carries over to extended DNF-BSLs. High overall performance shows that the pipeline can accommodate heterogeneity, a key goal of this paper.
Experiment 2 results demonstrate the advantage of having an expressive
G, which is evidently more viable than increasing
k. On DPs 1 and 5 (that the systems succeeded on), no statistically significant differences were observed, despite the run-time increasing by a factor of 16. We note that the largest (homogeneous) test cases on which
schemes were previously evaluated were only about the order of DP 1 (in size). Even with less expressive
G, only a few percentage point performance differences were observed (in PC and RR), with statistical significance not reported [
4,
7].
In order to confirm the role of
G, we performed a follow-up experiment where we used the originally proposed
G [
4] on DPs 1 and 5, with both values of k. We observed lower performance with the smaller k compared to the Table 4 results, while results for the larger k were only at par with them. The run-times with the less expressive G were obviously lower (for corresponding k); however, the larger-k run-times (with the less expressive G) were higher than the smaller-k run-times with the more expressive
G. All of the differences just described were statistically significant (at the 95% level). This validates previous research findings, while also confirming our stated hypothesis regarding
G.
The Experiment 3 results showed that a sophisticated schema matcher is not always necessary for the purpose of learning a blocking scheme. However, the importance of good schema matching goes beyond blocking and even ER. Schema matching is an important step in overall data integration [
47]. On noisier datasets, a good n:m schema matcher could make all the difference in pipeline performance, but we leave it for future work to evaluate such a case.
The similar run-time trends that were shown by the various systems in
Figure 9a also explain why, in Experiment 2, all the systems simultaneously succeeded or failed on a given DP. Even if we replace our DNF-BSL with an extended version from the literature, the exponential dependence on
k remains.
Figure 9a,b also empirically validate theoretical run-time calculations. Previous research on DNF-BSLs did not theoretically analyze (or empirically report) the algorithmic run-times and scalability explicitly [
4,
5,
6,
7].
Figure 9c demonstrates the encouraging qualitative result that only a few (noisy) samples are typically enough for adequate performance. Given enterprise quality requirements, as well as the expense of domain expertise, high performance for low n and minimal parameter tuning is a practical necessity for industrial deployment. Recall that we retained the SC-threshold at 0.9 for all experiments (after tuning on DP 1), while for the baselines, we had to conduct parameter sweeps for each separate experiment. When combined with the results in both
Table 4 and
Figure 9c, this shows that the system can be a potential use-case in industry. Combined with previous unsupervised results for the second ER step [
1,
2], such a use-case would apply to both relational and Semantic Web data as a fully unsupervised ER workflow, which has thus far remained elusive.