Article

OntoDup: Governance-Aware Entity Matching for Scholarly Knowledge Graph Deduplication

by Jorge Galán-Mena 1, Martín López-Nores 1,*, Daniel Pulla-Sánchez 2, Luis Fernando Guerrero-Vásquez 2 and Juan Pablo Salgado-Guerrero 3

1 atlanTTic Research Center for Telecommunication Technologies, Department of Telematics Engineering, Universidade de Vigo, 36310 Vigo, Spain
2 Artificial Intelligence and Assistive Technologies Research Group (GI-IATa), UNESCO Chair on Support Technologies for Educational Inclusion, Universidad Politécnica Salesiana, Cuenca 170143, Ecuador
3 Economics and Business Management Faculty, Pontificia Universidad Católica del Ecuador, Quito 170143, Ecuador
* Author to whom correspondence should be addressed.
Information 2026, 17(4), 325; https://doi.org/10.3390/info17040325
Submission received: 19 February 2026 / Revised: 20 March 2026 / Accepted: 23 March 2026 / Published: 26 March 2026
(This article belongs to the Special Issue Knowledge Graph Technology and Its Applications, 3rd Edition)

Abstract

Scholarly knowledge graphs integrate bibliographic records from heterogeneous sources and therefore require controlled, auditable deduplication. This paper presents OntoDup, an ontology-driven approach that models entity matching as a governed decision process: Matching outcomes are recorded as reified assertions enriched with governance state, evidence, provenance and operational metadata, while a separate operational view is exposed through policy-driven materialization of consumable identity links. We evaluate OntoDup on the DBLP-ACM and DBLP-Scholar benchmarks under two regimes: (i) a pre-blocked setting using the benchmark candidate lists to compare matching methods under a fixed candidate set, and (ii) an end-to-end setting that generates candidates from the graph with DeepBlocker and applies governed triage and materialization. We report operational precision/recall/F1 computed directly on the graph via SPARQL aggregations, characterize governance workload through state distributions, and quantify inference cost for LLM-based matchers via token and latency metadata attached to assertions. For end-to-end evaluation, we anchor operational links against a full positive reference encoded as idealized validations derived from the benchmark labels, enabling analysis of missed positives in terms of governance status and materialization policy. The experiments show that OntoDup enables evaluation at the level of consumable identity links, review workload, and inference cost, revealing operational trade-offs that are not visible from pairwise matching metrics alone.

1. Introduction

Scholarly communication is increasingly mediated by data infrastructures that aggregate heterogeneous research metadata into large, queryable knowledge graphs (KGs). These graphs support discovery and analytics across institutions and domains, enabling applications such as research profiling [1], topic discovery and monitoring [2], citation recommendation [3] and collaboration or co-author recommendations [4]. As scholarly KGs become more integrated and entity-centric [5,6], upstream data-quality problems—especially identity resolution and duplicate records—have a greater impact on downstream measurements and inference [7,8,9]. Near-duplicate publication records ingested from multiple sources can distort bibliometric indicators, inflate collaboration networks and bias analyses that assume entity uniqueness [6,10]. This makes duplicate record detection (DRD) and identity resolution foundational capabilities for trustworthy scholarly KGs.
State-of-the-art record linkage and entity matching techniques span rule-based similarity pipelines, probabilistic models, supervised learning, deep learning and, more recently, large language models (LLMs) [11,12,13,14,15,16,17]. These methods can provide strong pairwise matching performance, but they often provide limited support for the operational requirements of scholarly KG curation: explicit and queryable evidence, provenance and versioning of assertions, conflict management when multiple agents disagree, and lifecycle control under incremental ingestion. Scholarly KG infrastructures such as VIVO show the value of domain ontologies and shared semantics for integration [18,19], but identity resolution in this setting requires more than a shared schema. It also requires a governance layer that can represent who asserted what, based on which evidence, under which policy, and how decisions evolve over time, in line with concerns about proper inference and accountable curation in KGs [20,21,22].
To address this gap, we present OntoDup, an ontology-driven and provenance-aware approach for DRD in scholarly metadata. OntoDup models duplicate-candidate generation, match assertions, evidence artifacts and decision outcomes as first-class, queryable KG objects, and introduces governance-oriented semantics for conflict detection, policy-driven resolution and traceable lifecycle management. OntoDup does not propose a new standalone pairwise matcher intended to outperform specialized matching models in isolation. Its methodological contribution lies instead in formalizing entity deduplication in scholarly knowledge graphs as a governed decision process in which candidate generation, matching, evidence capture, provenance, conflict management and operational materialization are integrated under explicit semantic control. OntoDup is therefore best understood as a governance-aware methodology for operationalizing entity matching in scholarly KGs, rather than as a replacement for existing blocking or comparison models such as DeepBlocker (https://github.com/qcri/DeepBlocker (accessed on 26 February 2026)) [13], Ditto (https://github.com/megagonlabs/ditto (accessed on 26 February 2026)) [23] or LLM-based comparators. Our contributions are as follows:
  • A VIVO-aligned ontology extension for DRD governance, representing match assertions, evidence and decisions with explicit provenance and versioning.
  • A reproducible, incremental pipeline that ingests heterogeneous bibliographic sources into RDF and supports continuous deduplication under evolving data.
  • A mixed-initiative matching protocol that structures LLM and human assessments as auditable assertions and links them to evidence artifacts.
  • A governance layer for conflicts and closure, enabling policy-aware detection of inconsistent assertions and supporting controlled materialization of identity decisions.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 defines the problem and datasets. Section 4 introduces the OntoDup governance model, and Section 5 details the pipeline. Section 6 reports the experimental results, followed by discussion in Section 7 and conclusions in Section 8.

2. Related Work

Deduplication and entity matching in scholarly domains have been approached from classical rule- and similarity-based record linkage to modern deep learning and large language models. In parallel, scholarly knowledge graphs have evolved into dynamic infrastructures where correct identity resolution is critical for reliable analytics and inference. This section reviews the state of the art with an emphasis not only on matching effectiveness but also on how deduplication decisions are represented, audited and governed over time.

2.1. Duplicate Records and Noise in Scholarly Metadata

Duplicate and ambiguous records arise from heterogeneous metadata practices, incomplete fields, OCR artifacts and inconsistent editorial conventions. In scholarly settings, these issues interact with author name ambiguity and affiliation variability, leading to noisy and fragmented entity representations. Incremental author name disambiguation and identity resolution methods emphasize the need to update clusters as new evidence arrives, rather than assuming static snapshots [24,25]. More broadly, noise and duplication are known to affect downstream analytics in complex ways: Even modest disambiguation errors can reshape observed network topology and distort derived measures [10]. These observations motivate approaches that explicitly model uncertainty, revision and provenance in the deduplication process.

2.2. Candidate Generation and Matching Paradigms

Entity matching pipelines typically separate candidate generation (blocking) from pairwise decision making to remain computationally feasible at scale. Prior work highlights the design space of blocking strategies and the trade-off between recall and efficiency, especially under heterogeneous schemas [13]. Learning-based matchers have shown strong performance, but they often require task-specific labels and may generalize poorly across datasets [12]. Recently, LLM-based matchers have gained attention due to their flexibility and reduced reliance on extensive labeled data; however, evidence indicates that cross-dataset robustness and cost-quality trade-offs remain open challenges [14,15,16]. In this context, a KG-centered approach that captures evidence and decisions as first-class objects can complement model-centric methods by supporting traceability, adjudication and continuous refinement.

2.3. Scholarly Knowledge Graph Infrastructures

A rich line of work has developed infrastructures and ontologies for representing research entities and their relationships. VIVO is a prominent example of an ontology-driven platform for integrating and exposing scholarly profiles, publications and organizations [18,19]. Large-scale academic graphs such as MAKG demonstrate the value of integrated KGs for supporting discovery and analytics, while also illustrating the sensitivity of downstream tasks to identity and deduplication quality [8,9]. Domain applications built on scholarly KGs include semantic exploration of research topics [2], knowledge graph construction and utilization efforts [6] and survey-level mappings of the broader scholarly KG ecosystem [5]. While these infrastructures provide the semantic backbone for integration, governance mechanisms for duplicate detection and conflict-aware identity resolution are less standardized.

2.4. Relational and Graph-Based Record Linkage

Beyond traditional relational linkage, recent work studies record linkage and entity resolution directly over knowledge graphs, where structural context may provide complementary signals to textual similarity. KG-based and graph-based alignment methods are especially attractive when entity neighborhoods, relation patterns, or graph constraints carry informative signals that are not visible from local attributes alone [26]. Their main advantage is therefore the ability to exploit graph structure in addition to field-level similarity, which can be valuable in richly connected and semantically constrained settings.
At the same time, their effectiveness depends on the availability and quality of graph context. In publication-level scholarly deduplication, matching often still relies heavily on noisy textual metadata and source-specific descriptive attributes, especially when local record structure is richer than the surrounding graph neighborhood. In such cases, graph-based alignment is best seen as complementary to attribute- and text-centered matching rather than as a full replacement. Temporal and dynamic perspectives further complicate identity resolution: Entities and their contextual neighborhoods evolve, and linkage signals may drift. Methods for temporal reasoning and completion in dynamic KGs [27,28,29], together with work on proper inference under uncertainty and constraints [20], suggest that identity resolution should be treated as a lifecycle problem rather than a one-shot classification task.

2.5. Provenance, Evidence and Explainability

Operational matching pipelines often require explanations that can be audited by curators and stakeholders. Provenance models provide the technical substrate to represent how a decision was produced, by whom and from which inputs. PROV-O supplies a standardized vocabulary for provenance in the Semantic Web [21], while lightweight vocabularies such as PAV support authoring and versioning metadata at a practical level [22]. In entity matching, evidence can range from field-level agreements (titles, venues, years) to contextual signals (co-authors, citations, affiliations). LLM-based matching intensifies the need for explicit evidence representation because decisions may be produced from unstructured reasoning that must be externalized into verifiable signals [14,15]. OntoDup aligns with this trend by treating evidence and assertions as queryable KG resources rather than opaque classifier outputs.

2.6. Governance and Lifecycle Management for Identity Resolution

Governance-oriented perspectives emphasize that identity resolution is not only a technical problem but also a decision process involving policies, conflicts and controlled materialization of consequences (e.g., merging nodes, propagating identifiers). Work on proper inference highlights the importance of aligning automated reasoning with intended constraints and interpretability requirements in KGs [20]. Probabilistic record linkage frameworks stress uncertainty modeling and principled combination of evidence [11]. However, many existing DRD/ER pipelines still under-specify how assertions are tracked, how conflicts are detected and how resolution decisions evolve under incremental updates. OntoDup addresses this gap by combining (i) an ontology-level governance model for assertions and evidence, (ii) provenance-aware lifecycle management, and (iii) policy-driven conflict handling to support continuous, auditable identity resolution in scholarly KGs.

2.7. Gap Addressed by OntoDup

Prior work on entity matching has mainly focused on improving candidate generation and pairwise matching accuracy. In contrast, provenance and knowledge graph governance research has emphasized traceability, statement annotation and semantic control. What remains underdeveloped is the connection between these two strands: how matching outputs should be represented, reviewed and operationalized once they enter a scholarly knowledge graph. OntoDup addresses this gap by treating deduplication not only as a prediction task, but as a governed decision process. Rather than directly materializing identity links from matcher outputs, it records them as auditable assertions with status, evidence, provenance and operational metadata and separates this governance layer from the operational graph. In this sense, OntoDup complements both attribute-based and graph-based matching approaches: such methods may provide candidate pairs or matching signals, whereas OntoDup governs how resulting identity decisions are represented, reviewed and operationalized in scholarly knowledge graphs.

3. Problem Definition and Datasets

In this section, we formalize the entity matching (EM) task addressed in this work and describe the bibliographic benchmarks used in our experiments. We also detail the training/validation/test split adopted to ensure comparability with prior literature and to enable the selection of thresholds and decision policies without bias from using the test set.

3.1. Problem Definition

OntoDup is intended for scholarly knowledge graph integration settings in which records describing the same publication may be harvested from heterogeneous bibliographic sources. Given two such sources, L and R, the framework assumes that their records have a minimum set of comparable textual attributes, including title, authors, publication year and venue.
For a candidate pair (l, r) ∈ L × R, we aim to estimate a function f : L × R → {0, 1}, where 1 indicates that both records represent the same publication and 0 indicates otherwise [12,23]. Since the Cartesian product |L| × |R| is computationally infeasible in realistic settings, we operate on a reduced candidate set C ⊆ L × R, obtained through a candidate generation (blocking) mechanism [30,31], and evaluate matching performance against a set of labeled pairs Y (ground truth) provided by the benchmarks.
The matching decision is not interpreted solely as a binary prediction, but as a governable assertion within a knowledge graph. Each prediction is materialized as a reified assertion (ontodup:MatchAssertion) with confidence attributes, governance status and evidence. This design is aligned with established provenance vocabularies and lightweight provenance patterns, enabling reproducible auditing and traceability [21,22].
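The evaluation setup above can be sketched in a few lines. This is a minimal, illustrative implementation of scoring a binary decision function f over the candidate set C against the labeled pairs Y; the pair identifiers and toy labels are invented for the example and do not come from the benchmarks.

```python
# Minimal sketch (not the paper's implementation): evaluating a pairwise
# decision function f over a reduced candidate set against labels Y.

def evaluate(decisions, labels):
    """Compute precision, recall and F1 for binary match decisions.

    decisions: dict mapping candidate pair (l, r) -> 0/1 prediction
    labels:    dict mapping candidate pair (l, r) -> 0/1 ground truth
    """
    tp = sum(1 for p, y in labels.items() if y == 1 and decisions.get(p) == 1)
    fp = sum(1 for p, y in labels.items() if y == 0 and decisions.get(p) == 1)
    fn = sum(1 for p, y in labels.items() if y == 1 and decisions.get(p) != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy candidate set: two true duplicates, one non-match.
labels = {("l1", "r1"): 1, ("l2", "r2"): 1, ("l3", "r9"): 0}
decisions = {("l1", "r1"): 1, ("l2", "r2"): 0, ("l3", "r9"): 1}
p, r, f1 = evaluate(decisions, labels)
```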

3.2. Bibliographic Datasets

In this study, the empirical validation relies on two publication-level bibliographic benchmarks:
  • DBLP-ACM (D-A), a paired collection containing bibliographic records from DBLP and ACM [32].
  • DBLP-Scholar (D-S), a paired collection aligning DBLP records with Google Scholar records, characterized by higher textual variability and noisier metadata [32].
Both datasets represent a challenging scenario in which spelling variations, incomplete metadata and duplicates coexist. This setting allows us to analyze how bibliographic noise conditions decision policies and motivates the need for traceability in identity resolution.

3.3. Experimental Partitions and Class Balance

To ensure comparability with standard configurations reported in the literature, we follow the standard training/validation/test partition over a set of precomputed candidate pairs [12,23]. Each split contains positive and negative pairs used to train and evaluate matching models and, crucially for this work, to tune threshold policies (τ_accept and τ_reject) without incurring bias from using the test set.
Table 1 summarizes the distribution of positive and negative pairs per split for both benchmarks. In subsequent sections, these partitions are used (i) to evaluate matching models under the standard benchmarking protocol, (ii) to measure governance-oriented metrics, such as the fraction of positives that remain in the proposed state (review load) or are rejected (false negatives), and (iii) to quantify inference cost for LLM-based models via token counts.
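A two-threshold policy of this kind can be sketched as follows. The status labels mirror the OntoDup vocabulary introduced later, but the exact label for automatic rejection (here `StatusAutoRejected`) and the toy scores are assumptions for illustration only.

```python
# Hedged sketch of a two-threshold triage policy using tau_accept and
# tau_reject, plus the review-load metric (fraction of pairs left proposed).

def triage(score, tau_accept, tau_reject):
    """Map a matcher confidence score to a governance status."""
    if score >= tau_accept:
        return "StatusAutoAccepted"      # promoted automatically
    if score <= tau_reject:
        return "StatusAutoRejected"      # assumed label for auto-rejection
    return "StatusProposed"              # left for human review

scores = {("l1", "r1"): 0.97, ("l2", "r2"): 0.55, ("l3", "r9"): 0.08}
statuses = {pair: triage(s, 0.9, 0.2) for pair, s in scores.items()}
review_load = (sum(1 for st in statuses.values() if st == "StatusProposed")
               / len(statuses))
```

Tuning tau_accept and tau_reject on the validation split trades review load against the risk of auto-accepting false positives or auto-rejecting true matches.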

4. OntoDup Governance Model

This section specifies the governance model used by OntoDup to represent matching decisions, evidence and controlled materialization in a knowledge graph. To achieve this, OntoDup explicitly separates (i) the decision issued by a matching method and its associated evidence, (ii) the operational materialization of similar entities, and (iii) governance relations that signal inconsistencies and enable review and cleansing processes before exposing a conflict-free, trustworthy view of the graph.

4.1. Entities and Relations of the Model

The OntoDup model is a lightweight ontology aimed at representing entity matching decisions in a traceable way, while distinguishing between operationalized knowledge (links consumable by queries or services) and governance knowledge (assertions, states and conflicts). To maximize interoperability, the core reuses established vocabularies: BIBO is used to type the bibliographic record as a documentary entity, and SKOS is used to model the governance lifecycle through a controlled state scheme. Compatibility with VIVO is preserved in two senses: (i) a bibliographic record can be integrated into VIVO-based academic graphs by maintaining standard documentary semantics, and (ii) when temporal precision or information-resource modeling is needed, we rely on patterns and classes commonly used in VIVO. Table 2 summarizes the most relevant TBox/ABox elements (Terminological Box/Assertional Box) introduced by OntoDup and their role in the governance cycle.
From a TBox perspective, the model introduces three central classes:
  • ontodup:ScholarlyRecord represents the source-level bibliographic record and is specialized as a bibliographic document to enable identity links across heterogeneous catalogs.
  • ontodup:MatchAssertion reifies a decision about a pair of entities by explicitly representing the subject–predicate–object triple as an rdf:Statement. This reification prevents the operational link from being confused with evidence or governance state, enabling metadata attachment without polluting the graph with preliminary decisions.
  • ontodup:Evidence acts as a traceable container for the evidence associated with an assertion.
The decision lifecycle is controlled by a domain class ontodup:AssertionStatus as a subclass of skos:Concept, and through individuals that belong to a skos:ConceptScheme. The relationship between a decision and its state is expressed via ontodup:status. The relationship between a decision and its evidence is expressed via ontodup:hasEvidence, allowing one or multiple evidence objects per assertion.
In addition, the model incorporates negative and conflict relations as part of governance control: ontodup:notSameAs explicitly declares non-identity, while ontodup:conflictWith marks contradictions between entities when incompatible assertions exist for the same pair, with the purpose of excluding such cases from conflict-free operational views.
To support operational traceability in experiments with LLMs, certain metadata are modeled as datatype properties on ontodup:MatchAssertion (attempts, tokens and latency). This design binds inference cost and operational footprint to the semantic unit that produced them, while keeping the core of the model focused on governing decisions and their traceability.
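The reification pattern described above can be illustrated with plain triples. In this sketch, Python tuples stand in for an RDF store; the property names follow the text, while the assertion IRI, record IRIs and the literal token/latency values are invented for the example.

```python
# Illustrative sketch: one ontodup:MatchAssertion reified as RDF-style
# triples, with status, evidence and optional LLM cost metadata attached.

def reify_assertion(aid, subject, predicate, obj, status, evidence,
                    tokens=None, latency_ms=None):
    """Return the triples that describe one governed match assertion."""
    triples = [
        (aid, "rdf:type", "ontodup:MatchAssertion"),
        (aid, "rdf:subject", subject),
        (aid, "rdf:predicate", predicate),
        (aid, "rdf:object", obj),
        (aid, "ontodup:status", status),
        (aid, "ontodup:hasEvidence", evidence),
    ]
    if tokens is not None:          # operational metadata for LLM matchers
        triples.append((aid, "ontodup:tokens", tokens))
    if latency_ms is not None:
        triples.append((aid, "ontodup:latencyMs", latency_ms))
    return triples

triples = reify_assertion(
    "ex:ma1", "ex:recA", "owl:sameAs", "ex:recB",
    "ontodup:StatusProposed", "ex:ev1", tokens=842, latency_ms=310)
```

Because the decision lives on the reified node, the pair ex:recA/ex:recB carries no operational owl:sameAs link until materialization promotes it.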

4.2. Governance Semantics: Assertions, States and Materialization

OntoDup adopts a two-layer semantics to decouple governed decisions from operational knowledge. The governance layer is composed of individuals of class ontodup:MatchAssertion that explicitly reify a relation decision between two resources as an rdf:Statement, preserving its (subject, predicate, object) form and enabling the attachment of a lifecycle status (ontodup:status) and operational evidence without introducing ambiguity into the consumable graph. The operational layer, in contrast, corresponds to links that become directly consumable as graph relations, but only when there exists an underlying assertion whose status enables materialization.
Formally, let A be the set of ontodup:MatchAssertion individuals. For each ma ∈ A, let subj(ma), pred(ma) and obj(ma) denote the projections of rdf:subject, rdf:predicate and rdf:object, respectively, and let σ(ma) denote the value of ontodup:status, with range in ontodup:AssertionStatus. An assertion is eligible for materialization if its status indicates either automatic acceptance or human validation. We capture this condition via the predicate Accepted, defined in Equation (1).
Accepted(ma) := (σ(ma) = ontodup:StatusAutoAccepted) ∨ (σ(ma) = ontodup:StatusHumanValidated)    (1)
Based on Equation (1), materialization is modeled as an operator M that projects accepted assertions into operational triples. Intuitively, M promotes to the operational layer only those assertions whose predicates belong to the set of consumable predicates P_op = {owl:sameAs, ontodup:notSameAs}. This is formalized in Equation (2).
M(ma) := {(subj(ma), pred(ma), obj(ma))} if Accepted(ma) ∧ pred(ma) ∈ P_op; ∅ otherwise    (2)
The operational graph is defined as the union of the materialized triples from all governance assertions. In other words, the set of operational facts available for consumption is a derived view obtained via controlled promotion, as stated in Equation (3).
G_op := ⋃_{ma ∈ A} M(ma)    (3)
An immediate consequence of Equations (2) and (3) is that the operational identity relation is not interpreted as a primary fact, but as the result of materializing accepted assertions with predicate owl:sameAs. This relation can be expressed as a set of pairs (s, o), as defined in Equation (4).
SameAs_op := {(s, o) | ∃ ma ∈ A : Accepted(ma) ∧ pred(ma) = owl:sameAs ∧ subj(ma) = s ∧ obj(ma) = o}    (4)
Thus, ontodup:StatusProposed retains candidate decisions with no operational impact, whereas ontodup:StatusAutoAccepted and ontodup:StatusHumanValidated enable link promotion into G_op via M. Additionally, OntoDup introduces audit indicators to make contradictions between decisions explicit, so that consumption-oriented queries can exclude contradictory links whenever a consistent view is needed. After conflicts are resolved through governance actions (e.g., curator overrides such as ontodup:StatusHumanRejected), the operational closure is obtained in the next materialization/reasoning cycle.
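Equations (1)-(4) translate directly into code. The following sketch represents assertions as dictionaries and the materialization operator as a set-valued projection; it is a stand-in for the actual rule-based implementation, with toy resource names.

```python
# Sketch of Equations (1)-(4): assertions as dicts, materialization as a
# projection of accepted assertions onto consumable predicates.

ACCEPTING = {"ontodup:StatusAutoAccepted", "ontodup:StatusHumanValidated"}
P_OP = {"owl:sameAs", "ontodup:notSameAs"}    # consumable predicates

def accepted(ma):                    # Equation (1)
    return ma["status"] in ACCEPTING

def materialize(ma):                 # Equation (2): M(ma)
    if accepted(ma) and ma["pred"] in P_OP:
        return {(ma["subj"], ma["pred"], ma["obj"])}
    return set()                     # no operational impact

def operational_graph(assertions):   # Equation (3): G_op
    g = set()
    for ma in assertions:
        g |= materialize(ma)
    return g

assertions = [
    {"subj": "a", "pred": "owl:sameAs", "obj": "b",
     "status": "ontodup:StatusAutoAccepted"},
    {"subj": "c", "pred": "owl:sameAs", "obj": "d",
     "status": "ontodup:StatusProposed"},    # retained, not promoted
]
g_op = operational_graph(assertions)
same_as_op = {(s, o) for (s, p, o) in g_op if p == "owl:sameAs"}  # Eq. (4)
```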

4.3. PIE Rules for Materialization and Constraint Closure

OntoDup implements materialization and constraint closure via PIE rules executed by the knowledge base inference engine, summarized in Appendix A. These rules serve two purposes: (i) to promote to the operational layer only those governed decisions that satisfy an acceptance policy (auto-accepted or human-validated) and (ii) to close the resulting relations to ensure minimal semantic properties, in particular the closure of owl:sameAs (symmetry and transitivity) and the propagation of negative constraints (ontodup:notSameAs) through equivalences. The process is monotonic and is computed as a fixed-point closure: Given an initial graph G₀ with reified assertions and domain data, the reasoner iteratively applies the rules until obtaining a saturated graph G* that contains both explicit and inferred facts.
Overall, the rules ensure that operational identity is not a direct claim of the matching method, but rather the consequence of a governed assertion whose status enables materialization. Closing owl:sameAs provides minimal consistency to operate at the level of equivalence components, while propagating ontodup:notSameAs through owl:sameAs enforces negative constraints that remain coherent under equivalence.

4.4. Conflict Handling Without Retraction

OntoDup derives the operational view from the governance layer through monotonic rule-based materialization and closure. Conflicts are not handled by retracting derived operational facts; instead, contradictory decisions are made explicit at the governance level and excluded from the conflict-free consumption view. This design is consistent with logical frameworks that tolerate inconsistency during reasoning [33]. Human curation is represented as status transitions on assertions and can be applied between reasoning cycles via SPARQL Update.
A conflict arises when reified assertions coexist over the same pair ( s , o ) with opposing predicates (owl:sameAs versus ontodup:notSameAs), regardless of pair order. The rules ontodupDetectConflictFromAssertions and ontodupDetectConflictFromAssertionsSwapped materialize ontodup:conflictWith (s,o), while ontodupConflictWithSymmetric enforces symmetry. The rules ontodupMarkConflictingAssertionsSameAs and ontodupMarkConflictingAssertionsNotSameAs then assign ontodup:StatusConflict to the assertions involved, preserving them for audit rather than deleting or overwriting them.
OntoDup therefore defines a conflict-free consumption view as the set of owl:sameAs links derived by materialization and filtered by the absence of conflicts, as formalized in Equation (5). This separates operational consumption from unresolved contradiction while keeping conflicting cases queryable in the governance layer.
E_op := {(s, o) | (s, owl:sameAs, o) ∈ G* ∧ (s, ontodup:conflictWith, o) ∉ G*}    (5)
When a conflict is later resolved, for example through curator overrides, the updated governance state is reflected in the next materialization/reasoning cycle, yielding an operational closure aligned with that resolution.
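Conflict detection and the conflict-free view of Equation (5) can be sketched as follows. Pairs asserted with both owl:sameAs and ontodup:notSameAs (in either order) are flagged symmetrically and excluded from consumption, while the underlying triples remain available for audit; the resource names are toy values.

```python
# Sketch of conflict detection and the conflict-free view (Equation (5)):
# conflicting pairs are excluded from consumption, never deleted.

def detect_conflicts(graph):
    """Return the symmetric set of conflictWith pairs."""
    same = {(s, o) for (s, p, o) in graph if p == "owl:sameAs"}
    not_same = {(s, o) for (s, p, o) in graph if p == "ontodup:notSameAs"}
    conflicts = set()
    for s, o in same:
        if (s, o) in not_same or (o, s) in not_same:
            conflicts |= {(s, o), (o, s)}   # symmetric, both orders flagged
    return conflicts

def conflict_free_view(graph):              # Equation (5): E_op
    conflicts = detect_conflicts(graph)
    return {(s, o) for (s, p, o) in graph
            if p == "owl:sameAs" and (s, o) not in conflicts}

g = {("a", "owl:sameAs", "b"),
     ("c", "owl:sameAs", "d"), ("d", "ontodup:notSameAs", "c")}
e_op = conflict_free_view(g)
```

Here the c/d pair is withheld from the consumption view because a contradictory notSameAs claim exists, while the uncontested a/b link remains consumable.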

4.5. Provenance and Intra-Source Anchoring for Traceability

Traceability in OntoDup does not rely solely on reifying matching decisions via ontodup:MatchAssertion, but also on ensuring that each bibliographic record remains anchored to its provenance and to a stable identifier within its source. To this end, OntoDup separates (i) the governance semantics defined in the core ontology and (ii) the explicit declaration of bibliographic sources as instances in an ABox vocabulary. This separation allows operational links and downstream audits to be interpreted reproducibly, while preserving the origin context of every compared entity.
Concretely, ontodup-vocab.vivo.ttl defines the bibliographic sources (DBLP, ACM and Google Scholar) as instances of c4o:BibliographicInformationSource with stable IRIs. These instances serve as reference points for the provenance attribute dcterms:source used in the RDF mappings of bibliographic records so that each ontodup:ScholarlyRecord can be linked to the source from which it was ingested. In addition, OntoDup introduces ontodup:localId as an intra-source identifier, which enables deterministic retrieval of the original record and supports verifiable association of matching decisions.
Under this anchoring, one may ask whether it would suffice to annotate identity links with provenance vocabularies such as PROV-O or to keep decision metadata in external artifacts and perform the final selection outside the graph. Provenance is useful for describing the origin and production context of a decision, and OntoDup remains compatible with that ecosystem by typing ontodup:ScholarlyRecord, ontodup:MatchAssertion and ontodup:Evidence as prov:Entity [21]. However, in deduplication, owl:sameAs links are not only explanatory traces: Once consumed by queries, inference or aggregation, they have operational consequences. A design based only on provenance annotations over operational links leaves this consumption semantics implicit and typically pushes final link selection into client-side filtering or post-processing, making it harder to preserve a queryable account of history, disagreement and policy-dependent materialization.
The operational view is derived from governance states by selective materialization through inference rules. For immediate consumption, one may query a view filtered by the absence of ontodup:conflictWith, but this is only a safeguard at query time and does not replace the closure already produced (e.g., by transitivity of owl:sameAs). In practice, resolution is recorded through state transitions, such as curator decisions in a correction graph, and the operational graph is re-materialized in the next reasoning cycle. This preserves the traceability of assertions while maintaining the correspondence between decision, evidence and consumable result. Figure 1 illustrates this triple-composition pattern, where VIVO/BIBO instantiation is extended with OntoDup provenance and local identifiers.
Table 3 summarizes the artifacts that implement this declarative provenance and identity-anchoring layer, highlighting their role in pipeline traceability and in interpreting decisions throughout the governance lifecycle.

5. OntoDup Pipeline: Ingestion, Matching and Governed Materialization

OntoDup structures deduplication as a reproducible pipeline that transforms heterogeneous bibliographic records into a knowledge graph in which identity links are incorporated under governance control. Figure 2 serves as a roadmap for the subsections that follow, summarizing the staged flow and locating the main modules together with their output artifacts in the experiments we report in Section 6: a pre-blocked setting (Experiment A), which compares matchers under a fixed benchmark candidate set, and an end-to-end setting (Experiment B), in which the reported metrics reflect candidate generation, matching and governed materialization together.
1. Data sources and experimental inputs (Section 5.1). Bibliographic source records and benchmark supervision are prepared as pipeline inputs.
2. Ingestion and base graph construction (Section 5.2). Source records are normalized, mapped to RDF, and loaded into the base graph.
3. Candidate generation (Section 5.3). Potential duplicate pairs are generated over the base graph.
4. Pairwise assessment (Section 5.4). Candidate pairs are evaluated by a selected matching component.
5. Governed assertion persistence (Section 5.5). Matching outcomes are recorded as governed assertions rather than inserted directly as operational links.
6. Controlled materialization (Section 5.6). Accepted assertions are promoted to the operational graph under the active policy.
7. Repository interrogation (Section 5.7). The repository supports governance-oriented audit and operational query views.
In this design, the governance model is fixed, while the blocking component, matcher, and score thresholds are configurable and calibrated for the dataset and experimental regime.

5.1. Data Sources and Experimental Inputs

The pipeline operates over heterogeneous bibliographic records sourced from DBLP, ACM and Google Scholar. These records are normalized and mapped to RDF as instances of ontodup:ScholarlyRecord, preserving stable source identifiers and intra-source keys that anchor each entity to its origin. The resulting graph is organized into named graphs that separate domain data from governance artifacts, enabling the reproducible execution of candidate generation, matching methods and the controlled materialization of operational links.
For experimental evaluation, we additionally rely on the reference labels provided by the benchmarks described in Section 3. These labels are not used to produce decisions within OntoDup but only to tune decision policies on validation partitions and to quantify performance in the two experiments.

5.2. Ingestion and Base Graph Construction

The ingestion pipeline starts from flat bibliographic records provided by the benchmark sources (ACM, DBLP and Google Scholar). Before RDF conversion, the records are normalized to reduce superficial cross-source variation while preserving the original descriptive content used for matching. This preparation is intentionally lightweight and includes trimming and whitespace normalization, normalization of punctuation and delimiter variants, case harmonization where required by downstream components, normalization of publication-year values, and consistent handling of missing or empty attributes. The normalized records are then organized into attribute subsets used by the subsequent stages, including the uniform bibliographic views consumed by blocking and pairwise comparison.
No additional domain-specific text-mining pipeline is introduced at this stage. In particular, the workflow does not apply OCR correction, stemming, synonym expansion, external metadata enrichment, or manually engineered semantic features. Instead, the normalized records are transformed into RDF through declarative mappings aligned with the VIVO ontology and extended with OntoDup-specific properties. The conversion is performed with OntoRefine, which takes the normalized files and their mapping documents as input, materializes ABox instances according to the ontology, and exports the result as N-Quads. Named graphs are assigned to separate provenance by source and domain, enabling graph-level reproducibility, traceability and logical isolation between record sets.
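The lightweight normalization described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline's actual code: the field names (`title`, `authors`, `venue`, `year`) and the specific delimiter rules are hypothetical simplifications of the normalization steps listed in the text.

```python
import re

# Illustrative sketch of lightweight record normalization: trimming, whitespace
# collapsing, delimiter harmonization, case harmonization, publication-year
# cleanup, and consistent handling of missing values. Field names are hypothetical.

def normalize_record(rec):
    out = {}
    for key, value in rec.items():
        if value is None or not str(value).strip():
            out[key] = None                              # consistent missing-value handling
            continue
        v = re.sub(r"\s+", " ", str(value)).strip()      # trim + collapse whitespace
        v = v.replace(";", ",")                          # harmonize delimiter variants
        if key == "year":
            m = re.search(r"\d{4}", v)                   # keep a 4-digit year if present
            v = m.group(0) if m else None
        elif key in ("title", "venue"):
            v = v.lower()                                # case harmonization
        out[key] = v
    return out

rec = {"title": "  Deep   Entity Matching ", "authors": "A. Smith; B. Lee",
       "venue": "SIGMOD", "year": "2018."}
clean = normalize_record(rec)
```

The point of keeping this stage deliberately shallow, as the text notes, is that no semantic enrichment happens here; the descriptive content consumed by matching is preserved.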

5.3. Candidate Generation

A naive approach to EM between two collections L and R would evaluate the full Cartesian product L × R, requiring |L| × |R| comparisons, which is computationally infeasible in realistic settings. Therefore, the pipeline incorporates a blocking stage that constructs a reduced candidate set C ⊆ L × R, aiming to minimize computational cost without sacrificing coverage of true matches. In this context, the key criterion is blocking recall, defined as the fraction of truly matching pairs that are included in C.
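Blocking recall as defined above is straightforward to compute; the following minimal sketch (with toy pair identifiers) makes the definition concrete.

```python
# Blocking recall: fraction of truly matching pairs covered by candidate set C.
def blocking_recall(candidates, true_matches):
    covered = sum(1 for pair in true_matches if pair in candidates)
    return covered / len(true_matches)

C = {("l1", "r1"), ("l2", "r3"), ("l3", "r2")}          # candidate pairs after blocking
gold = {("l1", "r1"), ("l2", "r3"), ("l4", "r4")}       # reference matching pairs
recall = blocking_recall(C, gold)                       # 2 of 3 true matches survive blocking
```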
In line with our experimental design, we employ two complementary candidate selection mechanisms. In Experiment A (pre-blocked), we directly use the precomputed candidate pairs provided by the benchmarks (Section 3), together with their reference labels, enabling direct comparability with standard evaluation protocols. In Experiment B (end-to-end), we generate C dynamically via deep-learning-based blocking, following the approach of Thirumuruganathan et al. [13]: we implement DeepBlocker using dense (embedding) representations learned with an autoencoder and top-k retrieval per record, operating over attributes retrieved from the knowledge base. This integration reduces the dependence on official candidate lists and strengthens the pipeline’s ability to incorporate future datasets that lack precomputed candidates, while maintaining experimental control by measuring the impact of blocking on both coverage and computational cost.
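The top-k retrieval step can be illustrated as below. This is a toy sketch, not DeepBlocker itself: the real system learns record embeddings with an autoencoder, whereas here two-dimensional vectors are given directly and similarity is plain cosine.

```python
import math

# Toy sketch of top-k candidate generation over dense record embeddings.
# In DeepBlocker the embeddings are learned; here they are supplied as-is.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_candidates(left, right, k):
    """For each left record, keep its k nearest right records as candidate pairs."""
    pairs = set()
    for lid, lvec in left.items():
        ranked = sorted(right, key=lambda rid: cosine(lvec, right[rid]), reverse=True)
        pairs.update((lid, rid) for rid in ranked[:k])
    return pairs

left = {"l1": [1.0, 0.0], "l2": [0.0, 1.0]}
right = {"r1": [0.9, 0.1], "r2": [0.1, 0.9], "r3": [0.7, 0.7]}
C = top_k_candidates(left, right, k=1)
```

As in the text, k directly controls the trade-off between blocking recall (larger k retrieves more true matches) and the downstream comparison workload.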

5.4. Pairwise Assessment

Once the candidate set C is built, the pipeline performs pairwise comparison as a stage that is separate from the operational graph. Each pair ( l , r ) C is evaluated from a uniform representation of bibliographic attributes such as title, authors, venue and year, enabling different methods to operate over the same inputs. In this implementation, we consider two families of methods: (i) a neural matcher trained for bibliographic EM (Ditto) [23] and (ii) a zero-shot evaluator based on LLM prompting [15,17]. Both approaches yield a continuous score interpretable as match confidence, enabling consistent comparisons of quality and cost under the same protocol, both in Experiment A with pre-blocked pairs and in Experiment B with an end-to-end flow.
The matching output is not incorporated directly as consumable links but recorded as governance knowledge. Each prediction materializes as a reified ontodup:MatchAssertion that makes the (subject, predicate, object) triple explicit and attaches three components: the decision score, the provenance of the emitting method and minimal textual evidence for auditing. Based on these elements, a decision policy assigns a lifecycle state (e.g., proposed or auto-accepted) and only assertions in enabling states are projected to the operational layer via the materialization and conflict-control rules described in Section 4.
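The shape of such a reified assertion can be sketched as a simple record type. This is an illustrative simplification: in OntoDup these components are RDF properties on an ontodup:MatchAssertion individual, not Python fields, and the field names below are hypothetical.

```python
from dataclasses import dataclass

# Illustrative shape of a reified match assertion: the proposed triple made
# explicit, plus score, provenance and minimal evidence. Names are hypothetical.

@dataclass
class MatchAssertion:
    subject: str              # left record IRI
    object: str               # right record IRI
    predicate: str = "owl:sameAs"
    score: float = 0.0        # matcher confidence
    emitted_by: str = ""      # provenance: which matcher produced the decision
    evidence: str = ""        # minimal textual evidence for auditing
    status: str = "Proposed"  # lifecycle state assigned by the decision policy

ma = MatchAssertion(
    subject="dblp:recordA", object="acm:recordB",
    score=0.97, emitted_by="ditto-v1",
    evidence="title and venue agree; year equal",
)
```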

5.5. Governed Assertion Persistence

In order to convert the continuous score produced by the matchers into governable actions, OntoDup applies a decision policy that assigns lifecycle states to each ontodup:MatchAssertion. Operationally, the pipeline distinguishes three regions: auto-acceptance when confidence exceeds a high threshold τ autoaccept, suggestion for review when the score falls between τ proposed and τ autoaccept, and rejection when the score is below τ proposed. These thresholds are tuned using the validation partitions described in Section 3, keeping the test set exclusively for evaluation. As a result, the assigned status (e.g., ontodup:StatusAutoAccepted or ontodup:StatusProposed) is not a mere model label, but an explicit decision that controls whether an assertion remains in the audit queue or becomes eligible for materialization.
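The three-region policy reduces to a small triage function. The sketch below uses placeholder threshold values; in the paper these are tuned on validation partitions.

```python
# Decision policy mapping a continuous matcher score to a lifecycle state.
# Threshold values here are placeholders, not the calibrated settings.

def assign_status(score, tau_proposed, tau_autoaccept):
    if score >= tau_autoaccept:
        return "AutoAccepted"    # eligible for materialization
    if score >= tau_proposed:
        return "Proposed"        # retained in the audit queue
    return "Rejected"            # negative decision (notSameAs)

statuses = [assign_status(s, tau_proposed=0.5, tau_autoaccept=0.95)
            for s in (0.99, 0.7, 0.2)]
```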
Additionally, the model supports states for advanced governance. When an assertion is reviewed and confirmed by a human evaluator, it can be promoted to ontodup:StatusHumanValidated while preserving the traceability of the decision. If incompatible decisions exist for the same pair, the pipeline marks the case as a conflict and prevents its consumption in the contradiction-free operational view, as formalized in Section 4. This organization connects directly to the experiments: In both A and B, the policy controls link promotion and enables measuring both performance and review workload.

5.6. Controlled Materialization

Once decisions are recorded as ontodup:MatchAssertion individuals and labeled by the state policy, OntoDup executes controlled materialization as a step that is separate from matching. Instead of directly inserting identity links, the reasoner applies PIE rules to promote only assertions in enabling states into operational triples, generating owl:sameAs links and, when applicable, ontodup:notSameAs constraints. This promotion implements the two-layer semantics described in Section 4: The governance layer preserves the complete history of decisions and evidence, whereas the operational layer exposes a consumable set of links derived from accepted decisions.
In parallel, the same rules detect contradictions among incompatible decisions and materialize conflict indicators, enabling the derivation of a conflict-free operational view that excludes pairs marked for audit. This view is used as the stable output of the pipeline for downstream analysis and for quantifying experimental results: In both A and B we evaluate the resulting set of operational links against the reference labels. In this sense, conflict filtering supports immediate consumption, whereas the closure consistent with conflict resolutions is produced in the subsequent materialization/reasoning cycle.

5.7. Repository Interrogation and Reproducibility Support

Under OntoDup’s two-layer model, a matching decision is first recorded as a governable assertion and only affects the consumable graph when a policy enables its materialization. Consider two bibliographic records r 1 (DBLP) and r 2 (Google Scholar) with high similarity in title and venue but with variations in author formatting. After candidate generation, the pair ( r 1 , r 2 ) is represented in the governance layer by an instance m a of ontodup:MatchAssertion, where rdf:subject, rdf:predicate and rdf:object encode the proposed link. At this stage, the state ontodup:StatusProposed retains the case for inspection without introducing operational links. When the decision policy promotes the state to ontodup:StatusAutoAccepted (or to ontodup:StatusHumanValidated after human review), the assertion becomes eligible for materialization, and the corresponding link can be projected into the consumable graph by the materialization rules (summarized in Appendix A).
In the presence of disagreements, OntoDup makes contradictions explicit without relying on in-place triple retraction during online reasoning. If, for the same pair, another agent issues an incompatible decision (e.g., a negative assertion using ontodup:notSameAs), OntoDup detects the contradiction, materializes ontodup:conflictWith and marks the involved assertions with ontodup:StatusConflict. This signal supports two complementary uses via SPARQL: (i) auditing, by retrieving decision history, evidence and operational metadata attached to each assertion and (ii) consumption, by querying a conflict-free operational view that excludes links flagged by conflictWith, consistent with Equation (5). Importantly, SPARQL does not “rebuild” the inferential closure when conflicts are present; rather, closure is re-computed in the next materialization/reasoning cycle after conflict-resolution actions (modeled as status overrides on the underlying assertions) are recorded.
Appendix B includes representative SPARQL audit and operational inspection queries as examples to illustrate minimal and reproducible patterns that operate on the integrated repository state.

6. Results

This section presents the results of the two experimental scenarios considered. Experiment A is formulated under the pre-blocked regime provided by the benchmarks, in which the candidate set C is fixed and shared across all methods, enabling a controlled comparison of performance on the test split. Experiment B analyzes the end-to-end behavior of the pipeline by incorporating candidate generation and the matching stage within the same flow; therefore, the resulting metrics jointly reflect the blocking coverage, the comparator’s discriminative capacity, and the governance policy applied to assertions.

6.1. Experiment A (Pre-Blocked): Operational Quality, Governance Workload and Inference Cost

Experiment A evaluates entity matching over precomputed candidate pairs provided by the benchmarks, which enables comparing methods under the same candidate set C and quantifying performance against the reference labels of the test split described in Section 3. Unlike a classical pipeline, OntoDup does not incorporate predictions as consumable links immediately. Each decision is recorded as a reified ontodup:MatchAssertion with a governance state and minimal evidence; only assertions in enabling states are projected to the operational view through the materialization and conflict-control rules (Section 4). Accordingly, we report results through three complementary lenses: (i) the quality of the links materialized in the operational view, (ii) the distribution across governance states, which makes review workload and the risk of discarding positives explicit, and (iii) the inference cost for LLM-based methods in tokens and latency.
The decision policy defines three regions: rejected, proposed and auto-accepted. For Ditto, τ proposed is tuned on the validation split by maximizing F 1 via grid search (Equation (6)), and  τ autoaccept is selected as the minimum threshold that satisfies a precision constraint, enforcing τ autoaccept τ proposed so that auto-acceptance is a subset of the proposed set (Equation (7)).
\[ \tau_{\mathrm{proposed}} = \operatorname*{arg\,max}_{\tau \in [0,1]} \; F_1\!\left(y,\ \mathbb{I}[s \geq \tau]\right) \tag{6} \]
\[ \tau_{\mathrm{autoaccept}} = \min\left\{ \tau \in [0,1] \;:\; \mathrm{Prec}\!\left(y,\ \mathbb{I}[s \geq \tau]\right) \geq \pi^{*},\ \mathrm{Rec}\!\left(y,\ \mathbb{I}[s \geq \tau]\right) \geq \rho_{\min},\ \tau \geq \tau_{\mathrm{proposed}} \right\} \tag{7} \]
OntoDup does not eliminate threshold selection; rather, it makes threshold-driven decisions explicit, queryable, and operationally auditable. In the present experiments, τ proposed and τ autoaccept are calibrated on validation data and should be regarded as dataset- and regime-specific policy parameters rather than as universally transferable settings:
  • On DBLP-ACM, validation tuning yields τ proposed = 0.6865 and τ autoaccept = 0.9810 , reaching precision 0.9903 in the auto-accepted region.
  • On DBLP-Scholar, the target precision π * = 0.99 is infeasible under the minimum recall constraint, and is therefore relaxed to the best attainable value, with τ autoaccept = 0.9995 and precision 0.9654 .
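The two-step calibration can be sketched as a grid search on a validation split, mirroring Equations (6) and (7): first pick the F1-maximizing triage threshold, then the smallest threshold above it that meets the precision target and minimum recall. This is an illustrative sketch with toy labels and scores, not the calibration code used in the experiments.

```python
# Sketch of two-step threshold calibration on validation data.

def prf(y, s, tau):
    """Precision, recall and F1 of the decision rule I[s >= tau]."""
    tp = sum(1 for yi, si in zip(y, s) if yi and si >= tau)
    fp = sum(1 for yi, si in zip(y, s) if not yi and si >= tau)
    fn = sum(1 for yi, si in zip(y, s) if yi and si < tau)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def calibrate(y, s, pi_star, rho_min, grid):
    # Step 1 (Eq. 6): triage threshold maximizing F1.
    tau_proposed = max(grid, key=lambda t: prf(y, s, t)[2])
    # Step 2 (Eq. 7): smallest threshold >= tau_proposed meeting the
    # precision target and the minimum-recall constraint.
    feasible = [t for t in grid
                if t >= tau_proposed
                and prf(y, s, t)[0] >= pi_star
                and prf(y, s, t)[1] >= rho_min]
    tau_autoaccept = min(feasible) if feasible else None  # relaxed when infeasible
    return tau_proposed, tau_autoaccept

y = [1, 1, 1, 0, 0, 1]
s = [0.95, 0.90, 0.60, 0.40, 0.70, 0.85]
grid = [i / 20 for i in range(21)]
tau_p, tau_a = calibrate(y, s, pi_star=1.0, rho_min=0.5, grid=grid)
```

When no threshold satisfies the precision target, `calibrate` returns `None` for the auto-accept threshold; the paper's DBLP-Scholar case corresponds to relaxing the target to the best attainable precision instead.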
For LLMs, the binary matching decision is operationalized consistently with prior work by forcing a Yes/No output and interpreting a positive signal when the model affirms the match; this practice is typically evaluated by parsing “yes” as a match decision [15,17]. In OntoDup, this decision is mapped to governance by making match=true equivalent to auto-acceptance, while the continuous score p is used only for triage: Cases with p ≥ 0.5 but match=false are retained as reviewable proposals, as the prompt indicates insufficient evidence to auto-accept but not enough to discard conclusively; in contrast, p < 0.5 is assigned to rejection, materialized as ontodup:notSameAs with status ontodup:StatusAutoAccepted.
To maintain traceability and reproducibility, all Experiment A metrics are computed directly on the graph through SPARQL aggregations over the named graphs of assertions. In particular, each run writes its ontodup:MatchAssertion instances into a prediction graph per method and configuration, and the evaluation links those assertions to the benchmark test-split reference labels, which are also represented in the graph. Under this setup, T P / F P / F N counts and derived metrics (precision, recall and F 1 ) are obtained by filtering only assertions in ontodup:StatusAutoAccepted, while governance workload is reported as the AutoAccepted/Proposed/Rejected distribution and the number of true positives that remain in Proposed or are lost in Rejected. For LLM-based methods, the same SPARQL scheme additionally retrieves per-assertion cost metadata such as input/output tokens and latency, enabling reporting averages and sums by dataset and model without relying on external logs.
Operational impact is quantified only on auto-accepted links, which are the ones projected to the consumable view G o p . Under this criterion, Table 4 summarizes, for DBLP-ACM (D-A) and DBLP-Scholar (D-S), the total number of operational links N o p , the total number of positives in the test split P test and standard classification metrics ( T P / F P / F N , precision, recall and F 1 ). For D-A, F 1 values range from 0.9226 (gpt-4.1-nano) to 0.9801 (gpt-4o), while Ditto attains precision 0.9853 and recall 0.9077 . For D-S, Ditto reaches F 1 = 0.9387 (precision 0.9701 , recall 0.9093 ) and gpt-4o records F 1 = 0.9222 ; additionally, we observe lower operational coverage for configurations such as gpt-4o-mini (recall = 0.7121 ) and larger N o p for gpt-4.1-nano ( N o p = 1299 ), where the increase in false positives is reflected in a precision of 0.7691 .
Complementarily, the distribution of assertions by governance state quantifies review workload and characterizes which fraction of true matches is not materialized in G o p . Table 5 reports, for each method, the counts of AutoAccepted assertions (materialized positive links), Proposed assertions (plausible cases retained for audit) and Rejected assertions (negative decisions materialized as ontodup:notSameAs with ontodup:StatusAutoAccepted). We further break down two test-split indicators: P test prop , true positives retained in Proposed, and  P test nm , true positives assigned to a negative decision.
  • For DBLP-Scholar, for instance, Ditto yields P test nm = 67 and P test prop = 30 , while gpt-4o yields P test nm = 1 and P test prop = 91 , evidencing different trade-off profiles between non-materialization due to proposals and non-materialization due to negative decisions.
  • For DBLP-ACM, we observe configurations where Proposed is mostly composed of negative cases, such as gpt-4.1 with 82 proposals and P test prop = 0 , as well as cases where the loss of positives concentrates on negative decisions, such as gpt-4.1-nano with P test nm = 15 .
Regarding inference cost, Table 6 summarizes, for each dataset and LLM model, the number of evaluated pairs n, the average token consumption per pair (input, output and total), as well as the average latency and the aggregated total latency. Because the attribute serialization scheme and the prompt payload are kept constant per dataset, the average input tokens remain stable across models within the same benchmark, while differences are reflected mainly in the observed latency and, to a lesser extent, in output length. These statistics are retrieved from the graph via SPARQL aggregations over per-assertion cost metadata, enabling reporting means and sums without relying on external instrumentation.

6.2. Experiment B: Governed End-to-End Matching

The goal of Experiment B is to evaluate the end-to-end behavior of OntoDup in a context where the candidate set C does not come from precomputed benchmark lists but is generated dynamically via a neural blocker. Under this regime, the reported metrics jointly reflect (i) the coverage of candidate generation (blocking recall), (ii) the discriminative capacity of the comparator, and (iii) the effect of the governance policy on which links reach the operational view G o p and which fraction remains in the audit queue.
In this experiment, we employ Ditto as the main comparator because it exhibited stable and consistent behavior in the precision– F 1 balance on both benchmarks (DBLP-ACM and DBLP-Scholar) under the pre-blocked regime of Experiment A. This stability is particularly relevant in the end-to-end flow because, as the number of candidates increases drastically, any bias toward producing too many positives or a poorly calibrated threshold can amplify operational error or render the review queue unmanageable.
In the end-to-end setting, the candidate pair set C no longer comes from precomputed benchmark lists, but is generated dynamically from the graph using DeepBlocker. Operationally, for each record on the left side we retrieve the top-k nearest neighbors on the right side using dense representations, so that k directly controls the trade-off between coverage (blocking recall) and comparison workload. We therefore select k under a pragmatic criterion, adopting values consistent with the configuration reported by DeepBlocker while ensuring high positive coverage under the noise conditions of each benchmark:
  • On DBLP-ACM, a small k is sufficient because the records exhibit higher consistency and cleaner signals, achieving blocking recall of 99.6 % with k = 5 .
  • On DBLP-Scholar, textual variability and incomplete metadata require expanding the retrieved neighborhood to avoid losing true positives during blocking; we therefore use k = 150 , reaching blocking recall of 98.1 % .
Table 7 summarizes the selected k values, the achieved coverage and the resulting candidate volume in each case.
Once C is fixed via DeepBlocker, threshold selection is no longer a purely predictive exercise and becomes a governance decision: the goal is no longer to maximize a single global metric on validation, but to control which fraction of the flow remains reviewable and which fraction is automatically promoted to the operational view. Under this regime, τ proposed is not interpreted as the best classification cutoff, but as a triage threshold that defines the beginning of the band worth auditing. Accordingly, in Experiment B we change the selection criterion: τ proposed is chosen under explicit coverage and review-load constraints, rather than by maximizing F 1 on the validation split as in the pre-blocked setting.
Concretely, we impose a minimum recall on validation ( ρ min = 0.95 ) to avoid losing true positives due to an overly high threshold, and we additionally fix a review budget that limits the volume of pairs that remain in ontodup:StatusProposed. This constraint is deliberately benchmark-dependent because noise and candidate-set size induce different operational sensitivities:
  • For DBLP-ACM, where k is small and the candidate set is more controlled, we adopt a strict budget ( b = 0.015 ), which forces a small review queue; under these conditions, the process selects τ proposed = 0.6225 , maintaining recall 0.9865 and precision 0.9755 within the proposed band.
  • For DBLP-Scholar, the high-k blocking substantially increases candidate volume and variability; requiring the same budget would cause a strong coverage drop. We therefore allow a larger budget ( b = 0.20 ), which yields τ proposed = 0.0005 , preserving recall ( 0.9822 ) at the cost of lower precision in the reviewable band ( 0.8415 ), which is consistent with a setting where the purpose of Proposed is to retain plausible cases for audit and avoid premature rejection of positives.
To formalize this policy, we first define the review rate as the fraction of pairs whose scores fall in the reviewable band, i.e., those that are proposed but not auto-accepted (Equation (8)).
\[ \mathrm{RevRate}(\tau_p, \tau_a) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\!\left[\tau_p \leq s_i < \tau_a\right] \tag{8} \]
With this definition, τ proposed is selected as the minimum threshold that guarantees minimum coverage and respects the review budget b, given τ autoaccept (Equation (9)).
\[ \tau_{\mathrm{proposed}} = \min\left\{ \tau \in [0,1] \;:\; \mathrm{Rec}\!\left(y,\ \mathbb{I}[s \geq \tau]\right) \geq \rho_{\min},\ \mathrm{RevRate}(\tau, \tau_{\mathrm{autoaccept}}) \leq b \right\} \tag{9} \]
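The budget-constrained selection in Equations (8) and (9) can be sketched directly. The labels, scores and constraint values below are toy data for illustration; they are not the validation splits or the calibrated values reported in the experiments.

```python
# Sketch of review-rate computation and budget-constrained triage threshold
# selection, following Equations (8)-(9). Data and constants are illustrative.

def rev_rate(scores, tau_p, tau_a):
    """Fraction of pairs whose score lands in the reviewable band [tau_p, tau_a)."""
    return sum(1 for s in scores if tau_p <= s < tau_a) / len(scores)

def recall_at(y, s, tau):
    tp = sum(1 for yi, si in zip(y, s) if yi and si >= tau)
    pos = sum(y)
    return tp / pos if pos else 0.0

def select_tau_proposed(y, s, tau_autoaccept, rho_min, budget, grid):
    """Smallest threshold meeting the recall floor within the review budget."""
    feasible = [t for t in grid
                if recall_at(y, s, t) >= rho_min
                and rev_rate(s, t, tau_autoaccept) <= budget]
    return min(feasible) if feasible else None

y = [1, 1, 1, 1, 0, 0, 0, 0]
s = [0.99, 0.96, 0.80, 0.75, 0.60, 0.30, 0.20, 0.10]
grid = [i / 100 for i in range(101)]
tau_p = select_tau_proposed(y, s, tau_autoaccept=0.95, rho_min=0.95,
                            budget=0.25, grid=grid)
```

Taking the minimum feasible threshold keeps the reviewable band as wide as the budget allows, which matches the stated purpose of Proposed: retaining plausible cases for audit rather than rejecting them prematurely.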
In turn, τ autoaccept retains its high-confidence role: It is selected as the minimum threshold that attempts to satisfy a precision target π * = 0.99 , with a minimum recall and enforcing τ autoaccept τ proposed so that auto-acceptance is a subset of the proposed region. This criterion is consistent with the formulation used in the pre-blocked setting (Equation (7)), but it is applied here under the end-to-end regime where candidate volume and score distribution are conditioned by blocking. On DBLP-ACM this remains feasible with τ autoaccept = 0.9810 and precision 0.9903 in the auto-accepted region. On DBLP-Scholar, the target π * = 0.99 is unattainable under the constraints and is therefore relaxed to the best available value, yielding τ autoaccept = 0.9995 with precision 0.9645 .
In addition, in the end-to-end setting, we avoid materializing negative decisions en masse for all pairs with s < τ proposed . In a realistic flow, those pairs correspond to early-discarded comparisons and typically do not become persistent knowledge in the graph; recording them as ontodup:notSameAs at scale would produce a dominant volume of low-value negative facts, increasing storage and query cost and diluting the useful audit signal. Therefore, in Experiment B the governance graph focuses on (i) auto-accepted positive assertions, which are the only ones promoted to the operational view, and (ii) proposed assertions, which define the reviewable band and make the audit workload under budget b explicit.
In order to support an end-to-end reference without additional annotation effort, we encode the benchmark positives as a validation anchor in the governance layer: Each pair labeled as a match is recorded as an ontodup:MatchAssertion with status ontodup:StatusHumanValidated. This construction provides a stable comparison point for the end-to-end flow. In the derived operational view, it enables measuring which proportion of reference positives is covered by ontodup:StatusAutoAccepted, which fraction remains in ontodup:StatusProposed as the auditable queue, and which part does not appear under the candidate set generated by blocking. Under this anchor, the metrics reported below interpret operational performance as a joint function of blocking, triage policy and selective materialization, while workload effects should be read as optimistic with respect to real-world review variability.
Under the end-to-end regime, evaluation is performed on the operational view derived by materialization: Only pairs that reach ontodup:StatusAutoAccepted are projected as consumable links in G o p , while the rest remain in the governance layer as an audit signal. In this frame, the reported metrics reflect the operational behavior of the full flow and help explain why, on DBLP-Scholar, F 1 decreases relative to the pre-blocked setting: The drop is dominated by coverage losses in materialized links, rather than by a general deterioration in precision.
To establish a contrast against a complete positive reference, we encode the global ground truth as validation facts for positive links only, representing them as owl:sameAs assertions with status ontodup:StatusHumanValidated. Against this complete positive reference anchor, we compare the set of owl:sameAs links effectively materialized by Ditto (after closure rules), computing T P / F P / F N and derived metrics via SPARQL aggregations over the named evaluation graphs. Table 8 summarizes the resulting operational performance under this criterion.
These results show that precision remains high on both benchmarks, which is consistent with a conservative materialization policy that promotes to G o p only high-confidence decisions. The difference concentrates on coverage: On DBLP-Scholar the operational view retrieves a smaller fraction of positives in the reference anchor (Rec. = 0.54123 ), which pulls F 1 down to 0.69886 despite precision 0.98603 . This degradation is not driven by an increase in false positives (FP = 41 ), but by a large volume of positives that do not get materialized and are therefore counted as false negatives (FN = 2453 ) under a strictly operational evaluation.
The explanation becomes transparent by decomposing those false negatives according to the governance logic. In this flow, a portion of positives is retained in ontodup:StatusProposed (reviewable queue) and another portion falls below τ proposed and is not preserved as persistent governance knowledge, because in Experiment B we opted for selective materialization rather than recording negative decisions at scale. Table 9 summarizes this decomposition:
  • On DBLP-Scholar, 1766 positives remain as Proposed, and 687 positives fall outside persistent triage; both components sum exactly to the 2453 FN reported in Table 8.
  • On DBLP-ACM, the same phenomenon exists but at much smaller magnitude (141 proposed and 13 missing, totaling 154 FN), which preserves a high F 1 ( 0.95207 ).
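The decomposition above is exact by construction: every operational false negative is either a positive held in Proposed or a positive that fell below the triage threshold. The check below reproduces the reported figures.

```python
# Exact decomposition of operational FN into positives retained as Proposed
# and positives below the triage threshold (not persisted in Experiment B).
# Counts are the DBLP-Scholar and DBLP-ACM figures reported in the text.

fn_decomposition = {
    "DBLP-Scholar": {"proposed": 1766, "below_triage": 687},
    "DBLP-ACM": {"proposed": 141, "below_triage": 13},
}

fn_totals = {ds: parts["proposed"] + parts["below_triage"]
             for ds, parts in fn_decomposition.items()}
```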
Under these conditions, F 1 on DBLP-Scholar decreases because the operational evaluation penalizes any positive that is not promoted to G o p , even if it remains traceably recorded as ontodup:StatusProposed and is, by design, intended for subsequent human review. In other words, the drop aligns with the two-layer semantics: AutoAccepted defines the consumable, audit-ready set, whereas Proposed quantifies audit workload and retains potential coverage that is not yet reflected as an operational link.
On the same operational view used to measure performance, false positives become a natural audit object because they are already bounded by the materialization policy: They correspond to owl:sameAs links promoted with ontodup:StatusAutoAccepted that are not supported by the positive reference anchor. In this setup, auditing does not require reconstructing the pipeline or inspecting all candidates, but explicitly querying the small subset of operational links that conflict with the reference anchor. In our results, this volume is small (54 for DBLP-ACM and 41 for DBLP-Scholar, Table 8), which makes manual inspection feasible with full traceability, prioritizing cases where the operational error cost is highest.
Thus, traceability becomes operational because the governance graph retains, for each ontodup:MatchAssertion, the leftRecord/rightRecord pair, the matchScore, the status and an evidence comment that summarizes inference context (model, blocking type and effective thresholds). Rather than embedding long queries in the paper, the audit flow relies on two query patterns: (i) retrieve AutoAccepted links that are not supported by the human positive anchor, and (ii) for each, bring minimal metadata and the evidence text stored as a comment. Table 10 illustrates the kind of extract obtained: promoted links and queued auditable cases (Proposed) with their score and a compact summary of the triage and auto-accept thresholds used in that run.
When an operational false positive is identified against the reference anchor, correction is implemented as an explicit governance action that does not erase traceability of the original event. Rather than retracting derived operational links or removing the underlying assertion, the same MatchAssertion is re-labeled through a curator override that records a rejection status in a dedicated correction graph, preserving the assertion identifier and the originating left/right pair.
The accumulated effect of these corrections shows directly in the count of deduplicated records, measured as the number of distinct ontodup:ScholarlyRecord before and after materialization/closure and before and after incorporating human rejections. To keep the computation stable, the base count is obtained over source graphs without inferences, while the deduplicated count is computed over the materialized graph with automatic owl:sameAs expansion disabled, so that the total reflects the number of resulting entities after closure rather than the number of equivalent aliases. Table 11 summarizes the picture:
  • On DBLP-ACM the flow goes from 4910 base records to 2660 inferred records without rejections, and applying 54 rejection overrides increases the total to 2690.
  • On DBLP-Scholar, the total goes from 66,879 base records to 61,572 inferred records without rejections, and increases to 61,604 after applying 41 rejection overrides.
This variation is consistent with the semantics of correcting a false positive under owl:sameAs closure: Rejecting one promoted link can split an equivalence component, increasing the number of distinct entities in the deduplicated view.
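The component-splitting effect can be made concrete with a small union-find sketch over toy records: counting distinct entities as connected components of owl:sameAs links, rejecting one promoted link increases the entity count. This is an illustration of the semantics, not the repository's closure mechanism.

```python
# Counting deduplicated entities as connected components of sameAs links.
# Rejecting a promoted link can split a component, raising the entity count.

def count_entities(records, same_as_links):
    parent = {r: r for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in same_as_links:
        parent[find(a)] = find(b)          # union the two components
    return len({find(r) for r in records})

records = ["d1", "d2", "a1", "a2", "g1"]
links = [("d1", "a1"), ("d2", "a2"), ("a2", "g1")]
before = count_entities(records, links)                         # all links promoted
after = count_entities(records, [l for l in links               # one link rejected
                                 if l != ("a2", "g1")])
```

Here rejecting ("a2", "g1") splits one equivalence component into two, so the deduplicated view gains one entity, mirroring the increases reported after the rejection overrides.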
Overall, this strategy preserves a useful operational criterion: The operational graph remains lightweight, centered on promoted links (ontodup:StatusAutoAccepted), while corrections are recorded as human decisions in a separate, queryable and reproducible graph. In this way, audit and correction are integrated into the matching lifecycle: The system retains sufficient evidence to explain why a link was promoted and preserves the trace of why it was rejected later, without turning the operational graph into a massive repository of low-audit-value comparisons.

7. Discussion

Interpreting OntoDup requires reading matching outcomes together with the policy-controlled transition from the governance layer to the operational view. On DBLP-ACM and DBLP-Scholar, precision and recall measured on materialized owl:sameAs links vary with the proportion of assertions that remain in ontodup:StatusProposed and with the selective materialization policy adopted as candidate volume grows, as reflected by the operational and governance summaries (Table 4, Table 5 and Table 8). Evaluation is therefore shaped not by a single threshold alone, but by a lifecycle that preserves evidence, accommodates incompatible assertions, and supports auditable correction through queryable status transitions. The discussion that follows focuses on the operational implications observed under end-to-end conditions, on how the review simulation should be interpreted when reasoning about workload and cost, and on what the two-layer design enables beyond pipelines that directly insert links.
Three findings emerge from this reading. First, OntoDup shifts evaluation from pairwise matcher outputs to the level of operationally consumable identity links, which is more appropriate for governed scholarly KG integration. Second, under end-to-end conditions, part of the observed loss in operational recall reflects policy-controlled retention of uncertain cases outside materialization rather than indiscriminate false-link insertion. Third, the framework makes review workload and inference cost visible alongside precision, recall and F1, allowing matching strategies to be compared not only by predictive effectiveness but also by their operational consequences.
Accordingly, the governance metrics should be read as operational indicators rather than as alternatives to classifier-quality measures. They characterize how outcomes are distributed across the governed decision process: which assertions are automatically accepted, which remain proposed for review, which are rejected, and which become materialized in the operational graph under a given policy. In this way, they complement the traditional precision–recall view by quantifying review burden, policy selectivity and operational exposure.
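A minimal Python sketch of this score-to-status triage follows; the τ autoaccept value reproduces the DBLP-ACM setting from Table 8, while τ proposed is a hypothetical placeholder, and human validation/rejection transitions are omitted:

```python
def triage(score, tau_proposed, tau_autoaccept):
    """Map a matcher score to a governance status (simplified policy)."""
    if score >= tau_autoaccept:
        return "StatusAutoAccepted"  # promoted to the operational view
    if score >= tau_proposed:
        return "StatusProposed"      # retained in the auditable review queue
    return "NotPersisted"            # below triage persistence

TAU_AUTOACCEPT = 0.9810  # DBLP-ACM end-to-end setting (Table 8)
TAU_PROPOSED = 0.80      # hypothetical; the paper tunes this per benchmark

for s in (0.99, 0.90, 0.50):
    print(s, triage(s, TAU_PROPOSED, TAU_AUTOACCEPT))
```

Human actions then override these automatic outcomes as further status transitions rather than by deleting assertions.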

7.1. Operational Metrics as a Function of the Two-Layer Transition

A joint reading of the experiments suggests that, when matching decisions are represented as governable assertions rather than as direct links, operational metrics synthesize the behavior of the full flow and not only the discriminative capacity of the comparator. In the pre-blocked regime, where the candidate set is fixed by the benchmark, the operational view more faithfully approximates the quality–coverage balance because it is not conditioned by blocking variability; this is observed on DBLP-ACM and DBLP-Scholar with Ditto, where operational precision remains high (0.9853 and 0.9701), and F 1 stays competitive (Table 4). In contrast, under the full end-to-end flow, operational quality becomes explicitly conditioned by the interaction between blocking coverage, triage thresholds and the selective materialization policy so that a decrease in F 1 can remain coherent with a conservative policy aimed at promoting only high-confidence decisions to G o p (Table 8).
Within this regime, τ proposed and τ autoaccept act as control parameters for the transition between the governance layer and the operational view.
  • On DBLP-ACM, with  τ autoaccept = 0.9810 the flow materializes 2120 operational links and obtains Rec. = 0.93063 against the human positive anchor, with FP = 54 and Prec. = 0.97453 (Table 8).
  • On DBLP-Scholar, the target π * = 0.99 is infeasible under end-to-end constraints and is relaxed to the best achievable value; the resulting configuration sustains high precision (Prec. = 0.98603 ; FP = 41 ) but reduces operational coverage (Rec. = 0.54123 ), which drags F 1 down to 0.69886 and concentrates error in FN = 2453 (Table 8).
This evidence supports an interpretation in which the F 1 loss is driven primarily by non-materialized coverage rather than by an increase in operational false positives, given that FP remains low on both benchmarks.
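These figures follow from elementary bookkeeping over the materialized links. The sketch below recomputes the DBLP-ACM operational metrics from the counts in Table 8 and Table 9, deriving TP as materialized links minus FP:

```python
def operational_metrics(materialized_links, fp, fn):
    """Precision/recall/F1 over consumable links, given operational FP/FN."""
    tp = materialized_links - fp
    prec = tp / materialized_links
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return tp, prec, rec, f1

# DBLP-ACM end-to-end: 2120 materialized links, FP = 54, FN = 154
tp, prec, rec, f1 = operational_metrics(2120, 54, 154)
print(tp, round(prec, 5), round(rec, 5))  # 2066 0.97453 0.93063
```

The same identities apply to the DBLP-Scholar configuration, where the large FN term dominates the F1 loss while precision stays high.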

7.2. Coverage Decomposition Under Selective Materialization

The decomposition of FN in the end-to-end flow makes the two-layer semantics explicit and enables interpreting the coverage gap unambiguously.
  • On DBLP-Scholar, 1766 positives are retained in ontodup:StatusProposed as part of the auditable queue, and 687 positives fall outside triage persistence under the selective materialization policy; the two quantities sum exactly to the 2453 FN of the operational evaluation (Table 8 and Table 9).
  • On DBLP-ACM, the same phenomenon appears at a significantly smaller scale (141 positives in ontodup:StatusProposed and 13 outside persistence, totaling FN = 154 ), which remains consistent with maintaining a high F 1 in the operational view (Table 9). In this frame, ontodup:StatusProposed is interpreted as traceable audit workload rather than as operational error, whereas ontodup:StatusAutoAccepted delineates the consumable subset effectively promoted to G o p .
A selective stance toward negative persistence further clarifies how operational scalability and audit signal are balanced in the end-to-end flow. Under a large candidate set, recording ontodup:notSameAs for all pairs with s < τ proposed would yield a dominant volume of negative facts with limited audit value; persistence is therefore concentrated on (i) auto-accepted links, which are the only assertions promoted to G o p , and (ii) proposals, which expose the auditable workload under budget b explicitly. The trade-off is delimited by the Missing Positives category (Table 9), which captures the coverage component that is not preserved as persistent triage knowledge under this policy.
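The decomposition can be expressed as a small bookkeeping routine. The pairs and statuses below are illustrative toy data, while the closing assertions restate the exact sums from Table 8 and Table 9:

```python
from collections import Counter

def missed_positive_breakdown(positive_pairs, status_of):
    """Group gold positives that never reach the operational view by
    governance outcome (status names follow OntoDup; data is toy-sized)."""
    breakdown = Counter()
    for pair in positive_pairs:
        status = status_of.get(pair, "NotPersisted")
        if status != "StatusAutoAccepted":  # only these links reach G_op
            breakdown[status] += 1
    return breakdown

positives = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
status_of = {("a1", "b1"): "StatusAutoAccepted",
             ("a2", "b2"): "StatusProposed",
             ("a3", "b3"): "StatusProposed"}  # ("a4", "b4") never persisted
print(missed_positive_breakdown(positives, status_of))

# The identity FN = proposed + non-persisted, as reported:
assert 1766 + 687 == 2453  # DBLP-Scholar
assert 141 + 13 == 154     # DBLP-ACM
```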

7.3. Governance Workload and Cost as Operational Constraints

In the pre-blocked regime, the governance layer adds a complementary comparison axis to operational quality by quantifying review workload and positive loss due to non-materialization. On DBLP-Scholar, Ditto leaves 48 pairs in ontodup:StatusProposed and 67 positives non-materialized due to negative decisions, whereas gpt-4o leaves 524 proposals and retains 91 positives in ontodup:StatusProposed, shifting part of the coverage into the auditable queue (Table 5). These profiles support an operational reading in which a method may sustain competitive F 1 on the test set while inducing markedly different audit workloads or concentrating positive losses in Rejected, which affects practicality when human review is limited.
The cost dimension follows the same operational logic when it is tied to the assertion as the semantic unit and summarized through aggregations on the graph. In Experiment A, average consumption per pair remains stable per benchmark due to fixed attribute serialization, while observed average latencies range from 0.8570–1.1699 s on DBLP-ACM and 1.0806–1.2092 s on DBLP-Scholar for the evaluated models (Table 6). These measurements are not intended to extrapolate to other environments or configurations, but they anchor feasibility considerations as the pair volume grows, particularly in end-to-end settings where | C | increases markedly (Table 7).
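Because cost metadata hangs off each assertion, the per-benchmark averages can be reproduced by a simple aggregation. The sketch below uses dictionary field names that mirror the ontodup properties, with fabricated example values:

```python
def summarize_cost(assertions):
    """Average token and latency consumption over a batch of assertions."""
    n = len(assertions)
    total_tokens = sum(a.get("llmTotalTokens", 0) for a in assertions)
    total_latency = sum(a.get("llmLatencySeconds", 0.0) for a in assertions)
    return {"pairs": n,
            "avg_tokens": total_tokens / n,
            "avg_latency_s": total_latency / n}

batch = [{"llmTotalTokens": 410, "llmLatencySeconds": 0.91},
         {"llmTotalTokens": 398, "llmLatencySeconds": 1.10},
         {"llmTotalTokens": 402, "llmLatencySeconds": 0.86}]
print(summarize_cost(batch))
```

In the graph itself, the equivalent computation is a SPARQL aggregation (e.g., AVG over ontodup:llmTotalTokens and ontodup:llmLatencySeconds grouped by method), which keeps the summary reproducible against the assertion store.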

7.4. Auditing and Correcting Consumable Links

Auditing false positives benefits directly from the constraint imposed by materialization because the inspection universe is bounded to auto-accepted owl:sameAs links that are already consumable in the operational view. On DBLP-ACM and DBLP-Scholar, this subset corresponds to 54 and 41 links, respectively (Table 8), and it can be retrieved through SPARQL patterns that return the pair (leftRecord/rightRecord), matchScore, status and compact evidence, as illustrated in Table 10. When a false positive is confirmed, the correction is recorded as an explicit governance action via ontodup:StatusHumanRejected in a correction graph, preserving traceability of the original event and enabling closure rules to exclude the link from operational materialization without relying on destructive retractions.
The accumulated effect is observed in deduplicated-entity counts. On DBLP-ACM, inferred records move from 2660 to 2690; on DBLP-Scholar, from 61,572 to 61,604 (Table 11). The variation in inferred count does not necessarily match the number of human rejections (54 and 41) because the impact of rejecting a link depends on the connectivity induced by owl:sameAs closure: Removing an edge may leave a component connected via alternative paths or split it into multiple components, affecting the resulting entity count non-linearly.

7.5. Practical Robustness for Curation and Operational Decision-Making

Beyond the operational-quality readings reported above, the evidence on DBLP-ACM and DBLP-Scholar highlights a robustness gap between pipelines that export a final link set from scores and a design that treats each matching outcome as a governable assertion with an explicit lifecycle. In the former, a fixed threshold can reproduce link selection, but the history, provenance and evidence explaining why a link is or is not consumable typically remain external. In OntoDup, by contrast, incompatible assertions remain first-class objects in the governance layer, while the operational view remains policy-defined and can be queried as the consumable state.
Illustrative example. Two publications harvested from distinct bibliographic sources share a near-identical title and publication year but differ in author-string formatting or venue representation. In a score-to-link workflow, a sufficiently high score may lead directly to the export of an owl:sameAs link, even if the case remains ambiguous and later requires correction. Under OntoDup, the same outcome can instead be persisted as a governed assertion in StatusProposed, together with its score, provenance and compact evidence, and only promoted to the operational graph if it satisfies the applicable acceptance policy or is validated through review. This supports curation by keeping borderline cases visible, auditing by preserving the basis of the decision, and data management by preventing uncertain links from propagating immediately into downstream search, profiling or analytics services.
This has direct implications for review and curation. Multiple agents may assert owl:sameAs or its negation over the same pair, and human actions can be recorded as additional status transitions; materialization rules then project only the materializable subset to G o p . Reproducing the same behavior with a score table would require an external event-sourcing layer synchronized with downstream consumers, whereas the graph-based formulation keeps the lineage attached to the same semantic unit later queried for audit, prioritization, and reconciliation.
The same two-layer design also makes policy adaptation and repository querying more reliable. When τ proposed , τ autoaccept , or the review budget b change, the operational view can be re-projected from the existing assertion store rather than reconstructed post hoc from disconnected exports, preserving comparability of what was proposed, what became consumable and what remained outside persistence under selective materialization (Table 8 and Table 9). Governance and publication decisions can also be supported by direct SPARQL queries over a single integrated state rather than joins across score tables, logs, and ad hoc review files. Queries can retrieve current consumable owl:sameAs links together with leftRecord/rightRecord, matchScore, status and compact evidence (Table 10), and can support governance-oriented slicing such as identifying where incompatible assertions accumulate or how cost and latency aggregate by status and outcome (Table 5 and Table 6). Equivalent reports can be assembled outside the graph, but OntoDup makes the audit and control surface part of the same knowledge structure that supports operational consumption.
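A sketch of such a re-projection follows, under the simplifying assumption that consumability can be re-derived from stored scores and statuses alone (conflict exclusion and notSameAs propagation are omitted):

```python
def reproject_operational_view(assertions, tau_autoaccept):
    """Re-derive the consumable owl:sameAs link set from stored
    assertions under a new auto-accept threshold (simplified)."""
    return {(a["left"], a["right"])
            for a in assertions
            if a["predicate"] == "owl:sameAs"
            and (a["status"] == "StatusHumanValidated"
                 or a["score"] >= tau_autoaccept)}

assertions = [
    {"left": "r1", "right": "r2", "predicate": "owl:sameAs",
     "score": 0.99, "status": "StatusProposed"},
    {"left": "r3", "right": "r4", "predicate": "owl:sameAs",
     "score": 0.85, "status": "StatusProposed"},
    {"left": "r5", "right": "r6", "predicate": "owl:sameAs",
     "score": 0.40, "status": "StatusHumanValidated"},
]
# Tightening or relaxing the threshold re-projects the view without
# re-running matchers or blocking:
print(sorted(reproject_operational_view(assertions, 0.95)))
# [('r1', 'r2'), ('r5', 'r6')]
print(sorted(reproject_operational_view(assertions, 0.80)))
# [('r1', 'r2'), ('r3', 'r4'), ('r5', 'r6')]
```

Note that the human-validated pair remains consumable regardless of the threshold, reflecting that curator decisions override score-based policy.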

7.6. Limitations, Threats to Validity and Implications for Reuse

These observations delineate the main limitations of the present study and the conditions under which its results should be reused. The evaluation is conducted on DBLP-ACM and DBLP-Scholar, which are established bibliographic benchmarks and therefore useful for controlled comparison, but they remain relatively clean and static academic datasets and do not fully capture the heterogeneity, metadata degradation, sample size and scale of production scholarly knowledge graphs. In particular, the experiments do not include OCR-heavy noise, continuously evolving source collections, or million-scale operational deployments, so the reported results should be interpreted as evidence for the feasibility of the governance model under controlled bibliographic conditions rather than as a final validation under full production complexity.
A second threat to validity concerns the review reference used in Experiment B. The validation anchor is derived from benchmark labels and provides a deterministic reference for end-to-end bookkeeping. As a result, it does not reflect sources of variability commonly present in practical curation workflows, including disagreement, inconsistent criteria, latency and occasional review errors. Therefore, the ontodup:StatusProposed queue should be interpreted primarily as a workload proxy under this reference anchor, and any workload-reduction reading should be treated as optimistic.
Transferability is also limited by calibration dependence. Thresholds and budgets are tuned on benchmark partitions and on score distributions conditioned by blocking, so transferring them to other collections requires recalibration, especially in higher-noise scenarios where strict precision goals may be infeasible and must be explicitly relaxed to remain coherent with operational constraints. More generally, the governance representation is stable across instantiations, but matcher choice, blocking behavior and decision policy remain context-sensitive.
Finally, the present study quantifies token and latency costs for LLM-based matching, but it does not yet explore cost-reduction strategies such as prompt optimization, caching, selective invocation or model cascading. Future work will therefore focus on cost-aware policy calibration and incremental correction cycles under distribution shift.

8. Conclusions and Future Work

This paper framed scholarly KG deduplication as a governance problem rather than as a one-shot prediction task. OntoDup represents matching outcomes as governed assertions with status, evidence and provenance, and separates this governance layer from the operational graph in which only policy-approved identity links become consumable. The paper therefore contributes not a new standalone matcher, but a governance-aware methodology for integrating heterogeneous matching components into an auditable and operationally controlled scholarly KG workflow. More broadly, the paper shows that in scholarly KG deduplication, the quality of identity resolution cannot be characterized fully by pairwise matching performance alone; it must also be evaluated in terms of governance state, materialization policy, review burden and operational exposure.
The experiments on DBLP-ACM and DBLP-Scholar show that this perspective changes how deduplication should be evaluated. Beyond pairwise precision, recall, and F1, OntoDup makes it possible to assess operational link quality together with review burden, policy selectivity, and inference cost. The results also show that selective materialization can preserve a stable consumable view while retaining uncertain, conflicting, or corrected cases in the governance layer for later audit and revision.
Future work should extend the evaluation beyond controlled bibliographic benchmarks and publication-level deduplication to larger, noisier, and more heterogeneous scholarly graph settings. Additional work is also needed on calibration under distribution shift and on cost-aware orchestration strategies for LLM-based matching, including selective invocation, caching, and cascading policies.

Author Contributions

Conceptualization, J.G.-M., M.L.-N. and J.P.S.-G.; Formal analysis, J.G.-M. and D.P.-S.; Writing—original draft preparation, J.G.-M., L.F.G.-V. and D.P.-S.; Writing—review and editing, J.G.-M., M.L.-N. and L.F.G.-V.; Visualization, J.G.-M. All authors have read and agreed to the published version of the manuscript.

Funding

The authors from the University of Vigo were supported by the Xunta de Galicia grant GPC-ED431B 2024/26 for the consolidation and structuring of competitive research units.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code supporting the findings of this study is available at https://github.com/jgalanm07/ontodup/ (accessed on 26 February 2026). The experimental result datasets have been deposited in Zenodo under the reserved DOI https://doi.org/10.5281/zenodo.18972541.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
ABox: Assertional Box
BIBO: Bibliographic Ontology
D-A: DBLP-ACM (benchmark dataset)
D-S: DBLP-Scholar (benchmark dataset)
DBLP: Digital Bibliography & Library Project
DRD: Duplicate Record Detection
EM: Entity Matching
FN: False Negatives
FP: False Positives
F1: F1-score
KG: Knowledge Graph
IRI: Internationalized Resource Identifier
LLM: Large Language Model
MAKG: Microsoft Academic Knowledge Graph
OCR: Optical Character Recognition
OWL: Web Ontology Language
PIE: PIE rules/constraints (as used by the paper's inference engine)
Prec: Precision
PROV-O: PROV Ontology
RDF: Resource Description Framework
Rec: Recall
SKOS: Simple Knowledge Organization System
SPARQL: SPARQL query language
TBox: Terminological Box
TP: True Positives
VIVO: VIVO ontology/infrastructure
XSD: Extensible Markup Language Schema Datatypes

Appendix A. Formal PIE Rules for Operational Materialization

Table A1 summarizes the PIE rules for materialization, conflict detection and constraint closure in OntoDup. As explained in Section 4.3, these rules serve to promote to the operational layer only the governed decisions that satisfy an acceptance policy, and to close the resulting relations to ensure the closure of owl:sameAs (symmetry and transitivity) and the propagation of negative constraints (ontodup:notSameAs) through equivalences.
We use triple notation (x, p, y) and write each rule as a conjunctive antecedent implying a consequent. Variables denote RDF resources, and constraints of the form s ≠ o correspond to PIE constraints used to avoid trivial reflexivity or degenerate matches.
Table A1. PIE rules for materialization, conflict detection and constraint closure in OntoDup.
Rule | Formal Form (Antecedent ⇒ Consequent)
ontodupApplyAutoAcceptedSameAs: (ma, rdf:type, ontodup:MatchAssertion) ∧ (ma, rdf:predicate, owl:sameAs) ∧ (ma, rdf:subject, s) ∧ (ma, ontodup:status, ontodup:StatusAutoAccepted) ∧ (ma, rdf:object, o) ∧ (s ≠ o) ⇒ (s, owl:sameAs, o)
ontodupApplyHumanValidatedSameAs: (ma, rdf:type, ontodup:MatchAssertion) ∧ (ma, rdf:predicate, owl:sameAs) ∧ (ma, rdf:subject, s) ∧ (ma, ontodup:status, ontodup:StatusHumanValidated) ∧ (ma, rdf:object, o) ∧ (s ≠ o) ⇒ (s, owl:sameAs, o)
ontodupApplyAutoacceptedNotSameAs: (ma, rdf:type, ontodup:MatchAssertion) ∧ (ma, rdf:predicate, ontodup:notSameAs) ∧ (ma, rdf:subject, s) ∧ (ma, ontodup:status, ontodup:StatusAutoAccepted) ∧ (ma, rdf:object, o) ∧ (s ≠ o) ⇒ (s, ontodup:notSameAs, o)
ontodupApplyHumanValidatedNotSameAs: (ma, rdf:type, ontodup:MatchAssertion) ∧ (ma, rdf:predicate, ontodup:notSameAs) ∧ (ma, rdf:subject, s) ∧ (ma, ontodup:status, ontodup:StatusHumanValidated) ∧ (ma, rdf:object, o) ∧ (s ≠ o) ⇒ (s, ontodup:notSameAs, o)
ontodupDetectConflictFromAssertions: (ma1, rdf:type, ontodup:MatchAssertion) ∧ (ma1, rdf:predicate, owl:sameAs) ∧ (ma1, rdf:subject, s) ∧ (ma1, rdf:object, o) ∧ (ma2, rdf:subject, s) ∧ (ma2, rdf:type, ontodup:MatchAssertion) ∧ (ma2, rdf:predicate, ontodup:notSameAs) ∧ (ma2, rdf:object, o) ⇒ (s, ontodup:conflictWith, o)
ontodupDetectConflictFromAssertionsSwapped: (ma1, rdf:type, ontodup:MatchAssertion) ∧ (ma1, rdf:predicate, owl:sameAs) ∧ (ma1, rdf:subject, s) ∧ (ma1, rdf:object, o) ∧ (ma2, rdf:subject, o) ∧ (ma2, rdf:type, ontodup:MatchAssertion) ∧ (ma2, rdf:predicate, ontodup:notSameAs) ∧ (ma2, rdf:object, s) ⇒ (s, ontodup:conflictWith, o)
ontodupMarkConflictingAssertionsSameAs: (ma, rdf:type, ontodup:MatchAssertion) ∧ (ma, rdf:subject, s) ∧ (ma, rdf:object, o) ∧ (ma, rdf:predicate, owl:sameAs) ∧ (s, ontodup:conflictWith, o) ⇒ (ma, ontodup:status, ontodup:StatusConflict)
ontodupMarkConflictingAssertionsNotSameAs: (ma, rdf:type, ontodup:MatchAssertion) ∧ (ma, rdf:subject, s) ∧ (ma, rdf:object, o) ∧ (ma, rdf:predicate, ontodup:notSameAs) ∧ (s, ontodup:conflictWith, o) ⇒ (ma, ontodup:status, ontodup:StatusConflict)
ontodupNotSameAsSymmetric: (s1, ontodup:notSameAs, s2) ⇒ (s2, ontodup:notSameAs, s1)
ontodupCloseNotSameAsThroughSameAsLeft: (s1, owl:sameAs, s2) ∧ (s2, ontodup:notSameAs, s3) ⇒ (s1, ontodup:notSameAs, s3)
ontodupCloseNotSameAsThroughSameAsRight: (s1, ontodup:notSameAs, s2) ∧ (s2, owl:sameAs, s3) ⇒ (s1, ontodup:notSameAs, s3)
ontodupConflictWithSymmetric: (s1, ontodup:conflictWith, s2) ∧ (s1 ≠ s2) ⇒ (s2, ontodup:conflictWith, s1)

Appendix B. SPARQL Query Templates for Audit and Operational Inspection

The following listings illustrate minimal and reproducible query patterns that operate on the integrated repository state, without hard-coding deployment-specific graph names:
  • Listing A1 retrieves the audit queue (proposed assertions) together with optional LLM cost metadata and method attribution.
  • Listing A2 enumerates conflict-flagged pairs and the assertions involved.
  • Listing A3 returns consumable owl:sameAs links filtered by the absence of contradictions between entities under incompatible assertions (ontodup:conflictWith).
  • Listing A4 provides end-to-end traceability for each consumable link by retrieving the backing ontodup:MatchAssertion along with evidence and cost metadata.
Listing A1. Audit queue—proposed assertions with optional LLM cost metadata.
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX ontodup: <https://w3id.org/ontodup/ontology#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX prov:    <http://www.w3.org/ns/prov#>

SELECT DISTINCT ?ma ?s ?p ?o ?algo ?tokens ?lat
WHERE {
 GRAPH ?g {
  ?ma a ontodup:MatchAssertion ;
     rdf:subject ?s ;
     rdf:predicate ?p ;
     rdf:object ?o ;
     ontodup:status ontodup:StatusProposed ;
     prov:wasAttributedTo ?algo .
  OPTIONAL { ?ma ontodup:llmTotalTokens ?tokens_raw . }
  OPTIONAL { ?ma ontodup:llmLatencySeconds ?lat_raw . }
 }
 BIND(COALESCE(xsd:integer(?tokens_raw), 0) AS ?tokens)
 BIND(COALESCE(xsd:decimal(?lat_raw), 0.0)  AS ?lat)
}
ORDER BY DESC(?tokens) DESC(?lat)
Listing A2. Conflicts—pairs flagged with conflictWith and the involved assertions.
PREFIX ontodup: <https://w3id.org/ontodup/ontology#>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?a ?b ?ma ?s ?o ?pred ?status
WHERE {
 GRAPH ?gC { ?a ontodup:conflictWith ?b . }
 GRAPH ?gA {
  ?ma a ontodup:MatchAssertion ;
     rdf:subject ?s ;
     rdf:predicate ?pred ;
     rdf:object ?o ;
     ontodup:status ?status .
 }

 # Match assertions that refer to the same conflicting pair in either direction
 FILTER( (?s = ?a && ?o = ?b) || (?s = ?b && ?o = ?a) )
}
ORDER BY ?a ?b ?ma
Listing A3. Conflict-free operational consumption—owl:sameAs links not flagged by conflicts.
PREFIX ontodup: <https://w3id.org/ontodup/ontology#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?s ?o
WHERE {
 GRAPH ?g {
  ?s owl:sameAs ?o .
 }
 FILTER(?s != ?o)
 FILTER(STR(?s) < STR(?o))

 FILTER NOT EXISTS { ?s ontodup:conflictWith ?o . }
 FILTER NOT EXISTS { ?o ontodup:conflictWith ?s . }
}
Listing A4. Traceability for a consumable link—backing assertion plus evidence and cost.
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX ontodup: <https://w3id.org/ontodup/ontology#>

SELECT DISTINCT ?s ?o ?ma ?status ?ev ?tokens ?lat
WHERE {
 GRAPH ?gOp { ?s owl:sameAs ?o . }

 FILTER(?s != ?o)
 FILTER(STR(?s) < STR(?o))
 FILTER NOT EXISTS { ?s ontodup:conflictWith ?o . }
 FILTER NOT EXISTS { ?o ontodup:conflictWith ?s . }

 GRAPH ?gA {
  ?ma a ontodup:MatchAssertion ;
     rdf:subject ?s ;
     rdf:predicate owl:sameAs ;
     rdf:object ?o ;
     ontodup:status ?status .
  OPTIONAL { ?ma ontodup:hasEvidence ?ev . }
  OPTIONAL { ?ma ontodup:llmTotalTokens ?tokens . }
  OPTIONAL { ?ma ontodup:llmLatencySeconds ?lat . }
 }

 FILTER(?status IN (ontodup:StatusAutoAccepted, ontodup:StatusHumanValidated))
}
ORDER BY ?s ?o ?ma

Figure 1. Sample ABox instantiation based on VIVO/BIBO with OntoDup extensions for provenance anchoring and stable intra-source identifiers (rendered in Protégé v5.6 using OntoGraf v1.0.1).
Figure 2. OntoDup pipelines for Experiments A and B.
Table 1. Statistics of the training, validation and test splits for the DBLP-Scholar and DBLP-ACM benchmarks. #Pos means “number of positive pairs”; #Neg means “number of negative pairs”.

| Dataset | Training #Pos | Training #Neg | Validation #Pos | Validation #Neg | Test #Pos | Test #Neg |
|---|---|---|---|---|---|---|
| DBLP-Scholar (D-S) | 4277 | 18,688 | 1070 | 4672 | 1070 | 4672 |
| DBLP-ACM (D-A) | 1776 | 8114 | 444 | 2029 | 444 | 2029 |
Table 2. Main elements of the OntoDup model.

| Elements by Type | Governance Role | Alignment |
|---|---|---|
| Classes | | |
| ontodup:ScholarlyRecord | Source-level bibliographic record; unit of deduplication. | bibo:Document (compatible with VIVO-like graphs). |
| ontodup:MatchAssertion | Reified decision (subject–predicate–object) to attach state and evidence. | rdf:Statement. |
| ontodup:Evidence | Traceable container for evidence associated with an assertion. | Extensible; no fine-grained typing in the core. |
| Individuals | | |
| ontodup:AssertionStatusScheme | Controlled lifecycle scheme for assertions. | skos:ConceptScheme. |
| ontodup:StatusProposed | Recorded assertion pending decision or validation. | skos:Concept. |
| ontodup:StatusAutoAccepted | Policy-accepted assertion; eligible for materialization. | skos:Concept. |
| ontodup:StatusHumanValidated | Human-confirmed assertion; eligible for materialization. | skos:Concept. |
| ontodup:StatusHumanRejected | Human-rejected assertion; never materialized; overrides auto-acceptance. | skos:Concept. |
| ontodup:StatusConflict | Assertion flagged due to contradiction; requires inspection. | skos:Concept. |
| Object properties | | |
| ontodup:status | Assigns a governance state to an assertion. | Range restricted to ontodup:AssertionStatus. |
| ontodup:hasEvidence | Links evidence item(s) to an assertion. | Range: ontodup:Evidence. |
| ontodup:notSameAs | Negative relation for blocking and constraint closure. | Symmetric, irreflexive. |
| ontodup:conflictWith | Marks contradiction between entities under incompatible assertions. | Symmetric, irreflexive. |
| Datatype properties | | |
| ontodup:llmTotalTokens | Operational footprint of the LLM judgment per assertion. | Experimental metadata (xsd:integer). |
| ontodup:llmLatencySeconds | End-to-end LLM latency per assertion. | Experimental metadata (xsd:decimal). |
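Following the rdf:Statement alignment in Table 2, a governed match decision can be sketched as a set of reification triples. The snippet below is purely illustrative: the namespaces, the example IRIs, and the `matchScore` property are hypothetical stand-ins, not the IRIs actually minted by OntoDup.

```python
# Illustrative sketch of reifying one governed match decision as an
# ontodup:MatchAssertion. All IRIs below are hypothetical examples.

ONTODUP = "http://example.org/ontodup#"  # assumed namespace, not OntoDup's real one
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

def reify_match(assertion_iri, left, right, score, status, tokens, latency):
    """Return the triples of one match assertion as (subject, predicate, object) tuples."""
    return [
        (assertion_iri, RDF + "type", ONTODUP + "MatchAssertion"),
        # rdf:Statement-style reification of the candidate identity link
        (assertion_iri, RDF + "subject", left),
        (assertion_iri, RDF + "predicate", OWL + "sameAs"),
        (assertion_iri, RDF + "object", right),
        # governance state plus per-assertion operational metadata (Table 2)
        (assertion_iri, ONTODUP + "status", ONTODUP + "Status" + status),
        (assertion_iri, ONTODUP + "matchScore", str(score)),       # hypothetical property
        (assertion_iri, ONTODUP + "llmTotalTokens", str(tokens)),
        (assertion_iri, ONTODUP + "llmLatencySeconds", str(latency)),
    ]

triples = reify_match("http://example.org/assertion/1",
                      "http://example.org/dblp/2123",
                      "http://example.org/acm/0",
                      1.0, "AutoAccepted", 330, 0.86)
```

Because the link is reified rather than asserted directly, consumers only see owl:sameAs after the materialization policy promotes an eligible assertion.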
Table 3. Declarative artifacts for source provenance in OntoDup.

| Role and Artifact | Type | Key Content |
|---|---|---|
| OntoDup core ontology (ontodup-core.vivo.ttl) | TBox | Defines ontodup:ScholarlyRecord, ontodup:MatchAssertion, the status lifecycle, and properties such as ontodup:localId (used as a stable intra-source reference) and governance links. |
| Provenance vocabulary (ontodup-vocab.vivo.ttl) | ABox | Declares c4o:BibliographicInformationSource instances for DBLP/ACM/Scholar with stable IRIs that are referenced from dcterms:source. |
Table 4. Experiment A: operational quality on materialized (AutoAccepted) links in the test split.

DBLP-ACM (D-A)

| Model | N_op | P_test | TP | FP | FN | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| Ditto | 409 | 444 | 403 | 6 | 41 | 0.9853 | 0.9077 | 0.9449 |
| gpt-4o-mini | 446 | 444 | 431 | 15 | 13 | 0.9664 | 0.9707 | 0.9685 |
| gpt-4o | 460 | 444 | 443 | 17 | 1 | 0.9630 | 0.9977 | 0.9801 |
| gpt-4.1-nano | 486 | 444 | 429 | 57 | 15 | 0.8827 | 0.9662 | 0.9226 |
| gpt-4.1-mini | 475 | 444 | 442 | 33 | 2 | 0.9305 | 0.9955 | 0.9619 |
| gpt-4.1 | 465 | 444 | 444 | 21 | 0 | 0.9548 | 1.0000 | 0.9769 |

DBLP-Scholar (D-S)

| Model | N_op | P_test | TP | FP | FN | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| Ditto | 1003 | 1070 | 973 | 30 | 97 | 0.9701 | 0.9093 | 0.9387 |
| gpt-4o-mini | 816 | 1070 | 762 | 54 | 308 | 0.9338 | 0.7121 | 0.8081 |
| gpt-4o | 1051 | 1070 | 978 | 73 | 92 | 0.9305 | 0.9140 | 0.9222 |
| gpt-4.1-nano | 1299 | 1070 | 999 | 300 | 71 | 0.7691 | 0.9336 | 0.8434 |
| gpt-4.1-mini | 1175 | 1070 | 1002 | 173 | 68 | 0.8528 | 0.9364 | 0.8927 |
| gpt-4.1 | 1111 | 1070 | 982 | 129 | 88 | 0.8839 | 0.9178 | 0.9005 |
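The operational metrics in Table 4 follow the standard definitions over counts of materialized links versus test-split positives (N_op = TP + FP, P_test = TP + FN). A minimal sketch, using the counts from the Ditto row on DBLP-ACM; in OntoDup these counts come from SPARQL aggregations over the graph, not from application code:

```python
def operational_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision/recall/F1 over materialized links vs. test-split positives."""
    precision = tp / (tp + fp)        # share of materialized links that are correct
    recall = tp / (tp + fn)           # share of test positives that were materialized
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return precision, recall, f1

# Ditto on DBLP-ACM (Table 4): N_op = 409, P_test = 444
prec, rec, f1 = operational_metrics(tp=403, fp=6, fn=41)
# rounded to four decimals: 0.9853, 0.9077, 0.9449
```

The same function reproduces every row of the table from its TP/FP/FN counts.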
Table 5. Experiment A: assertion distribution by governance state and non-materialized positives in the test split.

DBLP-ACM (D-A)

| Model | AutoAcc | Prop | Rejected | P_test^prop | P_test^nm |
|---|---|---|---|---|---|
| Ditto | 409 | 41 | 2023 | 34 | 7 |
| gpt-4o-mini | 446 | 10 | 2017 | 7 | 6 |
| gpt-4o | 460 | 109 | 1904 | 1 | 0 |
| gpt-4.1-nano | 486 | 0 | 1987 | 0 | 15 |
| gpt-4.1-mini | 475 | 3 | 1995 | 1 | 1 |
| gpt-4.1 | 465 | 82 | 1926 | 0 | 0 |

DBLP-Scholar (D-S)

| Model | AutoAcc | Prop | Rejected | P_test^prop | P_test^nm |
|---|---|---|---|---|---|
| Ditto | 1003 | 48 | 4691 | 30 | 67 |
| gpt-4o-mini | 816 | 185 | 4741 | 153 | 155 |
| gpt-4o | 1051 | 524 | 4167 | 91 | 1 |
| gpt-4.1-nano | 1299 | 9 | 4434 | 5 | 66 |
| gpt-4.1-mini | 1175 | 78 | 4489 | 29 | 39 |
| gpt-4.1 | 1111 | 461 | 4170 | 84 | 4 |
Table 6. Experiment A: inference cost for LLMs (tokens and latency) obtained via SPARQL aggregations over per-assertion metadata.

DBLP-ACM (D-A)

| Model | n | InTok_avg | OutTok_avg | TotTok_avg | Lat_avg | TotTok_sum | Lat_sum |
|---|---|---|---|---|---|---|---|
| gpt-4o-mini | 2473 | 315.99 | 14.00 | 329.99 | 0.8570 | 816,069 | 2119.35 |
| gpt-4o | 2473 | 315.99 | 14.00 | 329.99 | 1.1699 | 816,069 | 2893.24 |
| gpt-4.1-nano | 2473 | 315.99 | 14.02 | 330.01 | 1.0937 | 816,117 | 2704.79 |
| gpt-4.1-mini | 2473 | 315.99 | 14.00 | 329.99 | 1.0963 | 816,069 | 2711.20 |
| gpt-4.1 | 2473 | 315.99 | 16.00 | 331.99 | 1.1579 | 821,023 | 2863.53 |

DBLP-Scholar (D-S)

| Model | n | InTok_avg | OutTok_avg | TotTok_avg | Lat_avg | TotTok_sum | Lat_sum |
|---|---|---|---|---|---|---|---|
| gpt-4o-mini | 5742 | 289.56 | 14.00 | 303.56 | 1.1489 | 1,743,021 | 6597.15 |
| gpt-4o | 5742 | 289.56 | 14.00 | 303.56 | 1.2092 | 1,743,021 | 6942.97 |
| gpt-4.1-nano | 5742 | 289.56 | 14.02 | 303.57 | 1.0806 | 1,743,117 | 6204.67 |
| gpt-4.1-mini | 5742 | 289.56 | 14.00 | 303.56 | 1.0923 | 1,743,021 | 6271.86 |
| gpt-4.1 | 5742 | 289.56 | 15.99 | 305.54 | 1.1158 | 1,754,439 | 6407.18 |
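The averages and sums in Table 6 are SPARQL AVG/SUM aggregations over the ontodup:llmTotalTokens and ontodup:llmLatencySeconds values attached to each assertion. The same aggregation can be sketched in plain Python over a toy list of per-assertion records (the field names below are illustrative, not the ontology property names):

```python
from statistics import mean

# Toy per-assertion metadata; in OntoDup these values live on MatchAssertion
# nodes and are aggregated with SPARQL AVG/SUM rather than in application code.
assertions = [
    {"total_tokens": 330, "latency_s": 0.81},
    {"total_tokens": 329, "latency_s": 0.92},
    {"total_tokens": 331, "latency_s": 0.84},
]

tot_tok_avg = mean(a["total_tokens"] for a in assertions)  # analogue of TotTok_avg
tot_tok_sum = sum(a["total_tokens"] for a in assertions)   # analogue of TotTok_sum
lat_sum = sum(a["latency_s"] for a in assertions)          # analogue of Lat_sum
```

Keeping the raw per-assertion values in the graph means any cost breakdown (per model, per governance state, per dataset) is just another aggregation query.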
Table 7. Experiment B: DeepBlocker candidate generation and selection of k under high blocking recall.

| Dataset | k | Blocking Recall | |C| |
|---|---|---|---|
| DBLP-ACM | 5 | 0.996 | 13,080 |
| DBLP-Scholar | 150 | 0.981 | 392,400 |
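Blocking recall in Table 7 is the fraction of gold positive pairs that survive into the candidate set C. A minimal sketch over toy pairs (the record IDs are hypothetical):

```python
def blocking_recall(candidates: set[tuple[str, str]],
                    gold_positives: set[tuple[str, str]]) -> float:
    """Fraction of gold matches retained in the blocker's candidate set."""
    return len(gold_positives & candidates) / len(gold_positives)

# Toy example: 3 of 4 gold pairs survive blocking
candidates = {("d1", "a1"), ("d2", "a9"), ("d3", "a3"), ("d4", "a4")}
gold = {("d1", "a1"), ("d2", "a2"), ("d3", "a3"), ("d4", "a4")}
recall = blocking_recall(candidates, gold)  # 0.75 here
```

With top-k blocking, |C| is k candidates per query-side record; the table's figures are consistent with this (13,080 / 5 = 392,400 / 150 = 2616 query records), though the exact record counts are inferred here rather than stated.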
Table 8. End-to-end operational quality against a full positive reference anchor (benchmark-derived validations) with Ditto.

| Dataset | N_op | P_ref | TP | FP | FN | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| DBLP-ACM | 2120 | 2220 | 2066 | 54 | 154 | 0.974528 | 0.930630 | 0.952070 |
| DBLP-Scholar | 2935 | 5347 | 2894 | 41 | 2453 | 0.986030 | 0.541230 | 0.698860 |
Table 9. Decomposition of non-materialized positives under the governed triage policy with Ditto.

| Dataset | Autoaccepted Links | Proposed Assertions | Proposed Positives | Missing Positives |
|---|---|---|---|---|
| DBLP-ACM | 2120 | 642 | 141 | 13 |
| DBLP-Scholar | 2935 | 2473 | 1766 | 687 |
Table 10. Example audit extract from the governance layer (SPARQL result) using Ditto as comparator.

| LeftRecord | RightRecord | MatchScore | Status | EvidenceSummary |
|---|---|---|---|---|
| DBLP/2123 | ACM/0 | 1.0000 | AutoAccepted | blocking = DeepBlocker; τ_proposed = 0.6225; τ_autoaccept = 0.9810; score = 1.0000 |
| DBLP/1470 | ACM/1 | 1.0000 | AutoAccepted | blocking = DeepBlocker; τ_proposed = 0.6225; τ_autoaccept = 0.9810; score = 1.0000 |
| DBLP/2360 | ACM/1886 | 0.9800 | Proposed | blocking = DeepBlocker; τ_proposed = 0.6225; τ_autoaccept = 0.9810; score = 0.9800 |
| DBLP/180 | ACM/1668 | 0.9799 | Proposed | blocking = DeepBlocker; τ_proposed = 0.6225; τ_autoaccept = 0.9810; score = 0.9799 |
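The audit extract reflects a two-threshold triage over the matcher score. A minimal sketch of such a policy, using the thresholds recorded in the evidence summaries; the status names mirror the governance states, and the exact boundary handling (≥ vs. >) is an assumption:

```python
def triage(score: float,
           tau_proposed: float = 0.6225,
           tau_autoaccept: float = 0.9810) -> str:
    """Map a matcher score to a governance state (thresholds from the audit extract)."""
    if score >= tau_autoaccept:
        return "AutoAccepted"  # eligible for materialization under the policy
    if score >= tau_proposed:
        return "Proposed"      # recorded and queued for human review
    return "Rejected"          # recorded but never materialized

# Consistent with the audit rows: 1.0000 -> AutoAccepted, 0.9800 -> Proposed
```

Recording the thresholds alongside each assertion is what makes rows like 0.9800 (just below τ_autoaccept) auditable after the fact.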
Table 11. Deduplication impact: base vs. inferred records with Ditto, before and after rejection overrides recorded in the governance layer.

| Dataset | Base Records | Inferred (No Rejections) | Inferred (with Rejections) | Rejection Overrides | ΔInferred |
|---|---|---|---|---|---|
| DBLP-ACM | 4910 | 2660 | 2690 | 54 | +30 |
| DBLP-Scholar | 66,879 | 61,572 | 61,604 | 41 | +32 |