OntoDup: Governance-Aware Entity Matching for Scholarly Knowledge Graph Deduplication
Abstract
1. Introduction
- A VIVO-aligned ontology extension for DRD governance, representing match assertions, evidence and decisions with explicit provenance and versioning.
- A reproducible, incremental pipeline that ingests heterogeneous bibliographic sources into RDF and supports continuous deduplication under evolving data.
- A mixed-initiative matching protocol that structures LLM and human assessments as auditable assertions and links them to evidence artifacts.
- A governance layer for conflicts and closure, enabling policy-aware detection of inconsistent assertions and supporting controlled materialization of identity decisions.
2. Related Work
2.1. Duplicate Records and Noise in Scholarly Metadata
2.2. Candidate Generation and Matching Paradigms
2.3. Scholarly Knowledge Graph Infrastructures
2.4. Relational and Graph-Based Record Linkage
2.5. Provenance, Evidence and Explainability
2.6. Governance and Lifecycle Management for Identity Resolution
2.7. Gap Addressed by OntoDup
3. Problem Definition and Datasets
3.1. Problem Definition
3.2. Bibliographic Datasets
3.3. Experimental Partitions and Class Balance
4. OntoDup Governance Model
4.1. Entities and Relations of the Model
- ontodup:ScholarlyRecord represents the source-level bibliographic record and is specialized as a bibliographic document to enable identity links across heterogeneous catalogs.
- ontodup:MatchAssertion reifies a decision about a pair of entities by explicitly representing the subject–predicate–object triple as an rdf:Statement. This reification prevents the operational link from being confused with evidence or governance state, enabling metadata attachment without polluting the graph with preliminary decisions.
- ontodup:Evidence acts as a traceable container for the evidence associated with an assertion.
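The separation between a reified assertion and an operational link can be sketched in a few lines. The following is a minimal illustration rather than the OntoDup implementation: triples are plain Python tuples, and the `ex:` identifiers and the `to_triples` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatchAssertion:
    """Reified statement about a pair of records (mirrors rdf:Statement)."""
    subject: str
    predicate: str
    obj: str
    status: str = "ontodup:StatusProposed"
    evidence: tuple = ()

def to_triples(node_id, assertion):
    """Serialize the assertion as reification triples.

    The asserted link itself (subject, predicate, obj) is deliberately NOT emitted,
    so a preliminary decision never appears as an operational triple.
    """
    triples = [
        (node_id, "rdf:type", "ontodup:MatchAssertion"),
        (node_id, "rdf:subject", assertion.subject),
        (node_id, "rdf:predicate", assertion.predicate),
        (node_id, "rdf:object", assertion.obj),
        (node_id, "ontodup:status", assertion.status),
    ]
    triples += [(node_id, "ontodup:hasEvidence", ev) for ev in assertion.evidence]
    return triples

# Hypothetical identifiers, for illustration only.
assertion = MatchAssertion("ex:dblp-2360", "owl:sameAs", "ex:acm-1886",
                           evidence=("ex:evidence-1",))
triples = to_triples("ex:assertion-1", assertion)
```

Because only statements about the assertion node are written, state and evidence can be attached or revised without the `owl:sameAs` link ever being visible to consumers of the operational graph.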
4.2. Governance Semantics: Assertions, States and Materialization
4.3. PIE Rules for Materialization and Constraint Closure
4.4. Conflict Handling Without Retraction
4.5. Provenance and Intra-Source Anchoring for Traceability
5. OntoDup Pipeline: Ingestion, Matching and Governed Materialization
1. Data sources and experimental inputs (Section 5.1). Bibliographic source records and benchmark supervision are prepared as pipeline inputs.
2. Ingestion and base graph construction (Section 5.2). Source records are normalized, mapped to RDF, and loaded into the base graph.
3. Candidate generation (Section 5.3). Potential duplicate pairs are generated over the base graph.
4. Pairwise assessment (Section 5.4). Candidate pairs are evaluated by a selected matching component.
5. Governed assertion persistence (Section 5.5). Matching outcomes are recorded as governed assertions rather than inserted directly as operational links.
6. Controlled materialization (Section 5.6). Accepted assertions are promoted to the operational graph under the active policy.
7. Repository interrogation (Section 5.7). The repository supports governance-oriented audit and operational query views.
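Steps 5 and 6 can be illustrated with a small sketch of the promotion logic. It assumes that assertions are available as `(subject, predicate, object, status)` tuples and that a human rejection on a pair blocks any accepted assertion on the same pair, as the status lifecycle describes; the `materialize` function and the `ex:` identifiers are hypothetical.

```python
ACCEPTED = {"ontodup:StatusAutoAccepted", "ontodup:StatusHumanValidated"}
REJECTED = "ontodup:StatusHumanRejected"

def materialize(assertions):
    """Return the operational links implied by a list of governed assertions.

    assertions: list of (subject, predicate, object, status) tuples.
    Proposed assertions are never promoted; a human rejection overrides acceptance.
    """
    rejected = {(s, p, o) for s, p, o, status in assertions if status == REJECTED}
    promoted = {(s, p, o) for s, p, o, status in assertions
                if status in ACCEPTED and (s, p, o) not in rejected}
    return sorted(promoted)

assertions = [
    ("ex:a", "owl:sameAs", "ex:b", "ontodup:StatusAutoAccepted"),
    ("ex:c", "owl:sameAs", "ex:d", "ontodup:StatusProposed"),
    ("ex:e", "owl:sameAs", "ex:f", "ontodup:StatusAutoAccepted"),
    ("ex:e", "owl:sameAs", "ex:f", "ontodup:StatusHumanRejected"),
]
links = materialize(assertions)
```

Here only the first pair reaches the operational view: the second is still awaiting review and the third has been overridden by a reviewer.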
5.1. Data Sources and Experimental Inputs
5.2. Ingestion and Base Graph Construction
5.3. Candidate Generation
5.4. Pairwise Assessment
5.5. Governed Assertion Persistence
5.6. Controlled Materialization
5.7. Repository Interrogation and Reproducibility Support
6. Results
6.1. Experiment A (Pre-Blocked): Operational Quality, Governance Workload and Inference Cost
- On DBLP-ACM, validation tuning yields and , reaching precision in the auto-accepted region.
- On DBLP-Scholar, the target precision is infeasible under the minimum recall constraint, and is therefore relaxed to the best attainable value, with and precision .
- For DBLP-Scholar, for instance, Ditto leaves 30 positives in Proposed and 67 in Rejected, while gpt-4o leaves 91 and 1, respectively, evidencing different trade-off profiles between non-materialization due to proposals and non-materialization due to negative decisions.
- For DBLP-ACM, we observe configurations where Proposed is mostly composed of negative cases, such as gpt-4.1 with 82 proposals and no positives among them, as well as cases where the loss of positives concentrates on negative decisions, such as gpt-4.1-nano with 15 positives in Rejected.
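The trade-off described above can be checked arithmetically. The sketch below assumes that the two columns following Rejected in the governance breakdown table count the true positives left in Proposed and in Rejected, respectively; this reading is an assumption, supported by the fact that the two counts sum to the FN reported for each model. The `fn_breakdown` helper is hypothetical.

```python
# DBLP-Scholar test rows: model -> (reported FN, positives in Proposed, positives in Rejected).
# The meaning of the last two values is an assumption (see the lead-in above).
rows = {
    "Ditto": (97, 30, 67),
    "gpt-4o-mini": (308, 153, 155),
    "gpt-4o": (92, 91, 1),
}

def fn_breakdown(pos_in_proposed, pos_in_rejected):
    """Return (total missed positives, share of them still reviewable in Proposed)."""
    total = pos_in_proposed + pos_in_rejected
    reviewable_share = pos_in_proposed / total if total else 0.0
    return total, reviewable_share

breakdown = {model: fn_breakdown(prop, rej) for model, (_, prop, rej) in rows.items()}

for model, (reported_fn, _, _) in rows.items():
    total, share = breakdown[model]
    print(f"{model}: FN={total} (reported {reported_fn}), reviewable share={share:.3f}")
```

Under this reading, about 31% of Ditto's missed positives remain recoverable through review, against roughly 99% for gpt-4o, which is the difference in profile the text points to.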
6.2. Experiment B: Governed End-to-End Matching
- On DBLP-ACM, a small k is sufficient because the records exhibit higher consistency and cleaner signals, achieving blocking recall of 0.996 with k = 5.
- On DBLP-Scholar, textual variability and incomplete metadata require expanding the retrieved neighborhood to avoid losing true positives during blocking; we therefore use k = 150, reaching blocking recall of 0.981.
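The effect of k on blocking can be illustrated with a toy top-k retrieval. This is a generic sketch and not the blocker used in the experiments: the similarity scores, record identifiers and both helper functions are hypothetical.

```python
def top_k_candidates(similarities, k):
    """Keep the k highest-scoring right-hand records for every left-hand record.

    similarities: {left_id: [(right_id, score), ...]}
    """
    candidates = set()
    for left, scored in similarities.items():
        best = sorted(scored, key=lambda item: item[1], reverse=True)[:k]
        candidates.update((left, right) for right, _ in best)
    return candidates

def blocking_recall(candidates, gold_pairs):
    """Fraction of gold duplicate pairs that survive blocking."""
    if not gold_pairs:
        return 1.0
    return len(gold_pairs & candidates) / len(gold_pairs)

# Toy data: d2's true match (a2) is only its second-best neighbor.
similarities = {
    "d1": [("a1", 0.9), ("a2", 0.4), ("a3", 0.1)],
    "d2": [("a1", 0.2), ("a2", 0.5), ("a3", 0.6)],
}
gold = {("d1", "a1"), ("d2", "a2")}

recall_k1 = blocking_recall(top_k_candidates(similarities, 1), gold)
recall_k2 = blocking_recall(top_k_candidates(similarities, 2), gold)
```

Raising k from 1 to 2 recovers the missed pair but doubles the candidate set. The same linear growth is visible in the final column of the blocking table: 13,080 = 2616 × 5 and 392,400 = 2616 × 150, which is consistent with a fixed number of query records each contributing k candidates.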
- For DBLP-ACM, where k is small and the candidate set is more controlled, we adopt a strict budget (), which forces a small review queue; under these conditions, the process selects , maintaining recall and precision within the proposed band.
- For DBLP-Scholar, the high-k blocking substantially increases candidate volume and variability; requiring the same budget would cause a strong coverage drop. We therefore allow a larger budget (), which yields , preserving recall () at the cost of lower precision in the reviewable band (), which is consistent with a setting where the purpose of Proposed is to retain plausible cases for audit and avoid premature rejection of positives.
- On DBLP-Scholar, 1766 positives remain as Proposed, and 687 positives fall outside persistent triage; both components sum exactly to the 2453 FN reported in Table 8.
- On DBLP-ACM, the same phenomenon exists but at much smaller magnitude (141 proposed and 13 missing, totaling 154 FN), which preserves a high ().
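The decomposition can be written out explicitly. As a worked illustration, the expressions below also give the recall that would be reached if every positive currently held in Proposed were validated by a reviewer; this ceiling, written Rec_max, is a hypothetical quantity introduced here and is not reported in the paper. TP values are taken from Table 8.

```latex
\[
\begin{aligned}
\text{DBLP-Scholar:}\quad & \mathrm{FN} = 1766 + 687 = 2453, \qquad \frac{1766}{2453} \approx 0.720, \qquad
  \mathrm{Rec}_{\max} = \frac{2894 + 1766}{2894 + 2453} = \frac{4660}{5347} \approx 0.872 \\
\text{DBLP-ACM:}\quad & \mathrm{FN} = 141 + 13 = 154, \qquad \frac{141}{154} \approx 0.916, \qquad
  \mathrm{Rec}_{\max} = \frac{2066 + 141}{2066 + 154} = \frac{2207}{2220} \approx 0.994
\end{aligned}
\]
```

In both datasets most missed positives sit in the review queue rather than outside persistence, so the bulk of the recall gap is, in principle, recoverable through curation.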
- On DBLP-ACM, the flow goes from 4910 base records to 2660 inferred records without rejections, and applying 54 rejection overrides increases the total to 2690.
- On DBLP-Scholar, the total goes from 66,879 base records to 61,572 inferred records without rejections, and increases to 61,604 after applying 41 rejection overrides.
7. Discussion
7.1. Operational Metrics as a Function of the Two-Layer Transition
- On DBLP-ACM, with the flow materializes 2120 operational links and obtains Rec. = 0.9306 against the human positive anchor, with FP = 54 and Prec. = 0.9745 (Table 8).
- On DBLP-Scholar, the target is infeasible under end-to-end constraints and is relaxed to the best achievable value; the resulting configuration sustains high precision (Prec. = 0.9860; FP = 41) but reduces operational coverage (Rec. = 0.5412), which drags F1 down to 0.6989 and concentrates error in FN (Table 8).
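The operational metrics follow directly from the TP, FP and FN counts in Table 8. The short check below recomputes them with the standard definitions; the `prf` helper is a hypothetical name.

```python
def prf(tp, fp, fn):
    """Standard precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# (TP, FP, FN) as reported in Table 8.
table8 = {
    "DBLP-ACM": (2066, 54, 154),
    "DBLP-Scholar": (2894, 41, 2453),
}

metrics = {name: prf(*counts) for name, counts in table8.items()}
for name, (p, r, f1) in metrics.items():
    print(f"{name}: Prec.={p:.4f} Rec.={r:.4f} F1={f1:.4f}")
```

The recomputed values agree with the table to four decimal places and make the asymmetry explicit: on DBLP-Scholar, precision stays high while recall, and therefore F1, is limited by the 2453 positives that were not materialized.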
7.2. Coverage Decomposition Under Selective Materialization
- On DBLP-ACM, the same phenomenon appears at a significantly smaller scale (141 positives in ontodup:StatusProposed and 13 outside persistence, totaling FN ), which remains consistent with maintaining a high in the operational view (Table 9). In this frame, ontodup:StatusProposed is interpreted as traceable audit workload rather than as operational error, whereas ontodup:StatusAutoAccepted delineates the consumable subset effectively promoted to .
7.3. Governance Workload and Cost as Operational Constraints
7.4. Auditing and Correcting Consumable Links
7.5. Practical Robustness for Curation and Operational Decision-Making
Illustrative example. Two publications harvested from distinct bibliographic sources share a near-identical title and publication year but differ in author-string formatting or venue representation. In a score-to-link workflow, a sufficiently high score may lead directly to the export of an owl:sameAs link, even if the case remains ambiguous and later requires correction. Under OntoDup, the same outcome can instead be persisted as a governed assertion in StatusProposed, together with its score, provenance and compact evidence, and only promoted to the operational graph if it satisfies the applicable acceptance policy or is validated through review. This supports curation by keeping borderline cases visible, auditing by preserving the basis of the decision, and data management by preventing uncertain links from propagating immediately into downstream search, profiling or analytics services.
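The contrast in this example can be sketched as two small decision functions. This is an illustration under stated assumptions, not the policy used in the experiments: the threshold values, the parameter names `accept_at` and `review_from`, and both functions are hypothetical.

```python
def score_to_link(score, threshold):
    """Baseline workflow: export an owl:sameAs link as soon as the score clears one threshold."""
    return score >= threshold

def triage(score, accept_at, review_from):
    """Governed workflow: map a score to an assertion status instead of a link.

    Returns None when the pair falls below the reviewable band and is not persisted.
    """
    if score >= accept_at:
        return "ontodup:StatusAutoAccepted"
    if score >= review_from:
        return "ontodup:StatusProposed"
    return None

# Illustrative thresholds only.
borderline_score = 0.98
exported_directly = score_to_link(borderline_score, threshold=0.95)
governed_status = triage(borderline_score, accept_at=0.981, review_from=0.6225)
```

With these illustrative values, the baseline exports the borderline pair immediately, whereas the governed flow records it as `ontodup:StatusProposed` and leaves the operational graph unchanged until the assertion is validated.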
7.6. Limitations, Threats to Validity and Implications for Reuse
8. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ABox | Assertional Box |
| BIBO | Bibliographic Ontology |
| D-A | DBLP-ACM (benchmark dataset) |
| D-S | DBLP-Scholar (benchmark dataset) |
| DBLP | Digital Bibliography & Library Project |
| DRD | Duplicate Record Detection |
| EM | Entity Matching |
| FN | False Negatives |
| FP | False Positives |
| F1 | F1-score |
| KG | Knowledge Graph |
| IRI | Internationalized Resource Identifier |
| LLM | Large Language Model |
| MAKG | Microsoft Academic Knowledge Graph |
| OCR | Optical Character Recognition |
| OWL | Web Ontology Language |
| PIE | PIE rules/constraints (as used by the paper’s inference engine) |
| Prec | Precision |
| PROV-O | PROV Ontology |
| RDF | Resource Description Framework |
| Rec | Recall |
| SKOS | Simple Knowledge Organization System |
| SPARQL | SPARQL query language |
| TBox | Terminological Box |
| TP | True Positives |
| VIVO | VIVO ontology/infrastructure |
| XSD | Extensible Markup Language Schema Datatypes |
Appendix A. Formal PIE Rules for Operational Materialization
| Rule | Formal Form (Antecedent ⇒ Consequent) |
|---|---|
| ontodupApplyAutoAcceptedSameAs | |
| ontodupApplyHumanValidatedSameAs | |
| ontodupApplyAutoacceptedNotSameAs | |
| ontodupApplyHumanValidatedNotSameAs | |
| ontodupDetectConflictFromAssertions | |
| ontodupDetectConflictFromAssertionsSwapped | |
| ontodupMarkConflictingAssertionsSameAs | |
| ontodupMarkConflictingAssertionsNotSameAs | |
| ontodupNotSameAsSymmetric | |
| ontodupCloseNotSameAsThroughSameAsLeft | |
| ontodupCloseNotSameAsThroughSameAsRight | |
| ontodupConflictWithSymmetric | |
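The formal antecedents and consequents are not reproduced in the table above, so the following sketch is only one plausible reading based on the rule names: `ontodup:notSameAs` is treated as symmetric and propagated across `owl:sameAs` clusters, and a pair is flagged as conflicting when it is asserted `owl:sameAs` while also being derivable as `ontodup:notSameAs`. The function names and the union-find formulation are assumptions, not the PIE rules themselves.

```python
from itertools import product

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def close_and_detect(same_as, not_same_as):
    """Return (closed notSameAs pairs, conflicting sameAs pairs) as sets of frozensets.

    same_as, not_same_as: sets of (a, b) tuples taken from accepted assertions.
    """
    nodes = {n for pair in same_as | not_same_as for n in pair}
    parent = {n: n for n in nodes}
    for a, b in same_as:
        parent[find(parent, a)] = find(parent, b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(parent, n), set()).add(n)

    # Propagate each negative link to every member of the two sameAs clusters.
    closed = set()
    for a, b in not_same_as:
        for x, y in product(clusters[find(parent, a)], clusters[find(parent, b)]):
            if x != y:
                closed.add(frozenset((x, y)))

    conflicts = {frozenset(pair) for pair in same_as if frozenset(pair) in closed}
    return closed, conflicts

closed, conflicts = close_and_detect(
    same_as={("a", "b"), ("a", "c")},
    not_same_as={("b", "c")},
)
```

In this example both positive links are flagged, because accepting them together would identify `b` with `c`, contradicting the negative assertion. Nothing is retracted; the pairs are only marked for inspection.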
Appendix B. SPARQL Query Templates for Audit and Operational Inspection
- Listing A1 retrieves the audit queue (proposed assertions) together with optional LLM cost metadata and method attribution.
- Listing A2 enumerates conflict-flagged pairs and the assertions involved.
- Listing A3 returns consumable owl:sameAs links filtered by the absence of contradictions between entities under incompatible assertions (ontodup:conflictWith).
- Listing A4 provides end-to-end traceability for each consumable link by retrieving the backing ontodup:MatchAssertion along with evidence and cost metadata.
Listing A1. Audit queue: proposed assertions with optional LLM cost metadata.

    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX ontodup: <https://w3id.org/ontodup/ontology#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX prov: <http://www.w3.org/ns/prov#>

    SELECT DISTINCT ?ma ?s ?p ?o ?algo ?tokens ?lat
    WHERE {
      GRAPH ?g {
        ?ma a ontodup:MatchAssertion ;
            rdf:subject ?s ;
            rdf:predicate ?p ;
            rdf:object ?o ;
            ontodup:status ontodup:StatusProposed ;
            prov:wasAttributedTo ?algo .
        OPTIONAL { ?ma ontodup:llmTotalTokens ?tokens_raw . }
        OPTIONAL { ?ma ontodup:llmLatencySeconds ?lat_raw . }
      }
      BIND(COALESCE(xsd:integer(?tokens_raw), 0) AS ?tokens)
      BIND(COALESCE(xsd:decimal(?lat_raw), 0.0) AS ?lat)
    }
    ORDER BY DESC(?tokens) DESC(?lat)

Listing A2. Conflicts: pairs flagged with conflictWith and the involved assertions.

    PREFIX ontodup: <https://w3id.org/ontodup/ontology#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT DISTINCT ?a ?b ?ma ?s ?o ?pred ?status
    WHERE {
      GRAPH ?gC { ?a ontodup:conflictWith ?b . }
      GRAPH ?gA {
        ?ma a ontodup:MatchAssertion ;
            rdf:subject ?s ;
            rdf:predicate ?pred ;
            rdf:object ?o ;
            ontodup:status ?status .
      }
      # Match assertions that refer to the same conflicting pair in either direction
      FILTER( (?s = ?a && ?o = ?b) || (?s = ?b && ?o = ?a) )
    }
    ORDER BY ?a ?b ?ma

Listing A3. Conflict-free operational consumption: owl:sameAs links not flagged by conflicts.

    PREFIX ontodup: <https://w3id.org/ontodup/ontology#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    SELECT DISTINCT ?s ?o
    WHERE {
      GRAPH ?g { ?s owl:sameAs ?o . }
      FILTER(?s != ?o)
      FILTER(STR(?s) < STR(?o))
      FILTER NOT EXISTS { ?s ontodup:conflictWith ?o . }
      FILTER NOT EXISTS { ?o ontodup:conflictWith ?s . }
    }

Listing A4. Traceability for a consumable link: backing assertion plus evidence and cost.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX ontodup: <https://w3id.org/ontodup/ontology#>

    SELECT DISTINCT ?s ?o ?ma ?status ?ev ?tokens ?lat
    WHERE {
      GRAPH ?gOp { ?s owl:sameAs ?o . }
      FILTER(?s != ?o)
      FILTER(STR(?s) < STR(?o))
      FILTER NOT EXISTS { ?s ontodup:conflictWith ?o . }
      FILTER NOT EXISTS { ?o ontodup:conflictWith ?s . }
      GRAPH ?gA {
        ?ma a ontodup:MatchAssertion ;
            rdf:subject ?s ;
            rdf:predicate owl:sameAs ;
            rdf:object ?o ;
            ontodup:status ?status .
        OPTIONAL { ?ma ontodup:hasEvidence ?ev . }
        OPTIONAL { ?ma ontodup:llmTotalTokens ?tokens . }
        OPTIONAL { ?ma ontodup:llmLatencySeconds ?lat . }
      }
      FILTER(?status IN (ontodup:StatusAutoAccepted, ontodup:StatusHumanValidated))
    }
    ORDER BY ?s ?o ?ma



| Dataset | Training #Pos | Training #Neg | Validation #Pos | Validation #Neg | Test #Pos | Test #Neg |
|---|---|---|---|---|---|---|
| DBLP-Scholar (D-S) | 4277 | 18,688 | 1070 | 4672 | 1070 | 4672 |
| DBLP-ACM (D-A) | 1776 | 8114 | 444 | 2029 | 444 | 2029 |
| Elements by Type | Governance Role | Alignment |
|---|---|---|
| Classes | ||
| ontodup:ScholarlyRecord | Source-level bibliographic record; unit of deduplication. | bibo:Document (compatible with VIVO-like graphs). |
| ontodup:MatchAssertion | Reified decision (subject–predicate–object) to attach state and evidence. | rdf:Statement. |
| ontodup:Evidence | Traceable container for evidence associated with an assertion. | Extensible; no fine-grained typing in the core. |
| Individuals | ||
| ontodup:AssertionStatusScheme | Controlled lifecycle scheme for assertions. | skos:ConceptScheme. |
| ontodup:StatusProposed | Recorded assertion pending decision or validation. | skos:Concept. |
| ontodup:StatusAutoAccepted | Policy-accepted assertion; eligible for materialization. | skos:Concept. |
| ontodup:StatusHumanValidated | Human-confirmed assertion; eligible for materialization. | skos:Concept. |
| ontodup:StatusHumanRejected | Human-rejected assertion; never materialized; overrides auto-acceptance. | skos:Concept. |
| ontodup:StatusConflict | Assertion flagged due to contradiction; requires inspection. | skos:Concept. |
| Object properties | ||
| ontodup:status | Assigns a governance state to an assertion. | Range restricted to ontodup:AssertionStatus. |
| ontodup:hasEvidence | Links evidence item(s) to an assertion. | Range: ontodup:Evidence. |
| ontodup:notSameAs | Negative relation for blocking and constraint closure. | Symmetric, irreflexive. |
| ontodup:conflictWith | Marks contradiction between entities under incompatible assertions. | Symmetric, irreflexive. |
| Datatype properties | ||
| ontodup:llmTotalTokens | Operational footprint of the LLM judgment per assertion. | Experimental metadata (xsd:integer). |
| ontodup:llmLatencySeconds | End-to-end LLM latency per assertion. | Experimental metadata (xsd:decimal). |
| Role and Artifact | Type | Key Content |
|---|---|---|
| OntoDup core ontology ontodup-core.vivo.ttl | TBox | Defines ontodup:ScholarlyRecord, ontodup:MatchAssertion, the status lifecycle and properties such as ontodup:localId (used as a stable intra-source reference) and governance links. |
| Provenance vocabulary ontodup-vocab.vivo.ttl | ABox | Declares c4o:BibliographicInformationSource instances for DBLP/ACM/Scholar with stable IRIs that are referenced from dcterms:source. |
| Model | AutoAcc | #Pos | TP | FP | FN | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| DBLP-ACM (D-A) | ||||||||
| Ditto | 409 | 444 | 403 | 6 | 41 | 0.9853 | 0.9077 | 0.9449 |
| gpt-4o-mini | 446 | 444 | 431 | 15 | 13 | 0.9664 | 0.9707 | 0.9685 |
| gpt-4o | 460 | 444 | 443 | 17 | 1 | 0.9630 | 0.9977 | 0.9801 |
| gpt-4.1-nano | 486 | 444 | 429 | 57 | 15 | 0.8827 | 0.9662 | 0.9226 |
| gpt-4.1-mini | 475 | 444 | 442 | 33 | 2 | 0.9305 | 0.9955 | 0.9619 |
| gpt-4.1 | 465 | 444 | 444 | 21 | 0 | 0.9548 | 1.0000 | 0.9769 |
| DBLP-Scholar (D-S) | ||||||||
| Ditto | 1003 | 1070 | 973 | 30 | 97 | 0.9701 | 0.9093 | 0.9387 |
| gpt-4o-mini | 816 | 1070 | 762 | 54 | 308 | 0.9338 | 0.7121 | 0.8081 |
| gpt-4o | 1051 | 1070 | 978 | 73 | 92 | 0.9305 | 0.9140 | 0.9222 |
| gpt-4.1-nano | 1299 | 1070 | 999 | 300 | 71 | 0.7691 | 0.9336 | 0.8434 |
| gpt-4.1-mini | 1175 | 1070 | 1002 | 173 | 68 | 0.8528 | 0.9364 | 0.8927 |
| gpt-4.1 | 1111 | 1070 | 982 | 129 | 88 | 0.8839 | 0.9178 | 0.9005 |
| Model | AutoAcc | Prop | Rejected | Pos. in Prop | Pos. in Rejected |
|---|---|---|---|---|---|
| DBLP-ACM (D-A) | |||||
| Ditto | 409 | 41 | 2023 | 34 | 7 |
| gpt-4o-mini | 446 | 10 | 2017 | 7 | 6 |
| gpt-4o | 460 | 109 | 1904 | 1 | 0 |
| gpt-4.1-nano | 486 | 0 | 1987 | 0 | 15 |
| gpt-4.1-mini | 475 | 3 | 1995 | 1 | 1 |
| gpt-4.1 | 465 | 82 | 1926 | 0 | 0 |
| DBLP-Scholar (D-S) | |||||
| Ditto | 1003 | 48 | 4691 | 30 | 67 |
| gpt-4o-mini | 816 | 185 | 4741 | 153 | 155 |
| gpt-4o | 1051 | 524 | 4167 | 91 | 1 |
| gpt-4.1-nano | 1299 | 9 | 4434 | 5 | 66 |
| gpt-4.1-mini | 1175 | 78 | 4489 | 29 | 39 |
| gpt-4.1 | 1111 | 461 | 4170 | 84 | 4 |
| Model | n | Mean Prompt Tokens | Mean Completion Tokens | Mean Total Tokens | Mean Latency (s) | Total Tokens | Total Latency (s) |
|---|---|---|---|---|---|---|---|
| DBLP-ACM (D-A) | |||||||
| gpt-4o-mini | 2473 | 315.99 | 14.00 | 329.99 | 0.8570 | 816,069 | 2119.35 |
| gpt-4o | 2473 | 315.99 | 14.00 | 329.99 | 1.1699 | 816,069 | 2893.24 |
| gpt-4.1-nano | 2473 | 315.99 | 14.02 | 330.01 | 1.0937 | 816,117 | 2704.79 |
| gpt-4.1-mini | 2473 | 315.99 | 14.00 | 329.99 | 1.0963 | 816,069 | 2711.20 |
| gpt-4.1 | 2473 | 315.99 | 16.00 | 331.99 | 1.1579 | 821,023 | 2863.53 |
| DBLP-Scholar (D-S) | |||||||
| gpt-4o-mini | 5742 | 289.56 | 14.00 | 303.56 | 1.1489 | 1,743,021 | 6597.15 |
| gpt-4o | 5742 | 289.56 | 14.00 | 303.56 | 1.2092 | 1,743,021 | 6942.97 |
| gpt-4.1-nano | 5742 | 289.56 | 14.02 | 303.57 | 1.0806 | 1,743,117 | 6204.67 |
| gpt-4.1-mini | 5742 | 289.56 | 14.00 | 303.56 | 1.0923 | 1,743,021 | 6271.86 |
| gpt-4.1 | 5742 | 289.56 | 15.99 | 305.54 | 1.1158 | 1,754,439 | 6407.18 |
| Dataset | k | Blocking Recall | Candidate Pairs |
|---|---|---|---|
| DBLP-ACM | 5 | 0.996 | 13,080 |
| DBLP-Scholar | 150 | 0.981 | 392,400 |
| Dataset | Materialized Links | Positives | TP | FP | FN | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|---|
| DBLP-ACM | 2120 | 2220 | 2066 | 54 | 154 | 0.974528 | 0.930630 | 0.952070 |
| DBLP-Scholar | 2935 | 5347 | 2894 | 41 | 2453 | 0.986030 | 0.541230 | 0.698860 |
| Dataset | Autoaccepted Links | Proposed Assertions | Proposed Positives | Missing Positives |
|---|---|---|---|---|
| DBLP-ACM | 2120 | 642 | 141 | 13 |
| DBLP-Scholar | 2935 | 2473 | 1766 | 687 |
| LeftRecord | RightRecord | MatchScore | Status | EvidenceSummary |
|---|---|---|---|---|
| DBLP/2123 | ACM/0 | 1.0000 | AutoAccepted | blocking = DeepBlocker = 0.6225 = 0.9810 score = 1.0000 |
| DBLP/1470 | ACM/1 | 1.0000 | AutoAccepted | blocking = DeepBlocker = 0.6225 = 0.9810 score = 1.0000 |
| DBLP/2360 | ACM/1886 | 0.9800 | Proposed | blocking = DeepBlocker = 0.6225 = 0.9810 score = 0.9800 |
| DBLP/180 | ACM/1668 | 0.9799 | Proposed | blocking = DeepBlocker = 0.6225 = 0.9810 score = 0.9799 |
| Dataset | Base Records | Inferred (No Rejections) | Inferred (with Rejections) | Rejection Overrides | ΔInferred |
|---|---|---|---|---|---|
| DBLP-ACM | 4910 | 2660 | 2690 | 54 | +30 |
| DBLP-Scholar | 66,879 | 61,572 | 61,604 | 41 | +32 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Galán-Mena, J.; López-Nores, M.; Pulla-Sánchez, D.; Guerrero-Vásquez, L.F.; Salgado-Guerrero, J.P. OntoDup: Governance-Aware Entity Matching for Scholarly Knowledge Graph Deduplication. Information 2026, 17, 325. https://doi.org/10.3390/info17040325

