Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management

Borgelt, Hendrik; Kitel, Frederick Gabriel; Kockmann, Norbert

doi:10.3390/computers14080311

Open AccessArticle

Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management

by

Hendrik Borgelt

^*

,

Frederick Gabriel Kitel

and

Norbert Kockmann

^*

Laboratory of Equipment Design, Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Emil-Figge-Str. 68, 44227 Dortmund, Germany

^*

Authors to whom correspondence should be addressed.

Computers 2025, 14(8), 311; https://doi.org/10.3390/computers14080311

Submission received: 1 July 2025 / Revised: 22 July 2025 / Accepted: 25 July 2025 / Published: 1 August 2025

Download

Browse Figures

Versions Notes

Abstract

Catalysis research is complex and interdisciplinary, involving diverse physical effects and challenging data practices. Research data often captures only selected aspects, such as specific reactants and products, limiting its utility for machine learning and the implementation of FAIR (Findable, Accessible, Interoperable, Reusable) workflows. To improve this, semantic structuring through ontologies is essential. This work extends the established ontologies by refining logical relations and integrating semantic tools such as the Web Ontology Language or the Shape Constraint Language. It incorporates application programming interfaces from chemical databases, such as the Kyoto Encyclopedia of Genes and Genomes and the National Institute of Health’s PubChem database, and builds upon established ontologies. A key innovation lies in automatically decomposing chemical substances through database entries and chemical identifier representations to identify functional groups, enabling more generalized reaction classification. Using new semantic functionality, functional groups are flexibly addressed, improving the classification of reactions such as saponification and ester cleavage with simultaneous oxidation. A graphical interface (GUI) supports user interaction with the knowledge graph, enabling ontological reasoning and querying. This approach demonstrates improved specificity of the newly established ontology over its predecessors and offers a more user-friendly interface for engaging with structured chemical knowledge. Future work will focus on expanding ontology coverage to support a wider range of reactions in catalysis research.

Keywords:

knowledge graph; reasoning; chemical reaction; NFDI4Cat; SHACL; SPARQL

1. Introduction

The Findability, Accessibility, Interoperability, and Reusability (FAIR) [1] of research data is becoming increasingly more important, not only for making data more interoperable for machine learning processes but also to accelerate research processes in general. One of the limiting factors in the FAIRness is the availability and quality of metadata. With terminologies like the IUPAC Goldbook [2], National Cancer Institute Thesaurus (NCIT) [3], and Voc4Cat [4], metadata aspects such as the gas hourly space velocity of a component used in a chemical experiment can be addressed. Nonetheless, the usefulness of such a metadata representation can be questionable. Detailed information is, of course, highly valuable, specifically for reevaluating experiments. However, it also increases the required querying capabilities of a researcher if they opt for simpler, more generic queries, such as inferring knowledge from existing metadata and transferring it into a representation more closely aligned with common terminology. The Reac4Cat ontology [5] established a generic data pattern for modeling reactions and enabling the classification of reactions occurring in a reaction experiment. This approach is further extended here, as challenges with having a common data representation and logic integration have been addressed. To extract and generate a common knowledge representation, pipelines for knowledge insertion are presented, along with an approach for extracting and further enriching knowledge for reaction classification. Specifically, the decomposition of a chemical entity into its functional subgroups and their representation in an ontology-aligned knowledge graph are presented in detail. The classification of chemical reactions is based on the terminology of the Reac4Cat ontology. However, a pipeline is presented for introducing reference reactions and a new classification technique.

The concept of classifying reactions and, more specifically, experiments performed in the lab, with possibly unknown side reactions, is not a novel approach. Databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [6] or the Rhea curated reaction database [7] are approaches to documenting a collection of reactions that already implement a semantic background such that, for example, reactants can be commonly addressed via terminology provided via the Chemical Entities of Biological Interest (ChEBI) ontology [8]. Because there is interest in maintaining a curated and therefore validated list of reactions, simple experiments that do not meet a certain standard of documentation are not necessarily suitable for inclusion in such a database. As research continually progresses, it is essential to classify, document, and occasionally publish new data through a knowledge graph. Therefore, the presented approach focuses more on immediate interaction with new data. A somewhat similar approach to modeling reactions and, more specifically, reaction networks can be found in the tool set surrounding OntoRXN [9]. OntoRXN is designed to calculate the specific reaction network associated with a particular reaction. Since this approach is not only more detailed and thereby more complex for querying purposes but also requires more knowledge of the process than often provided via, for example, an electronic lab notebook (ELN), this approach is regarded as promising as an extension for mature enough research data management pipelines.

2. Materials and Methods

To illustrate how the pipeline works, two example reactions are discussed that are referenced throughout. The first is a fat cleavage (hydrolysis) into respective fatty acids and trivalent glycol alcohol. This is used to examine flexibility and axiomatization in greater detail. The general formula can be seen in Figure 1.

The other reaction is an exemplary chemical reaction in which the substance quantities of the reactants and the products are hidden behind unexpressive data, such as the gas hourly space velocity or the liquid volume flows. Furthermore, it is a two-stage reaction, with the main reaction and a subsequent reaction that produces by-products. This reaction involves the hydrolysis of isoamyl acetate, followed by the oxidation of the resulting isoamyl alcohol with oxygen to form the respective isoamylaldehyde. The reaction is shown in Figure 2. This reaction is characteristically used to identify partial reactions or the reactants from a mixture that are required for classification as a specific reaction, in which other chemical reactions are present. Hence, the reaction in Figure 2 is also shown in its individual reaction steps. Since the decomposition into individual reaction steps and chemical activity states can be modeled in different levels of detail (individual quantum states, other steric arrangements), only the schematic procedure is presented here. It is up to the use case/user to choose the desired level of detail. Characteristically, Figure 2 shows a reaction in which the second reaction consumes the products of the first reaction. The specification of a stoichiometry that assumes a complete conversion would be critical in the case of an experimental reaction since such a complete conversion does not have to be achieved, e.g., due to reactant deficiency. Since the subsequent reaction requires reactants from the first reaction, the reactants of the second reaction are to be found both among the reactants and among the products of the first reaction. The implications for modeling and evaluation will be discussed in more detail later. However, it is worth noting here that real chemical experiments cannot always be reproduced using the idealized representation.

As a base for the description of chemical reactions, both colloquial chemical names and IUPAC designations, as well as the description using the International Chemical Identifier (InChI) [10], the InChI key (version of the respective InChI limited to 27 characters), and the Simplified Molecular Input Line Entry System (SMILES) [11] notation, as well as the associated SMILES arbitrary target specification (SMARTS) [12] notation, are used. In comparison, the colloquial chemical names are less precise than the IUPAC name, the SMILES notation, or InChi/InChi-key. However, the colloquial name remains in common use in many databases. IUPAC designation, SMILES notation, and InChI/InChI-key all have slightly different areas of application, where the informative value increases in respective order and the complexity of the description also increases. The SMILES notation remains the focus, as its semantics directly convey information about the structural composition. Nevertheless, this structural information is more readily interpretable than InChI Semantic, and there is therefore a plentiful supply of tools capable of analyzing the SMILES notation. A tool, which is intended for the evaluation of SMILES notations, is the SMARTS notation. It goes beyond conventional pattern matching, such as regular expressions, and can be used, for example, to restrict the neighborhood of a functional group described using SMILES notation. Examples of the syntax of the different formats are given in Table 1. An example of pattern matching via regular expression and SMARTS notation is shown in Table 2. It can be seen here that the SMARTS notation also examines the neighborhood (marked by underlining in Table 2).

The data basis for the tools presented here is provided by four pillars: manual data entry, uploading using forms, data exchange with databases, and the knowledge pre-modeled in ontologies. Since both manual data entry methods and form-based incorporation are tools adapted to the methodology presented here, existing knowledge sources are discussed first.

In this case, PubChem [14] and KEGG represent the knowledge sources addressed via Application Programming Interface (API). The KEGG is a database of chemical, genomic, health, and reaction network (self-described as “systems”) information published in 1995. The 12,306 reactions with 3202 reaction classes paired with 19,508 chemical substances form a good basis as a reference database, which is manually curated by the Kanehisa Laboratory at Kyoto University [6]. The KEGG utilizes its proprietary chemical identifiers, as well as identifiers from PubChem and ChEBI. One of the reactions can be seen in the sketch in Figure 3, which only shows some of the information provided from the KEGG database.

PubChem is a database provided by the National Library of Medicine and it covers an even larger number of chemicals, with over 121 million curated and 334 million community-derived entries [14]. Its description of a chemical and references to other databases are typically more detailed than those of KEGG. The primary advantage here is that all curated PubChem entries have both a SMILES and an InChI ID, which can also be queried via the API. In the following method, it serves as the basis for describing and referencing chemical entities in the knowledge graphs, thanks to its broader scope of chemical entities.

ChEBI is used to provide a standardized basis for describing chemical entities. Due to its existing structure, in the form of its taxonomy and hierarchy, it gives a good basis for describing chemicals. Since both the KEGG and PubChem lists the ChEBI class ID as a reference for their chemicals; instantiating the chemicals is also simplified.

In addition to using the ChEBI ontology, various ontologies and fragments thereof are reused, all of which come from the collection of Open Biological and Biomedical Ontologies (OBO) Foundry [16] ontologies. The ontology basis or top level is the BFO [17], but, due to limiting the number of mentioned reused ontologies, only parts from sub-level ontologies are further described. Important to note is the RXNO [18], which provides some of the main terminology for classifying reactions based on a named reaction, the MOP [19], which is somewhat of a predecessor to the RXNO, by refining, adding, and also extending the reaction description with more molecular processes, and the Reac4Cat ontology, which itself is not an OBO collection ontology but is aligned with the OBO principals [16] and serves as the main terminology for describing and modeling reaction data of an experiment.

3. Results

A brief overview of the method presented here is shown in Figure 4.

As this is a multi-step procedure, each significant step will be presented in the following sub-chapters. The user input, which marks the start of data processing, is not discussed first, as the knowledge graph represents the core structure. The steps Data Extraction and Data Normalization (Section 3.2) and Data Normalization (Section 3.3) thus represent the preparation of the data for populating this graph, followed by Semantic Enrichment via Structure Analysis (Section 3.4). Accordingly, this section presents the changes and additions made in the modeling of ChEBI, RXNO, MOP, and the Reac4Cat ontology. Since parts of the modeling are based on the automated data evaluation in the Section 3.4, the general procedure is presented here. After processing the ontology, the eponymous procedure is described in the Section 3.3. The aim here is to prepare the reaction data so that it can be modeled in Reac4Cat. However, the reactants and products have also been converted into their SMILES notation, allowing for their evaluation. The previously introduced databases are used as a look-up service and to normalize the different chemical identifiers. The Section 3.4 describes the evaluation of chemical substances based on functional groups via SMILES and SMARTS notation. As described above, the decomposition also serves as a basis for new classes, leading to the extended knowledge graph described in the Section 3.1. Since an alternative logic to the original axiomatization from Reac4Cat is used, which is firmly based on SHACL [20] and SPARQL [21], this and the data processing step based on it are presented in the last section, Section 3.4. Both standard inferencing in OWL [22] and external inferencing using SPARQL are presented here.

3.1. Base Ontology Modifications

Modifying an ontology requires adherence to quality standards and best practices. In light of the challenges associated with designing automated modification steps, it is recommended that the ontology be developed and curated by a human expert. As such, some parts of the modifications to the ontology are currently hard-coded (changes to the code would be required to alter these modifications) into the script, which is intentionally done to limit the ease and thereby the chance of undesirable modifications. Only the general approach of extending the ontology is described here. For best practices, it is recommended to either have the ontology prepared by an expert or set up once, considering the use case-specific limitations, to facilitate an automatic extension script.

Reac4Cat describes the core ontological model for describing an experiment. The classification into reaction types is performed primarily through the design pattern of the RXNO/MOP, which is a slight modification from the Reac4Cat ontologies approach of using general class axioms or left-hand-side logic. Finally, the chemical entity modeling is handled via ChEBI. Starting with the chemical entity modeling, ChEBI is used in its ontological design, rather than as a database. In other words, while the classes of the ChEBI ontologies are used, representing a chemical entity involves referencing a real-world chemical entity to a digital counterpart representation, which, in this case, utilizes the PubChem database entries.

A representation of this can be seen in Figure 5, where ontology classes are shown as ellipses and data instances are depicted via the trapezoidal shape. Where new classes are needed, they must be added to incorporate new chemical entities represented in the experimental data. Examples of this include the isoamylol class, which can be reused from the ChEBI ontology in this case. Such an addition of classes would occur, for example, for the permutation of various esters that have variations in their alkyl groups. For the data instance, the direct link to the PubChem website is used. This, in turn, helps with digital agents, which can, for example, utilize the PubChem API to extract more information, such as physical properties like density. While this information could also be imported into the knowledge graph, limiting the knowledge to a required minimum is beneficial for reasoning and querying operations.

As the ChEBI database does not include as many chemical substances as the PubChem database, manually adding chemical entity representations, such as the isoamylol class, to the used ontology is recommended for curation purposes, provided there is no complete overlap between the two databases.

While the ChEBI ontology classifies chemical entities via characteristic functionalities, such as a chemical entity being a carbonyl compound (CHEBI:35701 [23]), modeling reaction networks more holistically requires this description to be more specific. For example, a chemical in the group of carbonyl compounds can be an aldehyde, a ketone, or even both. As such, functional group modeling is introduced, whereby a chemical entity such as isoamyl acetate is assigned multiple functional groups via the has_functional_group relation. This can be seen in Figure 6. The functional group instances are, by intention, not depicted with their respective Uniform Resource Identifier (URI) as, due to the number of possible functional groups, no terminology service currently provides them with a fixed URI. As such, local reference URIs or RDF blind notes are used. Here, for example, https://github.com/AleSteB/Reac4Cat/blob/main/reac4cat_with_examples.owl#Isoamylol_residue is used, which, by design, does not resolve correctly. These functional groups are also assigned to classes, which are designed with the general ChEBI design pattern for chemical entities.

To differentiate between reference reactions and experimental data, two distinct data patterns are employed. Figure 7 represents the data pattern for reference reactions, where new reference data should be classified via axioms of the respective reaction classes. This enables various permutations of a reaction to be introduced and still classified as the same reaction. Specifically, permutations of chemical entities (e.g., isoamylol, butanol) are presented by the axioms here. However, the adoption of more advanced axiomatic systems has the potential to impose restrictions not only on chemical entities but also on environmental factors. Due to the modeling of functional groups, as demonstrated by the isoamylol and isoamylol class restrictions, more complex yet refined axioms can be designed for the main reaction class.

Figure 8 presents the data pattern used for experimental reactions. This model is derived from the Reac4Cat ontology [5] and is intentionally different from the reference data pattern. For purposes of simplicity, Figure 8 only sketches the design pattern. Please refer to the original publication for the in-depth description. The primary objective of this differentiation is to prevent the drawing of inferences that should not be drawn. For reference reactions, it can be assumed that all data are available. However, experimental data may be incomplete. Reasoning based on such incomplete information may cause inconsistencies and should, therefore, be avoided. By using different data patterns and thereby different relations between instances describing reference datasets and experimental datasets, the axioms and thus the logic can differentiate between different data patterns.

With an ontology generated through this approach, the data can be classified as described in Section 3.4.

3.2. Data Extraction and Data Normalization

To introduce experimental data and reference data into the data processing pipeline, a graphical user interface (GUI) is developed, allowing researchers to perform these evaluations independently, rather than relying on a hosted service. What is presented here could, of course, be established as a web service. As the current method still has some limitations that should be addressed first, no details of hosting this as a web service are discussed in this work. Work in this field is, however, still ongoing and will be part of future work.

As shown in Figure 9, the GUI has three primary data import methods: manual entry, by EnzymeML data sheets [24], and by using the “Upload Manually” button. Additionally, data can be imported by searching for a KEGG reaction in the search bar. KEGG entries are considered as reference reactions, while imports over the manual route and the EnzymeML data sheet are considered as experimental data. For manual entry, the data is processed directly. In contrast, for the EnzymeML sheet, the data is extracted by the pyEnzymeML package [25] and then processed. Both, the manual route and the EnzymeML data sheets, can directly provide SMILES notations, which can be used to find the respective entries in the PubChem database and perform SMARTS pattern analysis. If the SMILES are not provided, a manual search using the colloquial name is performed to retrieve the SMILES notation from PubChem for SMARTS pattern analysis. For the import of KEGG reactions, only the reaction identifier is needed, such as ‘R12516’; the reaction entry is shown in Figure 3. The respective PubChem Entry and SMILES notation are retrieved through the KEGG API.

The missing PubChem entries and SMILES notations are all collected by the PubChem API, and the PubChem entries (URIs) are instantiated in the knowledge graph. For normalization purposes, a SMILES notation is required to perform the SMARTS pattern analysis. As such, entries with only InChI notation are converted into their respective SMILES notation through a search over the PubChem API.

3.3. Semantic Enrichment via Structure Analysis

For semantic enrichment of the data, functional groups are extracted from the chemical entities and modeled in the knowledge graph. This approach offers the advantage that class axioms for general reactions do not necessitate exact chemical entity matching. Instead, one can rely on the characteristics of the functional groups of a specific chemical entity for classification. This enables the further generalization of the axioms used for describing chemical reactions. Since decomposing complex structures, such as the isoamyl residue of isoamyl acetate, into further functional groups, such as a methyl residue, is possible, one needs to decide to what level this decomposition is reasonable. Since, to the authors’ knowledge, there are no ontologies/RDF terminologies that model the hierarchical structure of such function groups. These hierarchies, which are necessary for the respective modeling purposes, are not pre-specified. It is up to the user of this method to describe the required hierarchy. To make the functional groups more accessible for the reasoning purposes of an ontology, the classes of these functional groups should be modeled when a decision is made on which SMARTS patterns to use for pattern matching. The matching process, which is performed using the SMARTS pattern and RDKit (Release_2025_03_4) [26], the respective Python 3.13 package for evaluating these patterns, is then used to model the respective functional groups in the knowledge graph.

3.4. Automated Semantic Analysis and Reasoning

Two different methods are used to classify the experimental reaction tests and the reference reactions. Since ontologies are used to describe data deterministically and also factually correctly, experimental reactions pose a challenge. Metadata is often found to be incomplete and relies on human interpretation. A guarantee of accuracy is not always possible, even by humans. In order not to burden the logic of an ontology with errors, only the reference reactions that are assumed to be factually correct are classified directly in the ontology. The classification of experimental data, which, in some cases, must be regarded as factually incorrect or incomplete, is only classified by comparison with the reference measurements. Two different data processing methods are chosen for this purpose. For the regular classification of reference reactions, the standard OWL logic, located in the ontology, is used. This is checked using a reasoner such as HermiT [27], which is used to perform automatic reasoning. The general logic was already explained in Figure 7.

A different procedure is used for the comparison and classification of experimentally recorded reactions. Apart from additional conditions that can be implemented, it is assumed that an experiment contains a defined reference reaction if the reactants and products of this reference reaction are present in the experiment. As this can still be subject to errors, only an annotated reference is generated, rather than performing the complete logical classification. This should only be carried out if a classification can be assumed to be error-free with sufficient certainty. As the number of possible rules exceeds the scope of the description here, the general procedure is described.

To clearly emphasize the difference between a reference reaction and an experiment, a distinction in the data patterns was chosen to describe the information, as in Section 3.1 Base Ontology Modifications and Figure 7 and Figure 8. Applying the ontology logic only to the reference reactions, rather than it being erroneously applied to the experimental reactions, thereby reduces the potential for error. The comparison of whether a reference reaction is present in an experiment is now performed with a SPARQL query instead of the complex logic used in the first version of the Reac4Cat ontology. This offers some advantages, especially in terms of computational effort. Since the logic or query is only applied to an excerpt of the knowledge graph, rather than the entire knowledge graph, memory demand and, consequently, computational time are reduced.

To better capture the logic embedded in the SPARQL query [21], SHACL [20], especially the SHACL Advanced Feature set (SHACL-AF) [28], is used to generate a fixed semantic artifact, which can be imported into the actual ontology by an import statement in the ontology file. To work with the advanced feature set, pySHACL is used as the Python package [29]. The advantage of SHACL-AF is that a SPARQL query can be executed in a SHACL shape. Additionally, SHACL-AF assumes that modifying the actual knowledge graph with SHACL validation should be permitted. In contrast, previous SHACL specifications were intended only for validation, without any modification to the knowledge graph.

In the SPARQL query, which is applied outside the actual SHACL shape as part of the provided code, it is examined whether the reactants of the reference reaction match those of the experiment and also whether the products of the reference reaction match those of the experiment. The SPARQL query is given in Figure 10.

The comparison conducted herein is executed through the utilization of the “FILTER NOT EXISTS” function terms. Consequently, a one-to-one matching of the reactants and products is not performed; instead, it is ensured that all reactants and products can be identified within the experiment. Once these are found, the experiment is linked to the reference reaction using the relation “has_predicted_reaction_part”. Further criteria and checks can, of course, be integrated into the SPARQL query, which increases the accuracy of the reaction classification. Thanks to the very generic terminology used in the SPARQL query, it is possible to write simple axioms, such as those found in an ontology. However, the logic of the actual ontology should not be compromised by errors in the data. Additionally, the preceding SHACL shape also enables the detection of general modeling errors, which, in turn, ensure the smooth functioning of the ontology logic’s reasoning and thereby simplify the use of more complex logics.

The results of the SHACL validation, the OWL reasoning, and the SPARQL querying are displayed in the GUI, allowing them to be cross-checked and validated by the researcher.

4. Discussion

The presented technique enables the modeling of experimental reaction datasets with minimal metadata in greater detail within the Reac4Cat ontology by extracting functional groups, which also facilitates comparison with reference reactions. Referenced reactions are also represented in more detail as reaction classification is possible through the detailed description of the functional groups. By this increase in modeling depth, which comes without great effort for the curators and users, further restrictions on the reaction classification can be integrated more meaningfully.

Furthermore, the use of SPARQL queries, which are embedded in SHACL shapes thanks to the SHACL-AF, enables complex logic, such as that presented in the Reac4Cat ontology, to be performed outside the actual OWL reasoning. This saves resources precisely because the left-hand-side logic [30] is very computationally expensive. In contrast, only small parts of the graph need to be queried and checked thanks to the SHACL-SPARQL query. The general class axiom approach of the Reac4Cat ontology was found to be quite demanding in terms of computational resources. Due to the complexity of the axioms on the left-hand side of logic, it was found that reasoning performed with HermiT resulted in reasoning times of up to one day, which often exceeded the stable runtime of the reasoner itself. As testing the cause of this time effect for the left-hand side logic was not feasible, the SHACL-AF was prioritized for use. Here, querying times of less than one second were found for datasets with up to forty reaction entities. As this appeared to be following regular SHACL validation and SPARQL query times, no further investigation was performed.

Combined with a simple user interface and the possibility of implementing the described method as a background service linked to a chemical database sample, this also provides a means for such methods as those used in reaction and catalysis research to be easily implemented in everyday research. Other databases that use similar metadata and APIs can also be easily integrated.

However, the quality of the reaction classification should also be discussed again. For the experiment in Figure 2, the SPARQL query identifies the reaction of isoamyl acetate and water to form isoamyl alcohol and acetate. A reference is then established using the has_predicted_reaction_part relation. To identify the subsequent reaction of isoamylol with oxygen to form isoamylaldehyde and water, not only the reactants but also the products are searched for as possible reactants in the SPARQL query. However, the number of potential reactions that could be found here also increases considerably.

Due to the more detailed modeling of chemical entities using functional groups, a more complex classification is also possible. For example, a distinction can be made between simple ester cleavage and complete hydrolysis/saponification of a fatty molecule, where modeling functional groups is advantageous for describing glycerol. Other areas of application include, for example, the description of lactam ring cleavages, where rings of different chain lengths are depicted.

Several limitations still restrict automated data processing, particularly in terms of metadata. Firstly, there is a lack of and inconsistent metadata and metadata schema. For example, specifications such as SMILES notation or InChI notation are not always mandatory, even if the specification of chemical entities as metadata is required for reaction data sets. Inconsistent notations can quickly lead to mix-ups, not only in data sets but also in databases, for example, when the dissolved ionic form of acetate is intended but the notation does not indicate this. The existing ontologies also have weaknesses that will be corrected over time. For example, there are few alternatives to ChEBI, which, for many chemical substances, assumes the state found in the human body as the standard chemical state of a substance. It is also important to mention the limitations of the proposed approach here. Due to the lack of a terminology that can provide the functional group modeling required for the correct embedding of instances generated via the SMARTS pattern comparison, parts of the code rely on hardcoded fragments. These hardcoded aspects, like the SMARTS pattern for functional groups, new functional group instances and classes, and new axioms for reaction classification, should be optimized in the future. Therefore, the development of embedding these functionalities close to either the ontology or available for the SMARTS pattern matching, which can be considered a digital agent, is ongoing.

5. Conclusions

A methodology was presented that can describe reactions in more detail than previous methods by incorporating functional groups more strongly into the modeling. The evaluation of which functional groups are present in a molecule is carried out by an automated data pipeline that processes different databases for this purpose. A detailed model was presented with which reference reaction data could be compared with data from chemical experiments. Care was taken to ensure the consistency of the RDF modeling by avoiding the implementation of any incorrect assumptions or statements. To achieve this modeling in a more resource-efficient manner than previous methods, a new approach was presented using the SHACL-AF, which combines the advantages of SHACL and SPARQL to perform a form of reasoning. A key advantage is that this approach can be easily linked to existing ontologies, as SHACL shapes represent an artifact that can be imported into the OWL language, unlike pure SPARQL queries. Two example reactions were used to show how this can be applied. While this methodology represents a continuation and refinement of existing approaches to processing reaction datasets in knowledge graphs, it is worth noting the approach’s extensibility. Through more detailed reaction descriptions, more complex modeling can be implemented in both the OWL and the SHACL-SPARQL query. However, it is essential that the modeling aligns with the data description, which is currently one of the limiting factors in many respects.

Author Contributions

Conceptualization, H.B. and F.G.K.; methodology, H.B. and F.G.K.; software, H.B. and F.G.K.; writing—original draft preparation, H.B.; writing—review and editing, F.G.K. and N.K.; supervision, N.K.; project administration, N.K.; funding acquisition, N.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the German Research Foundation (DFG), grant number 441926934, in the framework of the NFDI4Cat initiative.

Data Availability Statement

The data, such as the modified ontology and other raw files as well as the code, are provided at URL: https://hdl.handle.net/21.11165/4cat/2dc8-9kkd Version 1.0 (accessed on 24 July 2025). All other data is openly accessible through the respective database entries.

Acknowledgments

During the preparation of this manuscript/study, the author(s) used OpenAI’s ChatGPT GPT-4o for the purposes of reformulating and aligning preformulated text. Additionally, the author(s) used GitHub Copilot (Version 1.5.52-243 powered by GPT-4.1) for the purpose of generating the programming code behind the manuscript/study. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

API	Application Programming Interface
BFO	Basic Formal Ontology
ChEBI	Chemical Entities of Biological Interest
ELN	Electronic Lab Notebook
FAIR	Findable, Accessible, Interoperable, Reusable
KEGG	Kyoto Encyclopedia of Genes and Genomes
MOP	Molecular Process Ontology
NCIT	National Cancer Institute Thesaurus
OBO	Open Biological and Biomedical Ontologies
OWL	Web Ontology Language
RXNO	Name Reaction Ontology
SHACL	Shapes Constraint Language
SHACL-AF	SHACL Advanced Feature Set
SPARQL	SPARQL Protocol and RDF Query Language
URI	Uniform Resource Identifier

References

GO FAIR Initiative FAIR Principles. 2024. Available online: https://www.go-fair.org/fair-principles/ (accessed on 1 July 2025).
IUPAC Compendium of Chemical Terminology: IUPAC Gold Book. (International Union of Pure, 2019). Online Version 2.3.3. Available online: https://goldbook.iupac.org/ (accessed on 1 July 2025).
Sioutos, N.; Coronado, S.; Haber, M.; Hartel, F.; Shaiu, W.; Wright, L. NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. J. Biomed. Inform. 2007, 40, 30–43. [Google Scholar] [CrossRef] [PubMed]
Moustakas, N.; Behr, A.; Borgelt, H.; Huskova, N.; Khare, R.; Talab, M.; Köbl, J.; Chandrashekhar, V.; Petrenko, T.; Dörr, M.; et al. Voc4Cat: Vocabulary guidelines for NFDI4Cat (2.1). Zenodo 2025, 1–22. [Google Scholar] [CrossRef]
Behr, A.; Borgelt, H.; Kockmann, N. Reac4Cat-Ontology: Harnessing the Power of Ontological Description Logic in Catalysis Research as a Practical Approach to Knowledge Inferences. Datenbank Spektrum 2024, 24, 139–150. [Google Scholar] [CrossRef]
Kanehisa Laboratories KEGG: Kyoto Encyclopedia of Genes and Genomes. 2024. Available online: https://www.kegg.jp/kegg/kegg1a.html (accessed on 1 July 2025).
Bansal, P.; Morgat, A.; Axelsen, K.; Muthukrishnan, V.; Coudert, E.; Aimo, L.; Hyka-Nouspikel, N.; Gasteiger, E.; Kerhornou, A.; Neto, T.; et al. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res. 2021, 50, D693–D700. [Google Scholar] [CrossRef] [PubMed]
OBO Foundry/Ontobee Chemical Entities of Biological Interest (ChEBI) Ontology. 2025. Available online: https://ontobee.org/ontology/CHEBI (accessed on 1 July 2025).
Garay-Ruiz, D.; Bo, C. Chemical reaction network knowledge graphs: The OntoRXN ontology. J. Cheminform. 2022, 14, 29. [Google Scholar] [CrossRef] [PubMed]
Heller, S.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. J. Cheminform. 2015, 7, 23. [Google Scholar] [CrossRef] [PubMed]
Heller, S.; Gasteiger, A. SMILES: A Language for Chemical Structure Representation. J. Chem. Inf. Comput. Sci. 2001, 41, 703–712. [Google Scholar]
Daylight Chemical Information Systems, Inc. SMARTS—A Language for Describing Molecular Patterns. Available online: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed on 1 July 2025).
National Center for Biotechnology Information Isoamyl Acetate. Available online: https://pubchem.ncbi.nlm.nih.gov/compound/Isoamyl-acetate (accessed on 1 July 2025).
National Center for Biotechnology Information About PubChem. 2025. Available online: https://pubchem.ncbi.nlm.nih.gov/docs/about (accessed on 1 July 2025).
Genes, K. Genomes R12516—KEGG Entry. Available online: https://www.kegg.jp/entry/R12516 (accessed on 1 July 2025).
Jackson, R.; Matentzoglu, N.; Overton, J.; Vita, R.; Balhoff, J.; Buttigieg, P.; Carbon, S.; Courtot, M.; Diehl, A.; Dooley, D.; et al. OBO Foundry in 2021: Operationalizing Open Data Principles to Evaluate Ontologies. Database 2021, 2021, baab069. [Google Scholar] [CrossRef] [PubMed]
Arp, R.; Smith, B.; Spear, A. Building Ontologies with Basic Formal Ontology; Massachusetts Institute of Technology: Cambridge, MA, USA, 2015. [Google Scholar]
Batchelor, C. Chemical Reactions Ontology (RXNO). 2012. Available online: https://github.com/rsc-ontologies/rxno (accessed on 1 July 2025).
Batchelor, C. Molecular Process Ontology (MOP). 2012. Available online: https://github.com/rsc-ontologies/rxno (accessed on 1 July 2025).
Knublauch, H.; Kontokostas, D. Shapes Constraint Language (SHACL). (W3C,2017,7). Available online: https://www.w3.org/TR/2017/REC-shacl-20170720/ (accessed on 1 July 2025).
Harris, S.; Seaborne, A. SPARQL 1.1 Query Language. (W3C, 2013, 3). Available online: https://www.w3.org/TR/2013/REC-sparql11-query-20130321/ (accessed on 1 July 2025).
Dean, M.; Schreiber, G. OWL Web Ontology Language Reference. (W3C, 2004, 2). Available online: https://www.w3.org/TR/2004/REC-owl-ref-20040210/ (accessed on 1 July 2025).
EMBL’s European Bioinformatics Institute: CHEBI:35701—2-Hydroxybenzoic Acid. Available online: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=35701 (accessed on 1 July 2025).
Range, J.; Halupczok, C.; Lohmann, J.; Swainston, N.; Kettner, C.; Bergmann, F.; Weidemann, A.; Wittig, U.; Schnell, S.; Pleiss, J. EnzymeML—A Data Exchange Format for Biocatalysis and Enzymology. FEBS J. 2022, 289, 5864–5874. [Google Scholar] [CrossRef] [PubMed]
Initiative, E. PyEnzyme—A Python Library for EnzymeML Processing. 2025. Available online: https://github.com/EnzymeML/PyEnzyme (accessed on 1 July 2025).
Landrum, G.; Tosco, P.; Kelley, B.; Rodriguez, R.; Cosgrove, D.; Vianello, R.; Sriniker; Gedeck, P.; Jones, G.; Kawashima, E.; et al. rdkit/rdkit: 2025_03_4 (Q1 2025) Release (Release_2025_03_4). Zenodo 2025. [Google Scholar] [CrossRef]
Glimm, B.; Horrocks, I.; Motik, B.; Stoilos, G.; Wang, Z. HermiT: An OWL 2 Reasoner. J. Autom. Reason. 2014, 53, 245–269. [Google Scholar] [CrossRef]
Knublauch, H.; Steyskal, S.; Allemang, D. SHACL Advanced Features. (W3C, 2017, 6). Available online: https://www.w3.org/TR/2017/NOTE-shacl-af-20170608/ (accessed on 1 July 2025).
Sommer, A.; Car, N. pySHACL. (2025, 3). Available online: https://github.com/RDFLib/pySHACL (accessed on 1 July 2025).
Sattler, U.; Stevens, R. Being Complex on the Left-Hand-Side: General Concept Inclusions. Available online: https://ontogenesis.knowledgeblog.org/1288/ (accessed on 15 February 2024).

Figure 1. Illustration of the hydrolysis of triglycerides. The R indicates different chain residues, which must not be the same.

Figure 2. Depiction of a reaction as found in experimental data and sub-reactions occurring in this experiment. The depiction of data from an experimental dataset (the top reaction equation) is therefore not stoichiometrically balanced, as the nature of experimental data does not provide a clear indication of stoichiometric balance. In the middle row, the main reaction intended for this experiment is depicted, and in the bottom row, a subsequent reaction is shown.

Figure 3. Sketch of a reaction entry in the KEGG reaction database. In this case, the most relevant information from reaction R12516 [15] is depicted. The reactants and products of the example reaction can be seen in the “equation” row, with their respective KEGG-specific chemical IDs.

Figure 4. The overall data transformation pipeline is depicted here. Two main data inputs are given via the user input, where new reaction data can be introduced, and the base ontology, where the pre-modified ontology can be introduced. Central results are depicted via the graphical user interface.

Figure 5. Representation of the generated class and instance hierarchy for chemical entities in the ontology. For description purposes, the RDF type is written on the right, while the classes have names and information similar to what can be found in the knowledge graph.

Figure 6. Representation of instance relations for chemical entities and their respective functional group residues in the ontology. For descriptive purposes, the structural formal is depicted alongside the instances used in the ontology. The instance name is only depicted for the PubChem entry, as instances using the Reac4Cat prefixes are intended to be blank nodes.

Figure 7. Depiction of the logical axioms and class hierarchy used for reference reactions. Depicted are only the axioms for the ester hydrolysis class and the isoamylol class, which are designed to be used for OWL reasoning. For a better overview, only the functional groups of isoamylol are shown.

Figure 8. Depiction of the data pattern used for experimental reactions. This pattern follows the Reac4Cat ontology [5] data pattern and as such is only sketched here. The data pattern differs from that of reference reactions in that it includes instances that describe the reactant and product mixture composition.

Figure 9. Illustration of three of the many windows of the GUI. The upper windows represent the main window, used for navigational purposes and to display results in the “Process Log” column. On the left side of this window, the introduced reactions, here, r0, r1, r2, r3, and R12516, are displayed. With the lower left window, the respective import method for new data can be chosen. In the lower right, the window for entering new experimental data manually is shown.

Figure 10. Illustration of the SPARQL query main body, used in the SHACL-AF shape, in order to compare experimental reaction data against referenced reaction data. The SPARQL query would be embedded in a SHACL shape using SPARQL rules from the advanced feature set and would be accompanied by information, such as the used prefixes. In the depiction, one can see the main functionality by which the relation “has_predicted_reaction_part” relation is constructed between entities that match the SPARQL query requirements. The “FILTER_NOT_EXISTS” functional term is used for the comparison of whether all reactants of the reference reaction can be found in the experimental reaction. Using a nested version is required for filtering, such that only complete matches are considered.

Table 1. Different notation formats for isoamyl acetate with increasing abstraction but also varying information level. Here, only notations found in PubChem [13] are presented.

Colloquial Naming	Isoamyl Acetate
IUPAC	3-methylbutyl acetate
SMILES	CC(C)CCOC(=O)C
InChI	InChI=1S/C7H14O2/c1-6(2)4-5-9-7(3)8/h6H,4-5H2,1-3H3
InChI Key	MLFHJEHSLIIPHL-UHFFFAOYSA-N

Table 2. Example of pattern matching with regular expressions and SMARTS notation. The matching part is printed in bold, while parts relevant for matching are marked with an underline.

Method	Pattern	Match
SMILES notation	CC(C)CCOC(=O)C	-
Regular expression	“OC(=O)C”	CC(C)CCOC(=O)C
SMARTS notation	C[OX2][CX3](=[OX1])[OX2]C	CC(C)CCOC(=O)C

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Borgelt, H.; Kitel, F.G.; Kockmann, N. Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management. Computers 2025, 14, 311. https://doi.org/10.3390/computers14080311

AMA Style

Borgelt H, Kitel FG, Kockmann N. Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management. Computers. 2025; 14(8):311. https://doi.org/10.3390/computers14080311

Chicago/Turabian Style

Borgelt, Hendrik, Frederick Gabriel Kitel, and Norbert Kockmann. 2025. "Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management" Computers 14, no. 8: 311. https://doi.org/10.3390/computers14080311

APA Style

Borgelt, H., Kitel, F. G., & Kockmann, N. (2025). Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management. Computers, 14(8), 311. https://doi.org/10.3390/computers14080311

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Ontology-Based Data Pipeline for Semantic Reaction Classification and Research Data Management

Abstract

1. Introduction

2. Materials and Methods

3. Results

3.1. Base Ontology Modifications

3.2. Data Extraction and Data Normalization

3.3. Semantic Enrichment via Structure Analysis

3.4. Automated Semantic Analysis and Reasoning

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI