A Knowledge Graph Framework for Dementia Research Data

Featured Application: Applying knowledge graphs, graph analytics, and graph machine learning for integrating multi-modal dementia research data.

Abstract: Dementia disease research encompasses diverse data modalities, including advanced imaging, deep phenotyping, and multi-omics analysis. However, integrating these disparate data sources has historically posed a significant challenge, obstructing the unification and comprehensive analysis of collected information. In recent years, knowledge graphs have emerged as a powerful tool to address such integration issues by enabling the consolidation of heterogeneous data sources into a structured, interconnected network of knowledge. In this context, we introduce DemKG, an open-source framework designed to facilitate the construction of a knowledge graph integrating dementia research data, comprising three core components: a KG-builder that integrates diverse domain ontologies and data annotations, an extensions ontology providing necessary terms tailored for dementia research, and a versatile transformation module for incorporating study data. In contrast with other current solutions, our framework provides a stable foundation by leveraging established ontologies and community standards and simplifies study data integration while delivering solid ontology design patterns, broadening its usability. Furthermore, the modular approach of its components enhances flexibility and scalability. We showcase how DemKG might aid and improve multi-modal data investigations through a series of proof-of-concept scenarios focused on relevant Alzheimer’s disease biomarkers.


Introduction
The dawn of "omics" technologies, accompanied by advancements in imaging, clinical data collection, laboratory testing, and phenotyping, has profoundly influenced biomedical research [1][2][3][4][5][6][7]. This multi-modal setting has provided an unprecedented, comprehensive view of complex biological systems, thereby inspiring a shift towards a more integrated understanding of diseases. However, the introduction of data from diverse modalities also presents unique challenges. Effectively integrating and interpreting the sheer volume, complexity, and diversity of data generated by these sources requires sophisticated computational tools. Moreover, the data, which are often distributed across various databases, publications, and repositories, pose considerable barriers to seamless data integration. Even more daunting is the task of transforming multi-modal data into clinically actionable insights, requiring the ability to connect data from molecular to clinical scales, a feat complicated by the enormous diversity and complexity of individual diseases. These hurdles highlight the need for innovative strategies and tools to harness the potential of multi-modal data in propelling the field of precision medicine.
Since biological reality is often modeled as a network or graph [8,9], one technological approach that has gained significant traction is the use of knowledge graphs (KGs) [10], which allow for the integration and organization of diverse biomedical data types, facilitating their analysis and interpretation.
After Google introduced the knowledge graph in 2012, highlighting the advantages of the approach [11], KGs have become increasingly popular, finding adoption in industry with subsequent launches by companies such as Microsoft, Amazon, Airbnb, and Facebook [12], as well as in academia [13,14]. Nonetheless, the definition of KGs can vary based on the application context. In biomedicine, they can be characterized as data structures meant to gather and disseminate real-world knowledge, where nodes depict significant biomedical entities and the edges delineate diverse relationships that could exist between these entities [15]. KGs embody a methodological transition toward a more comprehensive representation of reality, facilitating the integration of heterogeneous data types and providing an intuitive, graph-based structure for representing intricate relationships between diverse biomedical entities.
Constructing a KG entails a series of methodological and technological decisions that profoundly impact the utility and effectiveness of the resulting product. A pivotal consideration in this process is the selection of a graph paradigm, which provides the theoretical and practical foundation for the structure and function of the KG. There are two primary approaches in this regard: the Resource Description Framework (RDF) and Labeled Property Graphs (LPGs) [16][17][18]. Both approaches offer robust technological solutions, but each has its own strengths and weaknesses. While RDF offers standardization and robustness ideal for semantic applications, it may suffer from verbosity and computational inefficiency. Conversely, LPGs excel in their flexibility and intuitive structure, which allow for the straightforward representation of complex relationships and properties on both nodes and edges, but they may struggle in scenarios demanding high interoperability and standardization. Thus, the choice often hinges on the specific project requirements and constraints.
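To make the contrast between the two paradigms concrete, the following minimal Python sketch encodes one annotated relationship in both styles using plain data structures. It is only an illustration: the identifiers (ex:gene1, ex:disease1, the evidence property) are hypothetical placeholders, no RDF or graph database library is involved, and reification is just one of several ways to annotate a statement in RDF.

```python
# The same fact in both paradigms: a gene is associated with a disease,
# and the association itself carries an evidence annotation.
# All identifiers are hypothetical placeholders.

# RDF: every statement is a (subject, predicate, object) triple. Annotating
# the association requires extra triples, here via standard reification.
rdf_triples = [
    ("ex:gene1", "ex:associated_with", "ex:disease1"),
    ("ex:stmt1", "rdf:type", "rdf:Statement"),
    ("ex:stmt1", "rdf:subject", "ex:gene1"),
    ("ex:stmt1", "rdf:predicate", "ex:associated_with"),
    ("ex:stmt1", "rdf:object", "ex:disease1"),
    ("ex:stmt1", "ex:evidence", "literature"),
]

# LPG: nodes and edges hold key-value properties directly.
lpg_nodes = {
    "gene1": {"label": "Gene"},
    "disease1": {"label": "Disease"},
}
lpg_edges = [
    {
        "source": "gene1",
        "target": "disease1",
        "type": "ASSOCIATED_WITH",
        "properties": {"evidence": "literature"},
    },
]


def rdf_evidence(triples, subject, predicate, obj):
    """Find the evidence attached to a reified statement, if any."""
    described = {}
    for s, p, o in triples:
        described.setdefault(s, {})[p] = o
    for fields in described.values():
        if (
            fields.get("rdf:subject") == subject
            and fields.get("rdf:predicate") == predicate
            and fields.get("rdf:object") == obj
        ):
            return fields.get("ex:evidence")
    return None


def lpg_evidence(edges, source, edge_type, target):
    """Read the evidence property straight from the matching edge."""
    for edge in edges:
        if (edge["source"], edge["type"], edge["target"]) == (source, edge_type, target):
            return edge["properties"].get("evidence")
    return None


print(rdf_evidence(rdf_triples, "ex:gene1", "ex:associated_with", "ex:disease1"))
print(lpg_evidence(lpg_edges, "gene1", "ASSOCIATED_WITH", "disease1"))
```

Both lookups return the same annotation, but the RDF form needs six triples where the LPG form needs a single edge with a property, which illustrates the verbosity trade-off mentioned above.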
In addition to choosing a graph paradigm, selecting a data model or graph schema is another critical decision for building a KG. This model dictates how entities of interest and their relationships are represented within the KG. This aspect can be approached in two main ways: using an ad hoc data model tailored to the project's specific needs or adopting a standard model such as ontologies. In particular, biomedical ontologies have emerged as essential tools in standardizing terminology, modeling biological realities [19], supporting data annotation [20][21][22][23], and facilitating biomedical text mining [24,25]. With ongoing concerted efforts from the scientific community, these ontologies have evolved to incorporate fine-grained knowledge across various biomedical subdomains, as exemplified by initiatives such as the Open Biological and Biomedical Ontologies (OBO) Foundry [26] and the National Center for Biomedical Ontology (NCBO) [27] and its BioPortal [28]. Moreover, using logical modeling and annotation, biomedical ontologies make assertions that span and connect levels of biological organization, from the molecular level to phenotype and disease definitions. This ability to traverse and link multiple scales of biological information makes ontologies an invaluable resource for the construction of KGs for biomedical research.
The biomedical field is rich in open databases that offer scientific knowledge from various subdomains, including molecular biology (genomics, proteomics, and pathways), drugs, and disease characterization. These sources hold the potential for a more comprehensive understanding of biomedical phenomena; however, their value is often hindered by their dispersal across different platforms. KGs have emerged as instrumental tools for integrating and exploiting these disparate sources, fostering a multitude of projects that aim to unify the spread-out biomedical knowledge.
A prime example of such an initiative is the Monarch Initiative [29], which integrates genetic, phenotypic, and disease-related data to facilitate the identification of disease genes and variants. Similarly, the Clinical Knowledge Graph (CKG) [30] is an open-source platform that integrates proteomics, public databases, and literature. It effectively utilizes KGs to augment and enrich biomedical data, thereby facilitating informed clinical decision-making. Likewise, PrimeKG [31] is a multimodal KG that integrates a multitude of high-quality resources, representing various biological scales, i.e., from genotypes to clinical phenotypes. The scalable precision medicine open knowledge engine (SPOKE) [32] also integrates multiple biological data sources to provide structured knowledge ranging from low-level molecular biology to pharmacology and clinical practice. Furthermore, the KG-COVID-19 [33] project responded to the COVID-19 crisis by building a unified KG from disparate biomedical information about SARS-CoV-2, illustrating how KGs can effectively drive knowledge synthesis, particularly in emergent health situations.
As the number of available KGs increases, it has become evident that social and technical limitations exist, especially the need for standardization in entity naming and graph representation approaches [34,35]. Regarding modeling standardization, the Biolink Model [36] has emerged as a high-level data model that provides standard terms and relations for describing biological entities and their relationships for organizing data in biomedical KGs. Biolink serves both as a map for bringing together data from different sources under one unified model and as a bridge between ontological domains. As a similar initiative to OBO, centered around KGs, the KG-Hub project [37] provides a collection of tools and libraries for building interoperable KGs and a mechanism for sharing them to foster their reuse.
In addition to their ability to model and query data, graph analytics and graph machine learning techniques have made notable advancements [38,39], supported by open-source libraries such as GRAPE [40] and KGTK [41]. One technique particularly relevant in the biomedical domain is graph embedding [42][43][44][45][46][47], which allows us to capture complex graph structures into lower-dimension vectors. Exploiting these features to integrate specific patient data with large biomedical KGs has already shown promising results in deriving actionable clinical outputs, as evidenced by advancements in understanding diseases such as multiple sclerosis [42] and Alzheimer's disease [48]. Recent dementia research uses multi-modal data to understand the condition from various aspects, including genomics, transcriptomics, metabolites, imaging, and clinical features. Having a framework that enables the systematic construction and instantiation of research and clinical data in a standardized manner offers significant benefits.
This paper introduces DemKG, a KG framework designed specifically for dementia research needs. The framework leverages reference ontologies from OBO, standard KG technologies from KG-Hub, and instantiation tooling that transforms source data into the KG following sound design patterns within the ontological model. DemKG reuses most of its knowledge sources, provides specific terminological extensions to cover gaps identified in the scope of dementia, and ingests biological databases of interest, resulting in an integrative KG that covers the multiple data modalities involved in the research, including genomics, proteomics, imaging, fine-grained phenotyping, and clinical tests. Thanks to its design, DemKG is easily extensible, delivering means to customize and deploy in modern graph databases for enhanced data querying and retrieval. The expressive knowledge model supports advanced analytics through graph and network algorithms, which play an active role in the progression of research and in better patient care through the implementation of precision medicine.

Related Work
Advancements in storage and graph technologies, coupled with the increasing availability of open scientific data, have led to the emergence of multiple biological KGs [49]. Projects such as the Monarch Integrated Knowledge Graph, the Clinical Knowledge Graph (CKG), PrimeKG, and the scalable precision medicine open knowledge engine (SPOKE), presented in the Introduction, bear similarities to our initiative.
The Monarch Integrated Knowledge Graph [29] is a notable example of biological KGs, which assimilates various data types (including genotype, phenotype, and disease) from multiple sources into a unified semantic graph model. The Monarch KG has been instrumental in our project, DemKG, as it not only serves as a primary data source but also offers an array of tools we utilize. Our philosophy aligns closely with that of the Monarch KG, emphasizing a robust semantic foundation while integrating data from a variety of external sources, including other ontologies and extensions. We build upon this work to extend it with dementia-related knowledge and provide means for integrating study data.
CKG [30] is an open-source platform designed to harmonize a wide range of "omics" data types into a coherent structure, including genomics, transcriptomics, proteomics, and metabolomics. CKG favors a custom data model formed from a selected set of concepts and relationships from specific ontologies. On top of the KG, CKG integrates statistical and machine learning algorithms to streamline the analysis and interpretation of typical proteomics workflows. DemKG resonates with CKG's mission to improve the modeling and integration of omics data. However, it differs fundamentally in its approach to data modeling, since CKG employs a more circumscribed model.
PrimeKG [31] is a multimodal KG for precision medicine analyses. Like its counterparts, it integrates a plethora of resources to describe a broad spectrum of diseases with relationships across major biological scales. One feature that distinguishes it from other systems is that it combines the entire range of approved drugs with their therapeutic action. Moreover, unlike DemKG, PrimeKG employs a custom approach to its data model, incorporating ten types of nodes and thirty types of undirected edges extracted from reference ontologies. Furthermore, it lacks a systematic schema to integrate experimental and study data.
SPOKE [32] is a KG that connects information from 41 biological data sources, structured as 21 different node types and 55 edge types, ranging from low-level molecular biology to pharmacology and clinical practice. It uses 11 different ontologies to organize the data in a semantically meaningful way and, in its latest iteration, also integrates the Biolink model wherever practical. SPOKE is implemented as a Neo4j database built from a collection of Python scripts and provides a graphical user interface and a REST API for end-user access. Our method differs from SPOKE in several crucial aspects. Primarily, it offers an open toolkit for KG construction and personalization, ensuring independence from both the platform and the representational paradigm. Moreover, despite utilizing a comparable modeling approach, DemKG fosters a closer connection with a vast array of domain ontologies by preserving links to explicitly defined terms and relationships. Finally, our framework provides a flexible and robust module for research data integration.
In summary, our work distinguishes itself from similar efforts through a comprehensive approach that integrates a well-established terminological foundation and community standards, follows design patterns conducive to data integration, and defines terminological extensions specific to the dementia domain, facilitated through a dedicated low-code solution for seamless study data integration.

Terminological Foundation
In the construction of the knowledge graph, the initial and pivotal decision revolves around selecting an appropriate graph schema to provide a solid conceptual base that effectively captures data entities drawn from the array of biological subdomains pertinent to dementia research. This choice presents a dichotomy: one option involves creating a flexible, ad hoc schema tailored to the identified needs, while the alternative entails adopting a more structured strategy that employs standard terminologies and ontologies. Our methodology aligns with the latter approach, and a fundamental design principle in the construction of our KG is the utilization of domain reference ontologies to ensure the following:
1. The concept definitions are concise, accurate, and relevant;
2. There exists an active community keeping the ontology updated;
3. They are widely recognized, cross-referenced, and follow consistent design patterns.
The criteria set forth are congruent with the guiding principles of the OBO Foundry. OBO endorses an extensive range of domain-specific ontologies that are distinguished by well-demarcated scopes, the reutilization of concepts across ontologies, and alignment with a unified upper-level model, specifically the Basic Formal Ontology (BFO) [50], and relations are defined in the Relations Ontology (RO). Given these attributes, we gave preferential consideration to OBO ontologies during our selection process.
As the KG must cater to a variety of domains, adopting this approach enables us to concentrate mainly on integration and only define new terms when detecting a gap. Some notable examples of the employed OBO ontologies include the Gene Ontology [51,52], Chemical Entities of Biological Interest (CHEBI) [53], and Protein Ontology (PR) [54] for the genetic and molecular domain. For the phenotype and disease domain, we utilize the Monarch Disease Ontology (MONDO) [55], Human Phenotype Ontology (HP) [56,57], and Phenotype And Trait Ontology (PATO) [58]. In the area of anatomy, we incorporate the Uber-Anatomy Ontology (UBERON) [59,60] and the Foundational Model of Anatomy (FMA) [61]. For neuropsychological tests and their relations, we include the Neuropsychological Testing Ontology (NPT) [62] and the Neurocognitive Integrated Ontology (NIO) [63]. For modeling experimental settings, the Ontology for Biomedical Investigations (OBI) [64,65] plays a central role.
These ontologies provide a significant level of detail, and reusing or referencing concepts between them expands the knowledge network, facilitating the exploitation of multi-domain and multi-level relations. For example, this interconnectedness simplifies navigation from HP phenotypes referenced in a disease definition in MONDO to specific genes in GO, proteins in PR, and molecular entities in CHEBI. Furthermore, we also include relevant Monarch data and annotation ingestions; specifically, gene and gene-phenotype annotations, filtered protein-protein interactions from the STRING database [66], and pathway knowledge from the Reactome pathway knowledgebase [67]. The complete list of knowledge sources and annotations is given in Table 1.
While the standardization offered by domain ontologies is undoubtedly a strength, it can also impose limitations due to the inherent trade-off with flexibility. This high level of detail can complicate the integration of non-OBO ontologies and external datasets. Additionally, querying the graph requires a comprehensive understanding of the underlying model. We employ the Biolink model as our high-level data model to mitigate these issues. Biolink offers a means to utilize higher-level concepts from its "category" hierarchy while still allowing references to more specific ontology terms. The same versatility is available for relationships through the use of the "related_to" hierarchy, thus providing a balance between standardization and flexibility in our knowledge graph.
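The role of the category hierarchy can be illustrated with a short Python sketch that selects nodes at different levels of generality. The hierarchy below is a simplified, hand-written excerpt assumed for illustration rather than loaded from a Biolink release, and the node identifiers are hypothetical placeholders.

```python
# Simplified, hand-written excerpt of a category hierarchy (child -> parent).
# It is an assumption for illustration, not loaded from the Biolink release.
CATEGORY_PARENT = {
    "biolink:Disease": "biolink:DiseaseOrPhenotypicFeature",
    "biolink:PhenotypicFeature": "biolink:DiseaseOrPhenotypicFeature",
    "biolink:DiseaseOrPhenotypicFeature": "biolink:BiologicalEntity",
    "biolink:Gene": "biolink:BiologicalEntity",
    "biolink:BiologicalEntity": "biolink:NamedThing",
}

# Hypothetical nodes: each keeps its specific ontology term as the id and
# one category as a coarse type.
NODES = [
    {"id": "EX:disease1", "category": "biolink:Disease"},
    {"id": "EX:phenotype1", "category": "biolink:PhenotypicFeature"},
    {"id": "EX:gene1", "category": "biolink:Gene"},
]


def category_lineage(category):
    """Return the category followed by all of its ancestors."""
    lineage = [category]
    while lineage[-1] in CATEGORY_PARENT:
        lineage.append(CATEGORY_PARENT[lineage[-1]])
    return lineage


def nodes_in_category(nodes, category):
    """Select ids of nodes whose category equals or descends from `category`."""
    return [n["id"] for n in nodes if category in category_lineage(n["category"])]


print(nodes_in_category(NODES, "biolink:DiseaseOrPhenotypicFeature"))
print(nodes_in_category(NODES, "biolink:NamedThing"))
```

A query phrased against the broader category retrieves both the disease and the phenotype node without naming their source ontologies, while the node ids still point to the specific terms.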

Terminological Extensions
OBO covers most of the conceptualization needs, but gaps relevant to the implementation remain. To overcome this issue, we implement an application ontology that is also one of the inputs of the merging process. The primary interventions relate to phenotypic normality, as well as to the necessary assay and platform definitions missing from OBI.
HP and MONDO thoroughly model disease states, conditions, and abnormal phenotypes, leaving out any reference to normal counterparts. To allow the categorization of instances of normal/healthy cases, we introduced a "Phenotypic normality" hierarchy. This new hierarchy is modeled as a sibling branch of the HP "Phenotypic abnormality", mirroring its hierarchy to allocate the "normality" concepts of interest.
In dementia research, the utilization of neuropsychological assessments such as the Mini-Mental State Examination (MMSE) [75], the Consortium to Establish a Registry for Alzheimer's Disease (CERAD) wordlist memory test (WLT) [76], Visual Object and Space Perception (VOSP) battery [77], Trail Making Test (TMT) [78], Clock Drawing Test [79], and Controlled Oral Word Association Test (COWAT-FAS) [80] is instrumental in quantifying cognitive function domains and tracking disease progression. We have implemented the necessary concepts to cover CERAD, VOSP, and COWAT-FAS tests, with the primary classes allocated under the "cognitive function assay" branch of NPT, while also relating to the mental and cognitive functions they assess.
The AT(N) classification system [81] is another tool of great importance for assessing the subject's biological state and understanding the intricate relationships between key biomarkers and their impact on disease evolution. AT(N) categorizes biomarkers according to their role in the disease progression, namely, Beta-amyloid deposition (A), pathologic tau (T), and neurodegeneration (N). Within each biomarker category, values can be positive or negative (+/−), derived from defined normal or abnormal cut points, resulting in eight distinct AT(N) "biomarker profiles" (Table 2). To provide proper terminological coverage, we have defined new classes for each biomarker profile, as well as phenotype terms for abnormal CSF concentrations of phosphorylated tau (P-tau) and total tau (T-tau), which are missing from HP. Each biomarker profile is defined under the "value specification" class from OBI, with asserted logical axioms to associate them with the specific phenotype.
Table 2. AT(N) biomarker profiles and categories as defined by the NIA-AA Research Framework. Each biomarker profile is modeled as a descendant of the "value specification" class defined in OBI.
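The count of eight profiles follows from three binary categories (2 × 2 × 2). The short Python sketch below enumerates them; the function name and the label format are illustrative assumptions, and the booleans stand in for the outcome of comparing a biomarker value against its cut point, which is not modeled here.

```python
from itertools import product


def atn_profile(a_abnormal, t_abnormal, n_abnormal):
    """Build a profile label from three booleans (True means abnormal)."""
    def sign(flag):
        return "+" if flag else "-"
    return f"A{sign(a_abnormal)}T{sign(t_abnormal)}(N){sign(n_abnormal)}"


# Enumerate every combination of the three binary categories.
ALL_PROFILES = [atn_profile(a, t, n) for a, t, n in product([False, True], repeat=3)]

print(len(ALL_PROFILES))
print(ALL_PROFILES)
```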

Technical Implementation
The implementation consists of three main software pieces covering different parts of the KG generation, integrated into a building pipeline: the extensions ontology builder, the KG-builder, and the data transformer module. To maximize effectiveness and reproducibility, in all three sub-projects, we employ state-of-the-art ontology and graph tooling maintained by the community and relevant projects such as Monarch and the "universal biomedical data translator" from the National Center for Advancing Translational Sciences (NCATS) [82].
The extensions ontology builder produces an OWL ontology using the Ontology Development Kit (ODK) v1.4.1 [83] as the building framework. The ODK provides a preconfigured, standardized environment with a set of tools that support all stages of the ontology lifecycle (creation and editing, building, testing, and releasing with version control) and ensures a systematic approach to ontology maintenance. When possible, we define new classes that follow a pattern using the Dead Simple OWL Design Patterns (DOS-DP) v0.1.10 [84], reducing manual editing and consequently reducing errors and improving reproducibility. All the axioms are kept within the OWL 2 [85] DL profile.
The KG-builder is responsible for obtaining the different sources of knowledge and merging them into the terminological KG. Built upon the KG-Hub tooling ecosystem, the main configuration inputs are the merge and download YAML descriptor files, guiding the download and merge steps. When available, the ontologies are downloaded from the KG-Hub repository [86]. OBO ontologies are already maintained as Biolink-compliant graphs in the Knowledge eXchange Format (KGX) [87] in the KG-OBO project [88] and are directly merged from each specific release artifact. The merging step includes all downloaded sources and the extensions ontology to obtain a final KGX graph.
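The idea behind the merging step can be sketched in a few lines of Python. This is a conceptual illustration only and does not reproduce the KGX implementation or its configuration schema: graphs are assumed to be in-memory dictionaries whose nodes carry an id field and whose edges carry subject, predicate, and object fields, and the conflict rule (first source wins) and all identifiers are assumptions.

```python
def merge_graphs(graphs):
    """Merge graphs given as {"nodes": [...], "edges": [...]} dictionaries.

    Nodes are keyed by "id"; edges by (subject, predicate, object).
    When two sources disagree on a node property, the first source wins.
    """
    nodes = {}
    edges = {}
    for graph in graphs:
        for node in graph["nodes"]:
            merged = nodes.setdefault(node["id"], {})
            for key, value in node.items():
                merged.setdefault(key, value)
        for edge in graph["edges"]:
            key = (edge["subject"], edge["predicate"], edge["object"])
            edges.setdefault(key, dict(edge))
    return {"nodes": list(nodes.values()), "edges": list(edges.values())}


# Two hypothetical sources that overlap on one node and one edge.
ontology_graph = {
    "nodes": [
        {"id": "EX:0001", "category": "biolink:Disease", "name": "disease one"},
        {"id": "EX:0002", "category": "biolink:PhenotypicFeature"},
    ],
    "edges": [
        {"subject": "EX:0001", "predicate": "biolink:has_phenotype", "object": "EX:0002"},
    ],
}
extensions_graph = {
    "nodes": [
        {"id": "EX:0002", "name": "phenotype two"},
        {"id": "EX:0003", "category": "biolink:PhenotypicFeature", "name": "phenotype three"},
    ],
    "edges": [
        {"subject": "EX:0001", "predicate": "biolink:has_phenotype", "object": "EX:0002"},
        {"subject": "EX:0003", "predicate": "biolink:subclass_of", "object": "EX:0002"},
    ],
}

merged = merge_graphs([ontology_graph, extensions_graph])
print(len(merged["nodes"]), len(merged["edges"]))
```

In this toy input, the shared node EX:0002 ends up with the category from the first source and the name from the second, and the duplicated edge is kept once.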
One challenge when converting OWL ontologies into a graph structure lies in the difficulty of accessing class relationships established through subclass and class equivalence axioms. These assertions hold significant value in capturing the biomedical knowledge outlined in the comprehensive OBO ontologies. To address this situation, both the ontology and builder modules materialize class equivalence axioms. In the context of the extensions ontology, we utilize the relation-graph [89] library during the later stages of the construction process. In the case of OBO ontologies, the KG-builder retrieves a subset of links from the materialization output within Ubergraph [90], which also employs relation-graph.
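What materialization means in practice can be shown with a deliberately simplified Python sketch. It is not the relation-graph algorithm: it assumes an acyclic subclass hierarchy, handles only axioms of the form "class subClassOf (relation some filler)", and ignores equivalence axioms, property hierarchies, and property chains. All class and relation names are hypothetical.

```python
def ancestors_of(cls, subclass_of):
    """Collect all strict superclasses of `cls` by walking parent links."""
    seen = set()
    stack = list(subclass_of.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, ()))
    return seen


def materialize(subclass_of, restrictions):
    """Turn implicit class relations into explicit (class, relation, class) edges.

    `restrictions` holds (holder, relation, filler) tuples, read as
    "holder subClassOf (relation some filler)". An edge is inferred for every
    class at or below the holder and every class at or above the filler.
    """
    classes = set(subclass_of)
    for parents in subclass_of.values():
        classes.update(parents)
    for holder, _, filler in restrictions:
        classes.update((holder, filler))

    inferred = set()
    for cls in classes:
        uppers = {cls} | ancestors_of(cls, subclass_of)
        for holder, relation, filler in restrictions:
            if holder in uppers:
                for target in {filler} | ancestors_of(filler, subclass_of):
                    inferred.add((cls, relation, target))
    return inferred


# Hypothetical mini-ontology.
SUBCLASS_OF = {
    "ex:EarlyOnsetDiseaseX": ["ex:DiseaseX"],
    "ex:Hippocampus": ["ex:BrainRegion"],
}
RESTRICTIONS = {("ex:DiseaseX", "ex:has_location", "ex:Hippocampus")}

for edge in sorted(materialize(SUBCLASS_OF, RESTRICTIONS)):
    print(edge)
```

Only one relation is asserted in the input, yet four edges result, including the link from the disease subtype to the broader anatomical class. Such edges can then be traversed directly in a graph database without reasoning at query time.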
The transformer module is a Python solution that provides an accessible approach to generating graph data in KGX format from tabular source input. This module adopts a YAML-based transform definition schema, mirroring the approach of other tools in the pipeline. This schema adheres to a standardized structure wherein users can define mappings from columns to specific classes paired with various instantiation design patterns. The schema effectively models common research entities, including medical history, physical examination, and measurement assays, all aligned with dedicated instantiation patterns that are further elaborated upon in the subsequent subsection.
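A rough impression of a descriptor-driven transformation is given by the Python sketch below. It does not reproduce the DemKG transform schema: the descriptor keys, the pattern name, the column names, and the class identifier are all hypothetical, the descriptor is written as a dictionary where the real tool reads YAML, and the output is flattened to a single direct edge instead of the richer OBI-based patterns described in the next subsection.

```python
import csv
import io

# Hypothetical descriptor, written as a dict where the real tool reads YAML.
# Key names, pattern name, and class ids are all illustrative.
DESCRIPTOR = {
    "subject_column": "subject_id",
    "columns": {
        "dx_code": {
            "pattern": "phenotype_observation",  # recorded, not interpreted here
            "predicate": "biolink:has_phenotype",
            "value_map": {"1": "EX:0000001"},  # source code -> ontology class
        },
    },
}

SOURCE_CSV = """subject_id,dx_code
S1,1
S2,0
"""


def transform(csv_text, descriptor):
    """Turn tabular rows into KGX-style node and edge dictionaries."""
    nodes, edges = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject_id = f"EX:{row[descriptor['subject_column']]}"
        nodes[subject_id] = {"id": subject_id, "category": "biolink:Case"}
        for column, rule in descriptor["columns"].items():
            target = rule["value_map"].get(row[column])
            if target is None:
                continue  # unmapped codes produce no edge
            nodes.setdefault(target, {"id": target})
            edges.append(
                {"subject": subject_id, "predicate": rule["predicate"], "object": target}
            )
    return list(nodes.values()), edges


nodes, edges = transform(SOURCE_CSV, DESCRIPTOR)
print(nodes)
print(edges)
```

Keeping the mapping in a declarative descriptor means that adding a new source column requires a configuration change only, which is the property that makes such an approach low-code.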
The builder pipeline integrates all steps and can be configured to generate two artifacts: solely the terminological graph or the terminological graph with data instantiation.

Data Transformation Design Patterns
One of the aims of the KG is to integrate raw research data to enable explicit connections with knowledge concepts. We propose a set of design patterns to support the data instantiation of patient/subject study visits, phenotype observations arising from these visits, measurements/analyses derived from samples collected from different specimens, and neuropsychological test results. In all these patterns, OBI is the central ontology employed to relate clinical and research concepts with specific entities of the biomedical domain. Figures 1-3 illustrate the main patterns through simplified concept maps, depicting the main ontology classes and properties involved, identified with a pseudo-CURIE of the format PREFIX:"class label", where PREFIX is the OBO ontology prefix.
The first pattern models the relations between study protocol/visit encounters, the agents involved, and the resulting outputs. The pattern mainly utilizes concepts defined in the Neurodegenerative Disease Data Ontology (NDDO) [91] (integrated in NIO) and the Ontology for General Medical Science (OGMS). The pattern supports a proper logical definition of longitudinal protocols, common in dementia research studies.
Clinical history phenotypes are characterized through observations at a study visit or from existing records. The framework leverages a pattern that relates visits with specific clinical administration, the finding, and the observed phenotype, usually a phenotype or disease concept from MONDO or HP. Relevant metadata can also be linked to the OGMS clinical entities, such as dates, agents involved, and locations. This pattern is shared across medical history, physical examination, and diagnosis processes. Figure 1 illustrates both the visits and clinical patterns.
A critical component of research data encompasses various assay measurements and proteomic datasets. We employ OBI's assay design patterns [92] to capture the multiple aspects involved in this process. These patterns enable the comprehensive integration of data pertaining to the assay, the specimen, and the molecule or material under examination, such as a protein or leukocyte count. Several relevant ontologies, including GO, PR, and Cell Ontology (CL), supply the necessary terminologies. We leverage entities from UBERON to denote the anatomical origin of the sample. This pattern facilitates the preservation of crucial metadata about processes, encompassing information about the type of assay, the specimen or sample employed, experimental conditions such as freeze-thaw cycles, and the date and time of collection. Such metadata is of considerable value for resource management and can significantly aid research analyses. For instance, the type of tube in which a sample was collected could influence assay results and should be accounted for in linear models. Overall, it provides a more comprehensive context of the conditions under which experiments are conducted, enhancing the reproducibility and reliability of experimental outcomes.
Analyses derived from neuroimaging techniques, including segmentation measurements from tools such as Freesurfer [93] and Automatic Segmentation of Hippocampal Subfields (ASHS) [94], along with white matter evaluations from Diffusion Tensor Imaging (DTI) [95] and peak width of skeletonized mean diffusivity (PSMD) [96], play an indispensable role in dementia research. The pattern supporting this data modality follows a similar approach to the previous one, as illustrated in Figure 2. To associate the measured anatomical entities, we utilize the FMA, which offers precise terms to align with the parcellation regions delineated by the widely used brain atlases in segmentation software, particularly for hemisphere-specific terms. More general terms from UBERON can be obtained using the "xref" property, employed for mapping concepts between different ontologies.
The last design pattern focuses on effectively relating the information content of a given test with the cognitive domain, providing means by which to stratify subjects via cognitive staging and the specific domain or phenotypic abnormality from HP at query time. This pattern exploits the axioms that connect cognitive tests with the evaluated domains.

Results
We have developed a KG framework that harmonizes biomedical knowledge and evidence from various sources, coupled with a transformation module designed to streamline the integration of multi-modal and omics data in dementia research. The core components of the framework encompass the extensions ontology builder, which provides ontological definitions to fill identified gaps from the domain ontologies; the KG-builder, in charge of obtaining, merging, and producing the KG; and the data transformer module, a low-code interface to transform source study data. All components are publicly accessible on GitHub (https://github.com/demkg-framework/, accessed on 30 August 2023). This trio of tools forms an intuitive building pipeline and also offers flexibility for customization, enabling users to construct the graph from scratch, adapt it to specific requirements, and deploy it on their preferred platform and graph database.
The backbone of our implementation is rooted in established community standards, technologies, and methodologies. The initial step involved the selection of a comprehensive array of domain reference biomedical ontologies, primarily from OBO, to form an expressive knowledge model for our primary KG. These ontologies offer a variety of well-defined concepts across varying levels of granularity, encapsulating intricate details of biological reality in the form of hierarchical relationships and concept networks.
To facilitate a consistent term mapping across various ontologies and mitigate computational demands, we utilized pre-built KGs from the KG-Hub initiative and the KG-OBO subset as our foundation, employing the KGX tool for the merging phase of the KG-builder pipeline. The KG-Hub initiative utilizes the Biolink model as its high-level data model, which we adopted to introduce greater flexibility and provide a comprehensive yet adaptable terminology overlay on the ontological model. The Biolink model facilitated the creation of both relaxed and detailed modeling and query capabilities, thereby enhancing the standardization and flexibility of our model. The default KG consists of 1.5 M nodes and 11.5 M edges.
To fill the identified gaps in the foundational model, we developed specific terminological extensions through the extensions ontology. We employed ODK to systematically introduce new terms, leveraging the OBO ecosystem to import and extend relevant external terms using DOS-DP whenever feasible.
Finally, the transformation module provides a low-code solution to transform tabular source data and generate necessary instance nodes and edges by following specific design patterns that effectively depict study visits, phenotype observations, measurements/analyses derived from samples, and neuropsychological test results. These design patterns promote efficient data instantiation under the ontological model of the source research data, interconnecting various aspects of the study design outputs and providing a robust platform for data querying and network-oriented analyses. Figure 4 shows an overview of the framework components.

Use Case: Graph-Enabled Phenotype, Flow, and Protein Exploration from AT(N) Biomarker Profiles
To validate the DemKG framework, we applied it to the Dementia Disease Initiation (DDI) study data, a multi-site longitudinal observational study aimed at identifying early biomarkers for patients at risk of developing dementia [97]. The DDI dataset encompasses a range of clinical items, including medical history, standardized physical, neurological, and cognitive examinations, as well as laboratory and proteomic assays derived from blood and cerebrospinal fluid (CSF) samples, MRI, FDG-PET, and amyloid PET imaging, along with genomic analyses. We integrated these diverse data modalities and explored various aspects of the key biomarkers of the AD continuum, as categorized by the AT(N) classification.

Experimental Setup
The central DDI data platform is the XNAT archiving system [98], which is complemented by tailored customizations and data export functionalities, including automatic biomarker-based AT(N) classification and population-adjusted norming for pertinent screening tests such as CERAD [99,100], VOSP [77], and TMT [78,101]. We implemented the transformation descriptor for the DDI data, involving direct mappings from clinical codes and rules to translate assay and experiment results into specific phenotype and disease entities. We then fed the descriptor, along with the aggregated comma-separated values (CSV) dump from XNAT, to the transformation module to obtain the graph representation.
The DDI cohort graph comprises 96,939 nodes and 362,824 edges, whereas an average subject subgraph with four visits has 3469 nodes and 8284 edges. This transformed graph was merged into the final DDI-KG, which we ingested using the KGX module into a Neo4j Community instance deployed in a Podman container configured with eight cores and 16 GB of RAM, running on the secured servers of the TSD (Tjeneste for Sensitive Data) facilities managed by the University of Oslo. We opted for Neo4j due to its widespread adoption, the capabilities of its Cypher query language, and its reliable performance. Furthermore, KGX automatically creates node indices and constraints to improve loading and query performance for this platform.
Taking advantage of these features, the setup proves efficient with the resultant graph model, particularly for queries with clearly defined traversals and designated node labels. Figure 5 offers a preliminary analysis for estimating query performance, tracing the time consumed in navigating paths that extend from one to ten hops from subject nodes to various relevant node types in the graph. As anticipated, the number of target nodes considerably affects query performance, primarily driven by the increased number of edges to evaluate and traverse, coupled with the augmented data volume to handle. This scenario is especially pronounced in the most populated and interconnected node types, namely, proteins, genes, and diseases. Therefore, queries involving numerous or unrestricted quantities of such nodes require thoughtful design.
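A benchmark of this kind can be sketched as follows. This is an illustrative outline under stated assumptions, not the queries used for Figure 5: the node labels (`biolink:Case` for subjects, `biolink:Protein` for targets), the returned aggregate, and the helper names are hypothetical. The query runner is passed in as a plain callable so that the sketch does not depend on a particular database driver.

```python
import time
from statistics import mean

# Assumed label for subject nodes; the label used in DDI-KG may differ.
SUBJECT_LABEL = "biolink:Case"


def traversal_query(target_label, max_hops):
    """Build a variable-length Cypher query from subjects to a target label."""
    if not 1 <= max_hops <= 10:
        raise ValueError("max_hops must be between 1 and 10")
    # Backticks are required because the labels contain a colon.
    return (
        f"MATCH (s:`{SUBJECT_LABEL}`)-[*1..{max_hops}]-(t:`{target_label}`) "
        "RETURN count(DISTINCT t) AS targets"
    )


def mean_runtime(run_query, query, repeats=10):
    """Average wall-clock time of `run_query(query)` over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in runner; replace with a function that sends the query to Neo4j.
    def fake_runner(query):
        return None

    for hops in range(1, 11):
        cypher = traversal_query("biolink:Protein", hops)
        print(hops, f"{mean_runtime(fake_runner, cypher):.6f}s")
```

Bounding the hop count and fixing both end labels keeps the search space explicit, which matches the observation that unrestricted traversals into densely connected node types need careful design.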

Figure 5. Mean execution times over ten runs for variable-length traversal queries between 1 and 10 connections, navigating from subject nodes to key Biolink categories.

Experimental Results
A key objective of the DDI study is to comprehend the evolution of subjects across different disease states within the biological reality, and the AT(N) classification system is a pivotal reference point. The developed design patterns facilitate connections at various levels, enabling the exploration of individual and group trajectories across visits and expediting the retrieval of relevant phenotypes using graph queries (Figure 6). Using the AT(N) entities defined in the extensions ontology, we queried the graph database to investigate the flow between the different biomarker profiles. This exploration helped unravel the transitions between them at the cohort level, aiding in data filtering for parallel research endeavors. Moreover, presented visually (Figure 7), the outcomes of these queries proved instrumental in quality control efforts by highlighting unlikely transitions from pathological to normal states. Such checks are vital since AT(N) profiles derive from biomarker measurements, where unexpected transitions may result from issues or errors in the respective assays.

As shown in Figure 7b, one of the valuable attributes of KGs that incorporate domain ontologies is richer semantic querying. Leveraging the hierarchical structure within phenotype and disease ontologies, we exploited semantic querying to gather phenotypes spanning different domains and visualized their prevalence across the AT(N) profiles. As depicted in Figure 8, we focused on phenotypes extracted from the "Abnormality of higher mental function" class within the HP ontology. Phenotypes related to memory, language, and executive function were referenced based on the rules established for the norming items in the cognitive screening section of the dataset descriptor.

To capture complex graph structures in a low-dimensional vector space, we used the GRAPE library to create node embeddings using the node2vec algorithm [102] with SkipGram [103] and applied them to evaluate various aspects of the AT(N) biomarkers.
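Returning to the profile transitions discussed above (Figure 7), the counting step can be sketched in a few lines once each subject's profiles are available in visit order. This is a simplified illustration rather than the Cypher query of Figure 7b: the example data are invented, the helper names are hypothetical, and treating any move from a non-normal profile to the all-normal profile `A-T-(N)-` as "unlikely" is an assumption made for this sketch.

```python
from collections import Counter

# Invented example: AT(N) profiles per subject, listed in visit order.
EXAMPLE_PROFILES = {
    "S1": ["A-T-(N)-", "A+T-(N)-", "A+T+(N)-"],
    "S2": ["A+T-(N)-", "A-T-(N)-"],
    "S3": ["A-T-(N)-", "A-T-(N)-"],
}


def count_transitions(profiles_by_subject):
    """Count (previous, next) profile pairs across consecutive visits."""
    counts = Counter()
    for profiles in profiles_by_subject.values():
        for previous, following in zip(profiles, profiles[1:]):
            counts[(previous, following)] += 1
    return counts


def unlikely_transitions(counts, normal="A-T-(N)-"):
    """Select transitions that return to the all-normal profile.

    Simplifying assumption: only a move from any other profile back to
    `normal` is flagged for quality control.
    """
    return {
        pair: n
        for pair, n in counts.items()
        if pair[0] != normal and pair[1] == normal
    }


if __name__ == "__main__":
    transition_counts = count_transitions(EXAMPLE_PROFILES)
    for pair, n in sorted(transition_counts.items()):
        print(pair, n)
    print("flagged:", unlikely_transitions(transition_counts))
```

The resulting pair counts are the kind of input a Sankey diagram needs, and the flagged subset points to the visits whose underlying assay values are worth re-checking.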
We conducted an experiment to investigate whether the embeddings of subject visits showed any patterns in the low-dimensional space or were influenced by specific AT(N) profiles. Using t-SNE [104] to reduce the embeddings to two dimensions, we observed a clear tendency for visits with tau pathology to group together in the embedding space, suggesting shared characteristics among the phenotypes assessed in those visits. The visit node embeddings are visualized in Figure 9, accompanied by a decision boundary computed through a logistic regression model.

Lastly, we combined the graph query capabilities, node embeddings, and topological metrics to obtain a broader overview of the relationships between assay proteins and the AT(N) protein biomarkers, with the aim of assisting decision-making processes that could steer future analyses. Since the graph provides explicit links between available assays and the analytes being evaluated, we gathered CSF-derived ELISA and proteomics target proteins for comparison, focusing on the shared network encompassing GO biological processes (BPs).
For assessing protein relationships, we employed a simple pair-wise cosine similarity measure. This allowed us to quickly gauge how closely protein nodes were related and then rank the proteins that were most closely associated with the AT(N) panel (Figure 10). To examine shared BPs between AT(N) and the assessed proteins, we employed a graph query to obtain the extensive network of protein activities. Given that proteins participate in thousands of such processes, to enhance navigability, we used GRAPE to calculate node betweenness and closeness centrality metrics, utilizing them as indicators of node relevance for prioritizing and narrowing down the pool of BPs to be investigated. A snapshot of this process is depicted in Figure 11.
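The two measures described above can be illustrated with a small, dependency-free sketch. It does not use GRAPE and does not reproduce the reported analysis: the node names and vectors are placeholders, aggregating by the mean similarity to the panel is an assumption, and the closeness definition (reachable nodes divided by the sum of shortest-path distances on an undirected graph) may differ from the normalization applied by GRAPE.

```python
import math
from collections import deque


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(embeddings, panel, candidates):
    """Rank candidates by mean cosine similarity to the panel nodes."""
    scores = []
    for name in candidates:
        sims = [cosine_similarity(embeddings[name], embeddings[p]) for p in panel]
        scores.append((name, sum(sims) / len(sims)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


def closeness_centrality(adjacency):
    """Closeness per node: reachable nodes / sum of BFS distances."""
    result = {}
    for source in adjacency:
        distances = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        total = sum(distances.values())
        reachable = len(distances) - 1  # exclude the source itself
        result[source] = reachable / total if total > 0 else 0.0
    return result


if __name__ == "__main__":
    # Placeholder embeddings and a toy process network.
    vectors = {"panel_1": [1.0, 0.0], "prot_a": [0.9, 0.1], "prot_b": [0.0, 1.0]}
    print(rank_by_similarity(vectors, ["panel_1"], ["prot_a", "prot_b"]))
    toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(closeness_centrality(toy_graph))
```

Ranking by similarity gives a quick shortlist of candidate proteins, while centrality scores offer one way to decide which of the many shared processes to inspect first.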

Discussion
In our work, we introduce DemKG, a KG framework designed to integrate various ontologies and knowledge sources to focus on dementia research data. This framework aims to cover terminological and design needs for multi-modal and omics data, with additional terminological extensions developed when necessary. We also followed specific patterns to cater to typical dementia research data outputs.
A key advantage of DemKG is its flexibility and ease of extension or customization to adapt to particular needs, made possible by the generalizable and pattern-based technologies employed in different components of the framework. Another relevant feature of DemKG is the friendly interface of the transformation module, which lowers the technical barrier to effectively integrating study research data in the KG.
However, there exists an important limitation in its implementation: once built, the KG does not support modifications without risking underlying integrity, forcing a complete build and possibly ingestion when new versions become available.This limitation, a consequence of using KGX as the backbone for merging and building operations, may ultimately limit projects with streamed or on-demand data ingestion needs.
Nevertheless, our implementations remain open-source, primarily based on open knowledge sources, and the building pipelines employ systematic approaches with templating engines that are easily customizable. While our focus is dementia research, the broad biomedical ontologies forming the foundation of our terminological model make our KG applicable to other biomedical research datasets as well. Thus, the broader implications of our work extend beyond the application of the KG. Large biomedical KGs are proving to be an excellent tool for biomedical research, especially in domains requiring knowledge across different fields. The capacity to integrate disparate data and knowledge opens up opportunities for insights that were previously challenging to achieve. Approaches such as Precision Medicine greatly benefit from the implementation of KGs in their workflow.
This benefit is especially pronounced in dementia research, where the number of newly discovered biomarkers, phenotypes, and life conditions rapidly increases.These elements become part of the knowledge base that can be applied to the patient's biological signature.In this context, a KG like ours can play a crucial role in advancing our understanding of dementia and potentially informing patient care strategies.

Conclusions
In conclusion, DemKG presents a flexible and integrative approach to handling the ever-increasing complexity and multi-modality of dementia research data by leveraging the representational and relational capabilities of KGs.
The DemKG framework offers several distinct advantages over other solutions currently available. First, it is constructed based on well-established ontologies and adheres to recognized community standards, guaranteeing a solid and interoperable foundation. This is further enhanced by ontological extensions specifically crafted to facilitate detailed dementia research data analysis, filling a critical gap in the existing frameworks.
In addition to the above, DemKG integrates a low-code transformer module, simplifying the integration of study data and making the framework accessible to researchers with various levels of expertise.This module significantly reduces the time and technical know-how needed to merge study data, streamlining the data integration process considerably when compared to other solutions.
Furthermore, DemKG employs tooling to generate knowledge graphs in the platform-agnostic KGX format. This approach allows for easy deployment in a platform of the user's choice, offering flexibility in how and where the data can be used, and ensuring that the framework is adaptable to existing systems and future technological advancements. Enhancing its flexibility, the framework offers an open-source and customizable design, facilitating easy adoption and adaptation not only for dementia research but also potentially extending its utility to research into other diseases.
While there are limitations to the support for post-build modifications in its current iteration, addressing these in future work could broaden its applicability further. Despite these challenges, DemKG and similar KGs hold significant potential for propelling biomedical research and patient care advancements, extending from dementia to other medical conditions.

Figure 1. Concept map of the visits (light blue) and clinical (orange) design patterns, depicting the main ontology classes employed to model data entities.

Figure 2. Concept map of the experimental measurements design pattern.

Figure 3. Exemplification of the neuropsychological test design pattern, through a CERAD recall test.

Figure 4. Overview of the DemKG framework components.

Figure 6. A DDI subject subgraph that illustrates study visits and associated phenotypes, visualized with Neo4j Bloom and further edited for readability. (a) An overview of longitudinal visits. Subjects are connected to each visit via the "biolink:participates_in" predicate. The logical sequencing of visits is established through the "biolink:precedes" predicate, facilitating query traversal. Clinical entity nodes represent associated medical processes (medical history, cognitive screenings, lab assays, and more), serving as the source of observations and conclusions while also supplying context and metadata for encounters and experimental setups. These nodes link to phenotype and disease entities to depict the outcomes of the clinical/research processes. (b) A specific visit branch tracing the path from the individual subject to the evaluated phenotypes and diseases noted during a medical history recording. Additional data from clinical entities are omitted to maintain clarity and uphold subject privacy.

Figure 7. Graph-based analysis illustrating the transitional flow among AT(N) biomarker profiles within the DDI cohort over successive protocol visits. (a) A Sankey diagram depicting the transitions in biomarker profiles. (b) The Cypher query utilized to calculate transition counts based on the predefined AT(N) biomarker profiles in the ontology.

Figure 8. Dot plot of the phenotypes collected from subjects and their prevalence among the different AT(N) biomarker profiles.

Figure 9. t-SNE visualizations of node embeddings. (a) Scatter plot output from GRAPE for all node embeddings from the KG representing the topological connectivity, colored by node type. It displays similarity and some possible clusters (Balanced accuracy: 60.32% ± 1.25%); separability consideration derives from evaluating a Decision Tree trained on five Monte Carlo holdouts, with a 70/30 split between training and test sets. (b) Visit node embeddings with nodes labeled by their associated T biomarker from AT(N) (pathologic tau). The dashed line marks the decision boundary between node types computed from a logistic regression model, with an accuracy of 0.831.

Figure 10. Cosine similarity of target proteins to AT(N) proteins. (a) CSF ELISA protein panel. (b) Synaptic protein panel from proteomics assays.

Figure 11. A snapshot of BP prioritization from node centrality. (a) Full subnetwork of shared BPs between AT(N) and synaptic panel proteins. (b) Sankey diagram with the top 10 BPs obtained from closeness centrality.

Table 1. List of DemKG knowledge sources.