Mapping Cancer Registry Data to the Episode Domain of the Observational Medical Outcomes Partnership Model (OMOP)

Abstract: A great challenge in the use of standardized cancer registry data is deriving reliable, evidence-based results from large amounts of data. A solution could be its mapping to a common data model such as OMOP, which represents knowledge in a unified semantic base, enabling decentralized analysis. The recently released Episode Domain of the OMOP CDM allows episodic modelling of a patient's disease and treatment phases. In this study, we mapped oncology registry data to the Episode Domain. A total of 184,718 Episodes could be implemented, with the Concept of Cancer Drug Treatment being the most frequent. Additionally, source data were mapped to new terminologies as part of the release. It was possible to map ≈73.8% of the source data to the respective OMOP standard. The best mapping was achieved in the Procedure Domain, with 98.7%. To evaluate the implementation, the survival probabilities of the CDM and source system were calculated (n = 2756/2902, median OAS = 82.2/91.1 months, 95% CI = 77.4–89.5/84.4–100.9). In conclusion, the new release of the CDM increased its applicability, especially in observational cancer research. Regarding the mapping, a higher score could be achieved if terminologies which are frequently used in Europe are included in the Standardized Vocabulary Metadata Repository.


Introduction
A cancer diagnosis is often followed by complex treatment that can last for years. Recently, many new therapeutic approaches have been developed, either derived from basic research or the use of new diagnostic measures, such as DNA sequencing, which examine the tumor in more detail [1]. Thus, an initial diagnosis of cancer is often accompanied by a series of diagnostic modifiers, such as Gleason score, grading, stage group. From this set of characteristics, the treatment strategy can be derived, and success can be estimated. When a new therapeutic approach is selected, the physician considers which therapeutic measures have already been carried out to increase the probability of a positive response and to reduce the risk of an adverse reaction [2]. These developments in predictive medicine have ensured that guideline-based treatment is increasingly shifting towards a personalized approach. However, the ability to give a more detailed specification of the tumor phenotype due to greater stratification possibilities also leads to decreasing case numbers within a specific tumor entity. The use of an appropriate study population to achieve significant results is limited by complex inclusion and exclusion criteria. In addition, the methods for analyzing these complex relationships, for example using Artificial Intelligence (AI) techniques, are continuing to evolve. These models require larger sample sizes than the current statistical methods in order to derive valid results. The potential applications of AI in the field of oncology have grown rapidly in recent years. Especially, the use of AI in the field of image analysis has delivered great progress [3]. The identification of complex patterns in radiological images aids the detection of malignant tumors and simplifies clinical decision-making processes. For example, one study has shown that an algorithm can predict whether a pulmonary nodule will become cancerous within the next 2 years, with 80% accuracy [4]. 
In addition to this use of AI in early cancer detection, image analysis can assist in the identification of tumor-specific diagnostic factors. In another study, a deep learning algorithm was used to predict the Gleason score of a patient tumor using prostatectomy images with 70% accuracy [5]. Besides the use of image analysis, AI can also help with the analysis of genomic sequencing data. As sequencing capabilities are increasing, so is the number of discovered genomic mutations, leading to researchers having to clarify associations between genomic mutations and phenotypes using literature research. This is where AI approaches might be able to simplify human workloads [6].
In modern cancer research, it is crucial to establish data exchange or decentralized analysis pipelines based on a homogeneous data semantic base in the joint networks of individual research institutions [7]. For example, Cancer Core Europe, a consortium of 28 European cancer institutions, has stated that there is a "need for creating a uniform platform for translational cancer research to bring together enough centers to generate the critical mass of patients, expertise and resource required to make a significant breakthrough in cancer care" [7] (p. 523). However, the German Cancer Consortium has identified several challenges for the establishment of such networks. Because of different data protection laws worldwide, merging data is challenging. Furthermore, depending on the data infrastructure, there are different technical requirements, such as documentation systems and others, that can make data exchange difficult. However, in general, the greatest challenge lies in semantic heterogeneity [8]. Semantic heterogeneity in this context means that two IT systems fulfill the prerequisites for receiving data from each other (syntactic interoperability), but the interpretation of this data is not possible due to ambiguous semantics. A solution could be the mapping of cancer data to a common data model (CDM) which represents knowledge with unified semantics and enables decentralized analysis. Many CDMs come with analytical applications. Thus, the integration of heterogenous operational databases into a CDM enables the use of CDM-developed analytical applications, such as package libraries and REST APIs. A well-known CDM in the field of clinical research is the PCORnet Model from the Patient-Centered Outcomes Research Institute (PCORI). They have developed a policy of data standards to enable the efficient use of data in clinical and patient-powered research without violating data protection regulations. 
These data standards lead to the semantic alignment of the source data, so that multi-centered studies are possible without the respective institutions having to give up control of their data [9]. This can enable larger cohort sizes, which can be analyzed using AI through federated learning. The Clinical Data Interchange Standards Consortium (CDISC) has developed several data models that cover the different phases of the clinical research process. There are data models for study planning, data collection, the tabulation of study data, and analysis. These data models maintain compatible standards across all converted datasets. In a related study using resident registry data, the most common CDMs in the clinical research domain (SCDM v.5.0, PCORnet v.3.0, OMOP v.3.0, CDISC, SDTM v.1.4.) were evaluated in terms of completeness, integrity, flexibility, integrability, and implementability for EHR-based longitudinal registry data [10]. It was found that the OMOP CDM v.3.0, provided by the Observational Health Data Sciences and Informatics (OHDSI) community, achieved the best scores regarding the evaluation criteria of the study. OHDSI is a multi-stakeholder interdisciplinary collaboration founded in 2014. It arose from the public-private partnership with the US Food and Drug Administration (FDA). After FDA funding ceased, it was decided that a collaboration should be developed; the CDM was adopted as an open-source project with the aim of integration into scientific applications. Nowadays, this collaboration consists of an international network of researchers and over 100 observational health databases from 19 countries. It develops technical solutions for the representation of uniform medical data from different source systems, tackling the lack of standardized EHR and EMR data and the absence of consistent patient-level data in observational databases [11]. 
It provides open-source applications with the goal of strengthening the research community, whose findings can then be considered in clinical questions. For example, there is a comprehensive R package library that allows feature extraction from OMOP, and AI-based analysis of these extracted OMOP data to be performed. Also of note is the PatientLevelPrediction package, which provides patient-specific prediction models using machine learning and deep learning algorithms [12]. In addition, federated pipelines of different semantic homogeneity databases reduce capture bias, and a large number of observed patients in a study leads to higher statistical power and greater stratification possibilities. In September 2021, OMOP was supplemented by a new Episode Domain [13]. This Episode Domain contains the master table Episode, which displays an episodic modeling of the course of a disease, depending on its respective concepts. Episodic modelling of cancer is essential to represent the complex disease process. Correct episodic modeling is therefore of particular importance to derive evidence from oncology data. By implementing the standardized concept of Disease Dynamic in the Episode Domain, survival probabilities with cancer-specific endpoints can be calculated via the CDM. This concept is based on the Response Evaluation Criteria in Solid Tumors (RECIST). These models were defined by an international working group aiming to establish uniform regulations for physicians to classify responses to tumor treatment [14]. The availability of RECIST data is essential to enable the comparison of the analysis results across institutions, for example from survival analyses in multicentered studies. The Episode Domain also contains the Episode_Event table which allows linkage of the abstracted Episodes to low-level events of the CDM, newly embedded with the implemented standardized vocabulary. 
Besides extending the CDM with the Episode Domain, new oncology terminologies, primarily those commonly used in cancer care such as ICD-O-3, ATC, and others, were added to OHDSI's Standardized Vocabulary Metadata Repository.
However, the extent to which Episodes of a cancer course can be represented through the implementation of newly added tables, and how well oncology registry data can be displayed through the newly standardized vocabulary, such as ATC, HemOnc, ICDO3, and Cancer Modifier, have never been investigated. The data used in this study were collected from the clinical cancer registry (KKR) of the University Hospital Hamburg-Eppendorf (UKE), and range from the structured recording of a new diagnosis until the death of the patient within the UKE. The KKR has existed since 2010, and documents all cancer patients who have received cancer-related diagnostic or therapeutic measures at the UKE. Moreover, the KKR must report these collected cancer data to the national registry for quality assurance and research purposes.
The objective of this study was firstly to find out to what extent the source terminologies of the clinical cancer registry can be mapped to the respective OMOP standard. Secondly, we investigated to what extent the source data of the tumor documentation system can be transferred to the Episode Domain. Finally, we explored how well survival analyses can be derived from the OMOP CDM compared to the source system. Thus, overall survival analyses were conducted across the CDM and source system.

Materials and Methods
The implementation of the new tables was carried out in three phases. The first phase comprised episodic modeling according to Disease Extent, Disease Dynamic and Treatment (Figure 1). The second phase involved the mapping of the cancer data to the oncology standardized vocabulary, and the last phase comprised the linkage of Episodes to the underlying clinical events of the CDM by the implementation of the Episode_Event table.

Source System
The Giessener Tumor Documentation System (GTDS) [15] is a client-server application with an ORACLE database management system at the backend. The frontend is provided by an ORACLE Forms application and a web application. Its interface is connected to the central Health Information System (HIS) so that patients with a cancer diagnosis are automatically imported into the documentation system. All imported cases are reviewed by a clinical documentation specialist and then documented in a structured form using different input masks. The relational GTDS database comprises 422 tables that are related by primary and foreign keys. For a correct data query, a deep understanding of the cardinality of the tables is essential. Primary and foreign keys must be connected correctly to avoid either an endless query loop or duplicate data entries. A patient population of 26,000 was included in this study. This provided the groundwork for the mapping process.

Episodic Modelling
In the first step, only Episodes which described the extent of the disease were modelled. Possible attributes were Confined Disease (Concept_id: 32942), Invasive Disease (Concept_id: 32943) and Metastatic Disease (Concept_id: 32944). Values from the Tumor-Node-Metastases staging system (TNM) in the source system were chosen as the starting point for modeling these Episodes. Date-exact TNMs were aggregated using a custom algorithm, and time intervals were derived from these aggregated data. In addition, the source data were mapped to standardized concepts of disease response during treatment (Disease Dynamic) according to RECIST, which reflect the phase of the patient's disease and from which survival probabilities can be derived. The source system provides the disease state of the patient at a certain time point (to a day). The measurement points for the determination of the remission status are summarized in time intervals (start date, end date) by a custom algorithm that firstly derives a time interval from the measurement points, secondly takes into account the underlying concept (Complete Remission (Concept_id: 32946), Remission (Concept_id: 32945), Partial Remission (Concept_id: 32947), Stable Disease (Concept_id: 32948) and Progression (Concept_id: 32949)), and thirdly merges time intervals with the same underlying concept. For the presentation of Cancer Drug Treatment (Concept_id: 32941), Cancer Radiotherapy (Concept_id: 32940) and Cancer Surgery (Concept_id: 32939), the corresponding tables from the source system were used. The OHDSI OncoRegimenFinder algorithm [16] was used for modeling the Treatment Regimen (Concept_id: 32531). Drugs that were administered within a 30-day time window were summarized in regimens and translated to HemOnc [17] terminology where possible.
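The three steps of the Disease Dynamic aggregation can be illustrated with a minimal Python sketch. It is not the custom algorithm used in this study, whose details are not given here. The sketch assumes that an interval runs from the first to the last measurement point of a run of identical concepts; the `Episode` class and the function name `merge_status_points` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Episode:
    concept_id: int
    start_date: date
    end_date: date


def merge_status_points(points):
    """Collapse dated (date, concept_id) points into episodes.

    Assumption: consecutive points with the same concept form one interval
    running from the first to the last point of that run.
    """
    ordered = sorted(points, key=lambda p: p[0])
    episodes = []
    for concept_id, run in groupby(ordered, key=lambda p: p[1]):
        dates = [d for d, _ in run]
        episodes.append(Episode(concept_id, dates[0], dates[-1]))
    return episodes


# Illustrative measurement points for one patient (invented dates).
points = [
    (date(2015, 1, 10), 32947),  # Partial Remission
    (date(2015, 4, 2), 32947),   # Partial Remission
    (date(2015, 7, 15), 32946),  # Complete Remission
    (date(2016, 1, 20), 32946),  # Complete Remission
    (date(2016, 6, 5), 32949),   # Progression
]
episodes = merge_status_points(points)
```

Here five measurement points collapse into three Episodes. A single isolated point yields an Episode whose start and end dates coincide; a production algorithm would need an explicit rule for such cases.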

Vocabulary Mapping
ICD-O3 is a combined classification of the topography and morphology of a tumor [18]. The topography is derived, in part, from the ICD-10 code and has a 4-digit character that covers the range from C00.0 to C80.9 which, similar to ICD-10, specifies the tumor category. The Morphology code of ICD-O3 specifies the type of cell of the neoplasm and the behavior. The ICD-O3 is implemented in the CDM in the Condition Domain and links the Condition_occurrence events with the disease episodes of the oncology module.
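The structure of an ICD-O-3 code pair can be made concrete with a short Python sketch. It assumes the common notation of a four-digit histology plus a one-digit behavior (for example `8500/3`) and a topography of the form `Cxx.x`. The combined `morphology-topography` string is an assumption about how such pairs are written as a single code, and `parse_icdo3` is a hypothetical helper.

```python
import re

TOPOGRAPHY = re.compile(r"^C(\d{2})\.(\d)$")   # e.g. C50.9
MORPHOLOGY = re.compile(r"^(\d{4})/(\d)$")     # e.g. 8500/3


def parse_icdo3(morphology, topography):
    """Split an ICD-O-3 morphology/topography pair into its parts."""
    m = MORPHOLOGY.match(morphology)
    t = TOPOGRAPHY.match(topography)
    if not m or not t:
        raise ValueError("not a valid ICD-O-3 morphology/topography pair")
    if int(t.group(1)) > 80:
        # The topography axis covers C00.0 to C80.9 only.
        raise ValueError("topography outside C00.0-C80.9")
    return {
        "histology": m.group(1),
        "behavior": m.group(2),
        "site": topography,
        # Assumed convention for a single combined code string.
        "combined_code": f"{morphology}-{topography}",
    }


parsed = parse_icdo3("8500/3", "C50.9")
```

Validating the range up front lets malformed source entries be reported before any vocabulary lookup is attempted.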
The North American Association of Central Cancer Registries (NAACCR) defines cancer registry standards for the structured acquisition of data in North America [19]. NAACCR incorporates existing ontologies and classifications, such as ICD-O-3, into its data standards. This ontology is mainly used in cancer registries in the USA and Canada. All data collected in the context of cancer therapy and diagnosis are assigned to specific items, which are either superordinate or assigned to special schemes, according to the respective cancer entity. Each item has a NAACCR value. Source items were mapped according to NAACCR at item and value level.
The National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH) of the USA, provides information and research services for making biomedical data usable in the context of healthcare, and grants access to evidence-based results [20]. In 2003, the NLM developed and administered the ontology SNOMED-CT (Systematized Nomenclature of Medicine Clinical Terms) [21]. It has nine hierarchically arranged concepts, of which this study uses Clinical Finding and Procedure, covering hierarchical levels 1 and 2. By incorporating the root concept, the underlying subtypes can be identified with their associated descendants. The higher the concept class of the corresponding domain, the more descendants can be identified in SNOMED-CT. However, it is also possible to infer the root concept from the descendants.
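Traversing the hierarchy from a root concept to its descendants, and back, can be sketched against the OMOP `concept_ancestor` table. The following Python example uses an in-memory SQLite database with the standard column names of that table; the concept ids 1 to 3 are invented toy values, not real SNOMED-CT concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (
    ancestor_concept_id      INTEGER,
    descendant_concept_id    INTEGER,
    min_levels_of_separation INTEGER,
    max_levels_of_separation INTEGER
);
-- Toy hierarchy 1 -> 2 -> 3, including the self-referencing rows.
INSERT INTO concept_ancestor VALUES
    (1, 1, 0, 0), (1, 2, 1, 1), (1, 3, 2, 2),
    (2, 2, 0, 0), (2, 3, 1, 1),
    (3, 3, 0, 0);
""")


def descendants(conn, root_id):
    """All concepts below root_id, excluding the root itself."""
    rows = conn.execute(
        "SELECT descendant_concept_id FROM concept_ancestor "
        "WHERE ancestor_concept_id = ? AND min_levels_of_separation > 0 "
        "ORDER BY descendant_concept_id",
        (root_id,),
    )
    return [r[0] for r in rows]


def ancestors(conn, concept_id):
    """All concepts above concept_id, excluding the concept itself."""
    rows = conn.execute(
        "SELECT ancestor_concept_id FROM concept_ancestor "
        "WHERE descendant_concept_id = ? AND min_levels_of_separation > 0 "
        "ORDER BY ancestor_concept_id",
        (concept_id,),
    )
    return [r[0] for r in rows]
```

In this toy hierarchy, the higher concept 1 yields two descendants and concept 2 only one, which mirrors the observation that higher concept classes identify more descendants.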
HemOnc is a medical wiki. It provides information on treatment regimens, subdivided by disease subtypes, and additionally offers information on drugs, interventions, and general information on the treatment of neoplasms [22]. The HemOnc Wiki was integrated into the Standardized Vocabulary Metadata Repository v5 to provide a link between the abstraction of Treatment Regimen Episodes of the oncology module and low-level drug events of the OMOP CDM [23].
As part of the Episode implementation, source data were mapped to the new vocabulary (Figure 2, Table 1). In the OMOP CDM, the ICDO-3 classification was used to represent the cancer diagnosis. The elements that were used to specify the tumor diseases in more detail were included in the domains of Measurement and Observation. As part of the implementation of the Episode Domain, the source data were mapped to ICDO-3, SNOMED, ATC, HemOnc, Cancer Modifier, and NAACCR standardized vocabularies. The primary approach was to map the oncological data to the SNOMED-CT terminology. If another classification system was more granular with respect to cancer representation, it was preferred to SNOMED-CT.

Linkage between Episodes and Underlying Clinical Events
The linkage of the abstracted Episodes to the corresponding underlying events of the CDM was performed via the Episode_Event table. The pooling of polymorphic foreign keys of the clinical tables in the Episode_Event table provides the possibility to link the unique identifiers of low-level events of the CDM with an Episode. Thus, all therapeutic or diagnostic measures can be assigned to an Episode. For example, this table can then be used to query which measures have been undertaken during the event of a progressive course, e.g., for renewed diagnostics, re-radiation, surgery, and other measures.

CDM Application and Comparison
To test the applicability and accuracy of the CDM, the overall survival of a breast cancer cohort was calculated via the CDM and the source system, and the results were compared. The Null Hypothesis (H0) assumed that the calculated overall survival of the two systems was the same. The Alternative Hypothesis (H1), on the other hand, assumed that there were differences in overall survival between the systems. The probability of error (alpha error) was set at 5% for this test. This means that, if p > 0.05, H0 could not be rejected and was therefore retained. The survival analyses were performed in a dynamic R Markdown report. The DatabaseConnector package was used to extract the survival cohorts from the source and CDM databases. These cohorts were merged into one dataset using the dplyr package and then analyzed, stratified by system, using the survminer and survival packages.
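The comparison logic can be illustrated with a self-contained Python sketch of a two-group log-rank test and a Kaplan-Meier median. This is a substitute illustration, not the R code used in the study, which relied on the survival and survminer packages. The function names and the toy data are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def km_median(times, events):
    """Smallest time at which the Kaplan-Meier estimate drops to <= 0.5."""
    surv = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        if surv <= 0.5:
            return t
    return None  # median not reached


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({u for u, e, _ in data if e}):
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of the chi-square distribution with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy data: identical cohorts give no evidence against H0.
same = logrank_test([5, 8, 12, 20], [1, 1, 0, 1], [5, 8, 12, 20], [1, 1, 0, 1])
# Clearly separated cohorts give a small p-value.
diff = logrank_test([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
```

With identical cohorts the statistic is zero and the p-value is one, so H0 is retained; with clearly separated cohorts the p-value falls well below the 5% threshold.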

Results

Episodic Modeling
Within this study, a total of 184,718 Episodes could be implemented in the new Episode table of the CDM. This standardized data pool of concepts can be used by most of the OHDSI collaborative applications, allowing cross-institutional comparison. In the Episode Domain, the concept classes of Disease Extent, Disease Dynamic and Treatment were implemented. There were 26,700 documented TNMs. From these, 18,561 could be mapped to the Disease Extent concept during modeling. Regarding the Disease Dynamic, which reflects the disease status, a total of 31,627 Disease Dynamic concepts could be derived from a total of 147,816 measurement points. The concept of Complete Remission, with 60% (n = 18,980), was the most frequent outcome (Figure 3a).
With respect to the episodic modeling of treatment phases, 99,840 Treatment Episodes could be derived from the source system (Figure 3b).

Vocabulary Mapping
Furthermore, vocabulary concepts, assigned to relation types in the CDM, could be queried via the Concept_relationship table in the Standardized Vocabularies Domain. This linking of relationship types makes it possible to query additional information of a concept without this information being available in the source system. It was possible to implement 79 distinct relationships.
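Querying a relationship type and deriving a mapping rate from it can be sketched as follows. The Python example uses an in-memory SQLite table with the `concept_relationship` column names of the OMOP Standardized Vocabularies; the concept ids are invented toy values, and the helpers `map_to_standard` and `mapping_rate` are hypothetical. It assumes that unmapped source concepts are represented by the concept id 0 and that the first "Maps to" target is sufficient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT
);
-- Toy rows: two source concepts have a standard counterpart.
INSERT INTO concept_relationship VALUES
    (101, 201, 'Maps to'),
    (101, 301, 'Is a'),
    (102, 202, 'Maps to');
""")


def map_to_standard(conn, source_concept_id):
    """Return the first 'Maps to' target, or 0 if the concept is unmapped."""
    row = conn.execute(
        "SELECT concept_id_2 FROM concept_relationship "
        "WHERE concept_id_1 = ? AND relationship_id = 'Maps to' "
        "ORDER BY concept_id_2",
        (source_concept_id,),
    ).fetchone()
    return row[0] if row else 0


def mapping_rate(conn, source_concept_ids):
    """Percentage of source concepts that resolve to a standard concept."""
    mapped = sum(1 for c in source_concept_ids if map_to_standard(conn, c) != 0)
    return 100.0 * mapped / len(source_concept_ids)


rate = mapping_rate(conn, [101, 102, 103, 104])
```

A "Maps to" relation can point to more than one standard concept; a full ETL would keep all targets rather than only the first.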
The "Maps to" relation was the most frequently occurring relation (n = 463,880, 16.77%) (Table 2). The relation "Has priority", on the other hand, was the least represented, with five (0.0002%) events. In total, 2,765,952 oncological data-related elements could be mapped to the standard. It was possible to map 153,490 data entries to the standardized Cancer Modifier vocabulary. Among them, the concept of topography of the tumor (n = 53,744, 35.01%) could be mapped most frequently. The OncoRegimenFinder algorithm extracted 16,303 Treatment Regimens from the documented ATC data in the source system. There were 26,443 documented protocols, similar to Treatment Regimens of the Episode Domain, in the source system. Therefore, the algorithm extracted 38.35% fewer Treatment Regimens than were stored in the source system. A brief comparison of time intervals showed that only 42% of the detected Treatment Regimens had at least one correct start or end date, assuming that the documented protocols of the source system represented the actual values. From these Treatment Regimens, 60% (n = 9800) could be assigned to the regimen class Chemotherapy. The regimen class Immunosuppressive Therapy had the lowest number of events, with only 3% (n = 485). The achieved vocabulary mapping score between the CDM and source system depending on the Domain can be seen in Table 3.

Linkage between Episodes and Underlying Clinical Events
By using the Episode_Event table, it is possible to link the underlying clinical events to the derived Episodes. Clinical events were assigned to an Episode by their time interval: only events whose examination date fell within the time interval of an Episode were assigned to it. In total, 2,056,721 events could be assigned to an Episode (Table 4). Most linked events were obtained in the Measurement Domain (n = 820,013), with an average of 8.23 Measurement events per Episode.
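The temporal assignment rule can be sketched in Python as follows. The output rows use the three columns of the OMOP Episode_Event table (`episode_id`, `event_id`, `episode_event_field_concept_id`). The field concept ids 9001 and 9002 are hypothetical placeholders, since the real values come from the OMOP vocabulary, and the person-level match is an assumption about how events and Episodes are paired.

```python
from datetime import date

# Hypothetical placeholders for the concept that identifies the source table
# of an event; real values would be looked up in the OMOP vocabulary.
FIELD_CONCEPT_ID = {"measurement": 9001, "procedure_occurrence": 9002}


def link_events(episodes, events):
    """Return Episode_Event-like rows for events dated inside an episode."""
    rows = []
    for event in events:
        for episode in episodes:
            same_person = event["person_id"] == episode["person_id"]
            in_interval = episode["start"] <= event["date"] <= episode["end"]
            if same_person and in_interval:
                rows.append({
                    "episode_id": episode["episode_id"],
                    "event_id": event["event_id"],
                    "episode_event_field_concept_id":
                        FIELD_CONCEPT_ID[event["table"]],
                })
    return rows


episodes = [
    {"episode_id": 1, "person_id": 7,
     "start": date(2015, 1, 1), "end": date(2015, 6, 30)},
    {"episode_id": 2, "person_id": 7,
     "start": date(2015, 7, 1), "end": date(2015, 12, 31)},
]
events = [
    {"event_id": 501, "table": "measurement", "person_id": 7,
     "date": date(2015, 3, 14)},
    {"event_id": 502, "table": "procedure_occurrence", "person_id": 7,
     "date": date(2015, 8, 2)},
    {"event_id": 503, "table": "measurement", "person_id": 7,
     "date": date(2016, 2, 1)},   # outside every episode
    {"event_id": 504, "table": "measurement", "person_id": 8,
     "date": date(2015, 3, 14)},  # different patient
]
links = link_events(episodes, events)
```

Because the event identifier is stored together with a field concept, one table can reference rows from several clinical tables, which is what the polymorphic foreign key design provides. Overlapping Episodes would link the same event more than once under this rule.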

CDM Application and Comparison
Regarding the applicability of the CDM, we tested whether the results of the overall survival analyses were similar across the source system and the CDM. The calculated survival probabilities did not differ significantly from each other (p-value = 0.82) and thus the H0 hypothesis was accepted. Accordingly, the median survival of a patient with breast cancer was 164 months in CDM; the calculated median survival in the source system was two months shorter (Table 5). The percentage deviation in cohort size was 1.5%, with a larger cohort included in the source system. The Number at Risk distribution can be seen in Figure 4.

Discussion
The newly integrated Episode and Episode_Event tables, which were introduced in the last OMOP release, represent an enrichment for the representation of long-term chronic diseases, including cancer. Due to the complexity of the disease and treatment approaches, episodic modeling of the respective disease phase is useful. It is now possible to represent the course of disease according to its timeline regarding treatment and disease development. In addition, low-level clinical events of a patient can now be assigned to an Episode via their temporal reference. The use of this new data structure within the CDM increases the capabilities of oncology data analysis and visualization.
Regarding overall survival, it can be concluded that the results did not differ significantly between the systems. Nevertheless, only overall survival was evaluated in this work. The median difference in progression-free and disease-free survival was not investigated, which should be completed in the future. In addition, it must be considered that in the context of mapping to the standardized vocabularies, data entries from the source system occasionally could not be mapped to the respective standard, resulting in a reduced number of patients included in the CDM.
Additionally, it must be noted that the extension of the CDM is not yet integrated in all applications of the OHDSI community, and is only implemented on the CDM database level. Thus, the extension is also not yet integrated in the analytical and methodological toolchain provided by the OHDSI community, such as Hades, Atlas, and others, which currently limits the evaluation options of the new release. However, it can be assumed that this will be mitigated by the next major release.
In addition to implementing the Episodes, this project also addressed vocabulary mapping. Existing terminologies from the source system can be mapped to the OMOP standard. Alternatively, the terminologies in the source system can serve as a starting point for an elaborate mapping process to a completely new vocabulary, as in the case of the HemOnc mapping process. Mapping the HemOnc terminology to the OMOP standard improves the Treatment Regimen evaluation [23]. However, the algorithm developed by the Oncology Working Group (OncoRegimenFinder), which allows the translation of ATC substances to HemOnc terminology, is considered an even greater improvement. It is now possible to derive Treatment Regimens from ATC terminology, which is used in almost all European hospital systems for drug coding. In addition, by mapping source data to the OMOP-ATC and the OMOP-HemOnc standards, it is possible to revert from both terminologies to the respective RxNorm standard, which is a clinical drug dictionary for all drugs that are approved for the pharmaceutical market of the USA [20]. Through the standardized vocabularies maintained by the OHDSI 'CDM and THEMIS Working Group', it is now feasible to translate data elements from one of these three terminologies to each other, enabling international observational cancer research [23]. Furthermore, besides the ontological integration of Treatment Regimens into the CDM, the HemOnc Wiki also includes information on phase I-III clinical trials. In future releases, it is planned to include this information in the Standardized Vocabulary Metadata Repository, which would allow inference from the Treatment Regimen to the performed studies before approval [22,23]. 
Overall, the ontological structure of the CDM simplifies the complexity and the effort of the generation and phenotyping of cohorts; high level terms of CDM concepts can be incorporated into the query as a parameter, rather than each parameter individually, as is common in the source system.
As part of this work, source data were mapped to the Cancer Modifier vocabulary, an OMOP standard composed of different standards such as NAACCR, WHO, and SEER. However, this vocabulary is currently integrated into the CDM ontology via only two relations, which severely limits its query and analysis options, especially regarding other terminologies included in the Standardized Vocabulary Metadata Repository. Nevertheless, it can be assumed that the number of relations will increase with new vocabulary releases. Project-related mapping to the NAACCR vocabulary was challenging, since its data structure is semantically very heterogeneous compared to the source system. Therefore, only a few data elements from the nationwide basic dataset for standardized cancer registration in Germany (ADT/GEKID), which is implemented in the source system, were mapped to the NAACCR standard. Not many terminologies that are used as standard in Europe or Germany are part of the latest vocabulary release. This complicates the mapping process, since the data must be prepared in a complex manner before they meet NAACCR vocabulary standards, leading to a loss of data during the preparation process. Therefore, in a next step, it would make sense to include other terminologies, especially those frequently used in the European or German area, such as the basic dataset ADT/GEKID, into the Standardized Vocabulary Metadata Repository of the CDM. Additionally, this would offer a general applicability of German cancer registries for data harmonization. Especially with regard to cross-cancer registry analyses, i.e., federated learning environments, this should be completed as a next step.

Limitations
In the context of this project, cancer-related patient data including diagnostic characteristics and prior therapies were mapped to the OMOP CDM v5.4. It has to be noted that data protection regulations in Germany make data harmonization difficult. Specifically, it is not possible to link a patient's medical record with their cancer diagnosis and map possible interactions between the development of cancer and medical records without considerable formal effort. For example, German data protection regulations impede the HL7 import from the HIS via the GTDS interface, or the ETL into the data warehouse of the KKR of the UKE. Especially with regard to personalized and predictive medicine, these restrictive data protection regulations should be revisited. Additionally, the harmonization of EHR, EMR and registry data should be further advanced. Finally, in this study, record linkage was not considered during the mapping process. This could lead to patient duplications in future joint research projects with other institutions. Furthermore, concerning the calculation of overall survival, it is noteworthy that only those patients were included in the analysis whose diagnosis data could be mapped to the respective ICDO-3 standard in the Condition Domain of the CDM. Conversely, this means that all patients from the source system who could not be mapped to an ICDO-3 standard within the framework of the CDM mapping were excluded.

Conclusions
The new module, which was officially introduced by the OHDSI community in the OMOP CDM v5.4 release, is a great addition to the field of joint observational cancer research. It is currently the only CDM in the field of clinical research that includes a comprehensive standardized terminology of cancer representation and allows time-dependent episode-based modeling of disease progression. In addition, by mapping to the OMOP ontology, the source data can be enriched with additional information. This increases its application and evaluation possibilities. Furthermore, unified semantics offers the easy implementation of an AI-federated algorithm pipeline.
Nevertheless, many terminologies were included in the Standardized Vocabulary Metadata Repository that are rarely used or not used at all in the European or German areas, limiting the mapping success. This gap should be closed in the coming years to guarantee the mappability of different oncological data sources to the CDM. Especially, the inclusion of the basic oncology dataset (ADT/GEKID) in the standardized vocabulary would considerably facilitate and expand data harmonization between German cancer registries and enable joint analyses.
Additionally, it should be considered that duplicate patients can also occur in distributed research networks. These can only be clearly identified via record linkage. Therefore, future research should especially consider how to establish record linkage within the CDM across distributed research networks without contradicting the country-specific data protection regulations in place, potentially through the use of superior pseudonyms, and prepare the essential steps to enable precision medicine and precision oncology.