Common Data Model and Database System Development for the Korea Biobank Network

Ko, Soo-Jeong; Choi, Wona; Kim, Ki-Hoon; Lee, Seo-Joon; Min, Haesook; Oh, Seol-Whan; Choi, In Young

doi:10.3390/app112411825


by Soo-Jeong Ko ^1,2, Wona Choi ^1,2, Ki-Hoon Kim ^1,2, Seo-Joon Lee ^1, Haesook Min ^3, Seol-Whan Oh ^1 and In Young Choi ^1,*

^1 Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea

^2 Department of Biomedicine & Health Sciences, The Catholic University of Korea, Seoul 06591, Korea

^3 Division of Biobank, Department of Precision Medicine, National Institute of Health, Cheongju-si 28159, Korea

^* Author to whom correspondence should be addressed.

Appl. Sci. 2021, 11(24), 11825; https://doi.org/10.3390/app112411825

Submission received: 29 October 2021 / Revised: 7 December 2021 / Accepted: 9 December 2021 / Published: 13 December 2021

(This article belongs to the Special Issue New Trends in Medical Informatics II)


Abstract

The importance of clinical information related to specimens is increasing as research on human biospecimens is conducted worldwide. To utilize these data, it is necessary to define the range of data and develop a standardized system for collected resources. The purpose of this study is to standardize clinical information and to establish a clinical information management system that improves the utilization of biospecimens. The Korea Biobank Network (KBN) common data model (CDM), consisting of 18 tables and 177 variables, was developed. The clinical information codes were mapped to standard terms. Data on the 27 diseases in the target group were collected from 17 biobanks, and all disorders not belonging to the group were also standardized and loaded. We also developed a system that provides statistical visualization screens and data retrieval tools for the collected data. This study modeled the KBN CDM and developed a unified management system that collects standardized data, manages clinical information, and shares the information systematically. Through this system, all participating biobanks can be integrated into one system for integrated management and research.

Keywords:

common data model; Korea Biobank Network; biological specimen banks; data management; clinical information system

1. Introduction

As the number of studies using human biospecimens increases worldwide, the importance of clinical information linked to human materials is increasing [1,2,3]. The common data model (CDM) is an internationally popular approach, and many joint studies are being conducted to standardize data structures and terms to enable joint research [4,5,6]. As data structure and format standardization progress in research using biospecimens [7], an environment for the more active use of clinical information has been created, and various models using bioresources have become the basis for expansion and development [8].

In Korea, the NBK (National Biobank of Korea) developed the HuBIS_Sam (Human Biobank Information System_Sample), which is a historical management system for all processes involving the collection, storage, and distribution of human biospecimens. The HuBIS_Sam was developed in 2008, and by 2020, it had been used by 60 human biobanks throughout Korea. Among these biobanks, 17 human material banks have formed a network called the Korea Biobank Network (KBN) under the Korea Centers for Disease Control and Prevention. Biobanks use Excel files or other clinical information management systems to generate and manage clinical information linked to samples. This information is then distributed to researchers [9]. There is no integrated management system that can manage the collected clinical information and systematically distribute it [9,10,11,12].

2. Motivation

The human material banks in the KBN have a large-scale biospecimen infrastructure, with material from approximately 500,000 hospital patients. However, the use of the clinical information is limited because it was not collected using a purpose-based collection system, and the connection between samples and clinical information is insufficient and not standardized. To utilize these data, it would be necessary to establish a repository using the CDM by defining the specific scope of resource information and preparing a standardization system for terms and collected resources. In addition, the development of an information technology-based integrated management system that can manage the collected data in an integrated manner would also be needed.

The purpose of this study was to establish a clinical information standardization CDM and to establish an integrated clinical information management system to increase the value of biospecimen utilization and clinical data. If clinical information can be systematically collected and stored, researchers can be better supported and standardized data can be actively used in various studies such as drug development and biomarker verification. We propose a data modeling system for integrating the biobank data in Korea.

3. Related Work

3.1. Related Concepts

The KBN is an institution that was established to enhance research competitiveness in the Korean medical field by establishing a network among human bioresource banks in Korea for collecting, managing, and distributing biospecimens to researchers at the national level [13,14].

A CDM provides a common data structure and format for research using data from different databases [4,15]. It further provides a systematic environment through which research methods can be compared effectively, and reproducible results can be obtained based on standardized data. A CDM is a human-centered model: all clinical data tables are linked around the PERSON table, and all terms are standardized.

A biobank is an institution that establishes and manages biospecimens used as research materials and provides them to researchers. The biobank stores biospecimens from donors with a variety of clinical conditions, performs quality checks, creates databases for processing clinical information or epidemiological data, and distributes biospecimens for research [16].

3.2. Related Research

In Europe, the Minimum Information About BIobank Data Sharing (MIABIS) dataset was created so that biobank samples would have the minimal datasets required for information and data sharing. It consists of a total of 52 variables: 17 variables for data sharing such as bank ID and contact person, and 34 variables for data describing sample collections. Among these, basic patient information and clinical records are not separate data variables and are not coded but instead are recorded in free text format as one item [17,18].

In 2006, a European study developed a central database application for a tissue bank. In the TuBaFrost system, both patient case data and tissue sample data were centrally managed. Although the patient clinical information was collected in the patient case data part, it was necessary to request detailed data from each biobank during the study because it stored minimal information [19].

In the United States, various types of disease-specific biobanks are actively being built. Autism BrainNet is a collaborative network of university-based sites with the shared goal of building an extensive enough archive of postmortem brain material to support current genetic and neuropathologic research [20]. In China, biobanks collect data on specialized diseases [21]. For instance, China’s Liver Cancer Biobank maintains a database of 13 disease groups related to liver cancer. Seven types of data are collected: patient characteristics, image data, laboratory tests, operative data, tumor characteristics, pathology, and follow-up data [22].

4. Proposed System

4.1. KBN-CDM Development

The specifications of the proposed CDM are presented as a Unified Modeling Language (UML) diagram in Figure 1. The data model was designed to cover the information collected from the 17 biobanks in 2019, including all 27 disease types as well as all other diseases that do not belong to one of the 27 types.

The model consists of a total of 18 tables and 177 variables. The patient information of each hospital is loaded into the BASICINFO table, and because one patient can be admitted multiple times, the BASICINFO table and the REGISTRATION table form a 1:N relationship. The REGISTRATION table and each of the remaining 15 tables containing clinical information were also designed to form 1:N relationships. The codes for the data are managed in the VOCA table. Among these codes, those for clinical information such as diagnosis, drinking habits, and smoking habits are mapped to international standards such as Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT) or Logical Observation Identifiers Names and Codes (LOINC).
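The 1:N chain described above can be illustrated with a small relational sketch. This is only a minimal illustration, not the authors' schema: the table names come from the paper, but every column name, the sample rows, and the use of SQLite instead of MySQL are assumptions made for the example.

```python
import sqlite3

# Table names follow the paper; every column name is an assumption for illustration.
SCHEMA = """
CREATE TABLE BASICINFO (
    kbn_id    TEXT PRIMARY KEY,
    bank_code TEXT NOT NULL
);
CREATE TABLE REGISTRATION (
    reg_id   INTEGER PRIMARY KEY,
    kbn_id   TEXT NOT NULL REFERENCES BASICINFO(kbn_id),
    reg_date TEXT NOT NULL
);
CREATE TABLE SPECIMEN (
    specimen_id   INTEGER PRIMARY KEY,
    reg_id        INTEGER NOT NULL REFERENCES REGISTRATION(reg_id),
    specimen_type TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database: one patient, two registrations, three specimens."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO BASICINFO VALUES ('P001', 'SU')")
    conn.executemany(
        "INSERT INTO REGISTRATION VALUES (?, ?, ?)",
        [(1, "P001", "2019-03-01"), (2, "P001", "2019-09-15")],
    )
    conn.executemany(
        "INSERT INTO SPECIMEN VALUES (?, ?, ?)",
        [(1, 1, "serum"), (2, 1, "plasma"), (3, 2, "serum")],
    )
    conn.commit()
    return conn


def specimens_per_registration(conn: sqlite3.Connection):
    """Walk the 1:N chain BASICINFO -> REGISTRATION -> SPECIMEN."""
    return conn.execute(
        """
        SELECT r.kbn_id, r.reg_id, COUNT(s.specimen_id)
        FROM REGISTRATION AS r
        LEFT JOIN SPECIMEN AS s ON s.reg_id = r.reg_id
        GROUP BY r.kbn_id, r.reg_id
        ORDER BY r.reg_id
        """
    ).fetchall()
```

In this sketch a registration row cannot exist without its patient row, and a specimen row cannot exist without its registration row, which is one way to express the person-centered linkage in a relational database.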

The details of each KBN clinical CDM table are given in Table 1. The BASICINFO table consists of basic patient information such as patient number, sex, date of birth, occupation, and registered biobank. Occupation is coded into 13 types and mapped to SNOMED-CT. The REGISTRATION table contains records about patient registration with a given institution. The information included in this table consists of registration date, age (at the time of registration), and diagnosed disease (at the time of registration). As mentioned, diseases are classified into 27 types. The SPECIMEN table contains the collected specimen information. All specimen data were obtained by linking to the HuBIS_Sam system used in the KBN. The information included in this table consists of the specimen collection date, specimen type, and number of specimens. The STAGE table contains the American Joint Committee on Cancer (AJCC) pre-op and post-op staging results.

The BDMEASURE table contains body measurement data. These data include height, weight, body mass index (BMI), systolic blood pressure, diastolic blood pressure, pregnancy status, loss of weight (yes or no), amount of weight lost, Eastern Cooperative Oncology Group (ECOG) performance status, waist circumference, pulse rate, temperature, respiration rate, head circumference, and umbilical cord blood count. The HISTORY and FAMHISTORY tables include information on the patient’s status before being registered on the system, such as any other disease history before registration, family history of the same disease, and any family history of allergies. The CLINEXAM table includes laboratory test results at the time of biobank registration. The NOTE table includes pathology and surgery records. The EXAM table includes functional tests such as lung capacity tests and ECG tests, or exceptional test results such as the Apgar score of premature babies.

The OPERATION_DETAIL table includes surgical operation record details, whereas the PATHOLOGY table includes pathology record details. Data from both tables include information regarding the body part that the surgery was performed on, the size of the tumor, or any other information considered important during surgery or radiology. The TREATMENT table includes non-surgical treatment information, drug treatment information, or any other past treatment that the patient has had. The SYMPTOM table records the disease-related symptoms that the patient has had, such as hepatic coma status and ascites status. Medical survey data that patients have completed themselves, such as the International Prostate Symptom Score (IPSS) and urination matching, are collected in the SURVEY table. The FOLLOWUP table includes follow-up data of the patient, such as death, disease recurrence, or metastasis. The VOCA table stores the internal management codes for terms used in the overall table definition, together with the standard term code for each term.
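As a rough illustration of how a vocabulary table of this kind can be used, the sketch below resolves an internal code to its standard code. The field names mirror the VOCA columns listed in Table 1; the rows, the code values, and the function `to_standard` are hypothetical placeholders and are not real KBN, SNOMED-CT, or LOINC content.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VocaEntry:
    # Field names mirror the VOCA columns listed in Table 1.
    kbn_name: str
    kbn_code: str
    standard_code: Optional[str]
    standard_name: Optional[str]
    reference: Optional[str]


# Placeholder rows: these are NOT real KBN, SNOMED-CT, or LOINC codes.
VOCA_ROWS = [
    VocaEntry("Current smoker", "KBN-EX-001", "STD-EX-001", "example standard term", "SNOMED-CT"),
    VocaEntry("Example lab test", "KBN-EX-002", "STD-EX-002", "example standard test", "LOINC"),
    VocaEntry("Site-specific item", "KBN-EX-003", None, None, None),
]

VOCA_BY_CODE = {row.kbn_code: row for row in VOCA_ROWS}


def to_standard(kbn_code: str) -> Optional[Tuple[str, str]]:
    """Return (reference, standard_code), or None when the term has no standard mapping.

    Raises KeyError for a code that is not in the vocabulary at all.
    """
    entry = VOCA_BY_CODE[kbn_code]
    if entry.standard_code is None or entry.reference is None:
        return None
    return (entry.reference, entry.standard_code)
```

Keeping internal and standard codes side by side in one table means the clinical tables only need to store the internal code, while exports can be translated to the standard terminology on demand.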

4.2. Migration to the KBN CDM and System Architecture

The overall system architecture of the proposed KBN clinical information management system is shown in Figure 2. Normally, researchers in various institutions collect the disease-related medical information needed for research. Using the mapping table tool developed in this study, the collected data are integrated into the KBN CDM database (DB). The specifications of the proposed data model are described in Section 4.1.
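The paper does not detail how the mapping table tool works internally, so the following is only a hedged sketch of the general idea of table-driven migration. The source column names, the CDM column names, the value recoding, and the function `to_cdm` are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical mapping for one source site: source column -> (CDM table, CDM column).
COLUMN_MAP: Dict[str, Tuple[str, str]] = {
    "pt_no": ("BASICINFO", "kbn_id"),
    "gender": ("BASICINFO", "sex"),
    "visit_dt": ("REGISTRATION", "reg_date"),
}

# Hypothetical value recoding applied after a column has been renamed.
VALUE_MAP: Dict[Tuple[str, str], Dict[Any, Any]] = {
    ("BASICINFO", "sex"): {"M": "male", "F": "female"},
}


def to_cdm(source_row: Dict[str, Any]) -> Tuple[Dict[str, Dict[str, Any]], List[str]]:
    """Split one source record into per-table CDM rows.

    Returns the CDM rows keyed by table name, plus the sorted list of source
    columns that have no mapping and therefore need manual review.
    """
    tables: Dict[str, Dict[str, Any]] = {}
    unmapped: List[str] = []
    for column, value in source_row.items():
        target = COLUMN_MAP.get(column)
        if target is None:
            unmapped.append(column)
            continue
        table, cdm_column = target
        recode = VALUE_MAP.get(target, {})
        tables.setdefault(table, {})[cdm_column] = recode.get(value, value)
    return tables, sorted(unmapped)
```

Because the mapping lives in data rather than in code, each biobank could in principle maintain its own mapping without changing the loader; whether the actual tool is organized this way is not stated in the paper.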

In addition, our proposed system provides statistical visualization screens, search tools, and data extraction functions (in Excel format). To ensure General Data Protection Regulation (GDPR) compliance, administrators can set data processing restrictions on a per-user basis. Users can therefore access, modify, or delete data at any time according to the authority granted by the system, based on the data management policy of each institution.

4.3. System Development

The KBN integrated clinical information system was built with separate application and DB servers, as shown in Figure 3. The presentation layer is an HTML5-based web program developed with Java on the Spring Framework and SCSS; it is compatible with the Internet Explorer, Chrome, and Firefox web browsers. Windows 10 was used as the operating system. OpenJDK 1.8 was used as the compiler, and the Spring Tool Suite was used as the integrated development environment. The development environment was built on Spring Boot 2.2.5 and Spring Framework 5.2.4. The repository and presentation layers are connected through the application programming interface, and business logic is processed in the controller and service layers. A batch job generates the statistical data for each screen. MySQL was used as the database management system.

4.4. Implementation

Figure 4 shows the actual dashboard screen of the PC application of the proposed system. A variety of information, including specimen registration statuses according to hospital, age, sex, and type of disease, is shown so that the administrator can monitor the overall statistical status. Figure 4a shows an overview of the body of the selected patient. Users can click on each body part to see specific information in the frames in Figure 4c–f, which change according to the body part the user has chosen. Figure 4b shows the overall registration status of the whole database.

As noted above, Figure 4c–f are body part-specific status screens. In particular, Figure 4c shows the registration status of the hospital-specific biobank. Figure 4d shows the registration statistics according to age in intervals of 10 years. Figure 4e depicts the sex statistics, and Figure 4f shows statistics regarding the registration status according to the type of disease.
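The 10-year age statistic can be sketched as a simple bucketing step. The band labels below follow the categories used in Table 2; the function names and the treatment of missing or negative ages as "Unknown" are assumptions for this example and are not taken from the system itself.

```python
from collections import Counter
from typing import Iterable, Optional


def age_band(age: Optional[int]) -> str:
    """Map an age in years to one of the 10-year bands used in Table 2."""
    if age is None or age < 0:
        return "Unknown"  # assumption: missing or invalid ages are reported as Unknown
    if age < 10:
        return "Less than 10"
    if age >= 70:
        return "70 years or older"
    low = (age // 10) * 10
    return f"{low} to {low + 9}"


def registrations_by_age(ages: Iterable[Optional[int]]) -> Counter:
    """Count registrations per age band, as a dashboard batch job might precompute."""
    return Counter(age_band(age) for age in ages)
```

A call such as `registrations_by_age([5, 34, 71, None])` yields one count each for the "Less than 10", "30 to 39", "70 years or older", and "Unknown" bands.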

The information provided in this dashboard differs according to the user authorization level. For example, a KBN manager can not only see all disease or hospital-related data but can also check the upload status of the institutions. In contrast, users who are in charge of each institution’s biobank are only allowed to browse data uploaded from their own institution.
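The authorization rule described above amounts to filtering records by the user's role. The sketch below is a minimal illustration of that rule; the role names, the `User` class, and the record layout are hypothetical, while the biobank codes (SU, KS) are those listed in Figure 6.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class User:
    name: str
    role: str  # "kbn_manager" or "biobank_user" (hypothetical role names)
    biobank_code: Optional[str] = None


def visible_records(user: User, records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the records a user may browse under the rule described in the text."""
    if user.role == "kbn_manager":
        return list(records)  # managers see data from every institution
    if user.role == "biobank_user" and user.biobank_code is not None:
        return [r for r in records if r["biobank"] == user.biobank_code]
    return []  # unknown roles see nothing (a conservative default chosen for this sketch)
```

Applying the filter on the server side, before any statistics are computed, keeps dashboard totals consistent with what each user is allowed to see.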

Figure 5 shows the clinical information search screen, where users can search for and extract data by filtering according to the desired conditions. The first screen of the clinical information search page is shown in Figure 5a. Users can check the data summary for each receipt with the patient’s receipt number, KBN_ID, biobank, gender, sex, disease diagnosis code, receipt date, registrant, and registration date variables. If the user clicks the search button (Figure 5b), a sliding menu appears (Figure 5e) in which the data conditions can be selected. In the KBN_ID field, it is possible to search by directly entering the patient ID, and the desired data can be filtered by biobank, gender, occupation, or disease group, or by the presence or absence of each clinical data item. Once the search conditions are checked and the apply button (Figure 5f) is clicked, the data that meet the conditions can be viewed on the first page (Figure 5a).
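The filtering behavior described for the search screen can be sketched as follows. This is an assumption-laden illustration rather than the system's actual query logic: the row layout, the `items` field that records which clinical tables hold data for a receipt, and the `search` function are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional


def search(
    summaries: Iterable[Dict[str, Any]],
    kbn_id: Optional[str] = None,
    equals: Optional[Dict[str, Any]] = None,
    has_item: Optional[Dict[str, bool]] = None,
) -> List[Dict[str, Any]]:
    """Filter summary rows by exact ID, field values, and presence/absence of clinical items."""
    equals = equals or {}
    has_item = has_item or {}
    hits = []
    for row in summaries:
        if kbn_id is not None and row["kbn_id"] != kbn_id:
            continue
        if any(row.get(field) != wanted for field, wanted in equals.items()):
            continue
        present = row.get("items", set())
        # has_item maps an item name to True (must be present) or False (must be absent)
        if any((item in present) != required for item, required in has_item.items()):
            continue
        hits.append(row)
    return hits
```

All conditions are combined with a logical AND, which is one plausible reading of the screen; the paper does not say whether the conditions can also be combined with OR.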

When users click the Excel download button (Figure 5d), all data that meet the criteria are extracted as an Excel file. Because the data are anonymized before being uploaded to the system, the extracted data are also anonymized. Each table is written to a separate sheet of a single Excel file. When a data row is selected (Figure 5c), a pop-up window appears in which detailed data can be checked for each table. The full table list is shown in Figure 5h, and the data values for each table can be checked in Figure 5j. Basic patient information is displayed at the top of every screen (Figure 5i) for convenience. The clinical data page also displays a different range of data depending on the user's permissions. KBN managers have administrator accounts and can check and extract the data of all hospitals, whereas users with hospital manager accounts can only check the data of their own biobank.

5. Evaluation

5.1. Results

5.1.1. Participating Hospitals

Collected samples were obtained from the patients of 17 hospitals who agreed to provide personal information in 2019 (n = 65,754). Figure 6 lists the hospitals to which each biobank belongs. Each hospital has one biobank, and each hospital’s biobank is assigned a biobank code. The largest number of patients were registered at Seoul National University Hospital (n = 13,077, 19.89%), followed by Gyeongsang National University Hospital (n = 9522, 14.48%), and Pusan National University Hospital (n = 6575, 10.00%). Jeju National University Hospital has the smallest number of registered patients (n = 78, 0.12%).

5.1.2. Banking Samples

Figure 7 presents the top ten categories of collected samples. In 2019, the total number of specimen registrations was 706,491. Among them, serum was collected the most (229,942, 32.55%), followed by plasma (226,142, 32.01%) and buffy coat (79,507, 11.25%). These three sample types accounted for approximately 75.8% of the total number of specimens.
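The shares quoted above follow directly from the reported counts. The short check below recomputes them; the counts and the total are taken from the text, while the variable and function names are only illustrative.

```python
TOTAL_SPECIMENS = 706_491  # total specimen registrations in 2019, as reported

TOP_THREE = {
    "serum": 229_942,
    "plasma": 226_142,
    "buffy coat": 79_507,
}


def share(count: int, total: int = TOTAL_SPECIMENS) -> float:
    """Percentage of the total, rounded to two decimals as in the paper."""
    return round(100 * count / total, 2)


for name, count in TOP_THREE.items():
    print(f"{name}: {share(count)}%")

# Combined share of the three most frequently collected specimen types
print(f"top three combined: {share(sum(TOP_THREE.values()))}%")
```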

The basic demographics of the study population are shown in Table 2. Among the total population (n = 65,754), the majority were male (n = 34,531, 52.52%), whereas 31,223 were female (47.48%). The largest age group was patients aged 70 years or older (n = 17,352, 26.39%), followed by those aged 60 to 69 years (n = 15,563, 23.67%), and those aged 50 to 59 years (n = 12,010, 18.27%).

The 27 diseases or conditions can be classified into three types: (i) cancer-related diseases (10 diseases), (ii) non-cancer-related diseases (15 diseases), and (iii) healthy individuals and other diseases not included in the first two groups. Excluding the other disease group (n = 24,951, 37.95%), ischemic heart disease (n = 5473, 8.32%) was the most frequently registered disease type. The least common disease type was head and neck cancer (n = 104, 0.16%). Liver cancer and thyroid cancer information was collected by the largest number of hospitals (15 hospitals), whereas premature birth information was collected by the fewest (two hospitals).

6. Discussion and Conclusions

In the KBN, the active utilization of data is limited because the data collection method, data format, and collection items are different for each bank. The KBN CDM was built by modeling the database to enable standardized data collection, and an integrated management system was developed to utilize the collected data.

Each of the 17 participating hospitals provides databases for different types and numbers of diseases according to the kinds of data it collects. Conventional biobanks exist on their own. That is, each institution with a biobank has its own data format and individual management system. Naturally, this causes difficulties in the collaborative analysis of biobank data. This research therefore proposed the KBN CDM and system: a unified platform that integrates all the participating institutions with biobanks into one system, enabling integrated analysis, diverse research, and convergent research.

An important point is that the capacity of the KBN is not restricted to 27 types of disease. If any hospital or institution is willing to participate by adding data for a new type of disease, a new table can be constructed using the system configuration and will be compatible with the rest of the system. Because of this functionality, the KBN CDM can be expanded without limitations.

One limitation of this research is that it is integrated at a domestic level and not internationally. In addition, in the current system, each biobank manages the KBN_ID that identifies the patient, so if a patient is admitted to another hospital, the data are treated as the data of two patients, causing redundancy errors.
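The double-counting limitation can be made concrete with a toy example. The rows below are invented, and `person_key` stands for a hypothetical cross-bank identifier that, according to the text, the current system does not have; it is included only to show what the true count would be.

```python
from typing import Iterable, Tuple

# (biobank_code, locally managed KBN_ID, person_key)
# person_key is a HYPOTHETICAL cross-bank identifier used only for this illustration.
ROWS = [
    ("SU", "SU-0001", "person-A"),
    ("KS", "KS-0042", "person-A"),  # same person, registered again at a second hospital
    ("KS", "KS-0043", "person-B"),
]


def apparent_patients(rows: Iterable[Tuple[str, str, str]]) -> int:
    """Patient count when identity is defined per biobank, as in the current system."""
    return len({(bank, local_id) for bank, local_id, _ in rows})


def linked_patients(rows: Iterable[Tuple[str, str, str]]) -> int:
    """Patient count if a shared identifier were available across biobanks."""
    return len({person for _, _, person in rows})
```

With these rows, `apparent_patients(ROWS)` is 3 while `linked_patients(ROWS)` is 2, which is the kind of redundancy the authors describe.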

Therefore, research is needed to improve work efficiency and data quality by developing an extract, transform, and load process that can automatically extract data from EMRs in KBN CDM format. Future research should also address international integration, including linkage with international biobank data-sharing standards such as MIABIS, which is feasible because the KBN CDM follows international code and mapping standards [23,24,25].

Author Contributions

Conceptualization, S.-J.K. and I.Y.C.; methodology, S.-J.K., W.C. and K.-H.K.; software, S.-J.K., W.C. and K.-H.K.; validation, S.-J.K., W.C. and K.-H.K.; formal analysis, S.-J.K.; investigation, H.M.; resources, H.M.; data curation, S.-J.K. and H.M.; writing—original draft preparation, S.-J.L., S.-J.K. and W.C.; writing—review and editing, H.M., S.-J.L., S.-W.O. and I.Y.C.; visualization, S.-J.K.; supervision, I.Y.C.; project administration, H.M. and I.Y.C.; funding acquisition, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was conducted with data from the Korea Biobank Network. The study was supported and funded by the Korea Disease Control and Prevention Agency (#4845-303).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing was not applicable to this study. Data supporting the findings of this study are available from the National Biobank of Korea.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhang, X.; Du, L.; Qiao, Y.; Zhang, X.; Zheng, W.; Wu, Q.; Chen, Y.; Zhu, G.; Liu, Y.; Bian, Z.; et al. Ferroptosis is governed by differential regulation of transcription in liver cancer. Redox Biol. 2019, 24, 101211.
2. Bellos, I.; Pergialiotis, V.; Perrea, D.N. Kidney biopsy findings in vancomycin-induced acute kidney injury: A pooled analysis. Int. Urol. Nephrol. 2021, 1–12.
3. Mecatti, G.C.; Sánchez-Vinces, S.; Fernandes, A.M.A.P.; Messias, M.C.F.; de Santis, G.K.D.; Porcari, A.M.; Marson, F.A.L.; Carvalho, P.O. Potential lipid signatures for diagnosis and prognosis of sepsis and systemic inflammatory response syndrome. Metabolites 2020, 10, 359.
4. Yu, Y.; Ruddy, K.J.; Hong, N.; Tsuji, S.; Wen, A.; Shah, N.D.; Jiang, G. ADEpedia-On-OHDSI: A next generation pharmacovigilance signal detection platform using the OHDSI common data model. J. Biomed. Inform. 2019, 91, 103119.
5. Zhang, X.; Wang, L.; Miao, S.; Xu, H.; Yin, Y.; Zhu, Y.; Dai, Z.; Shan, T.; Jing, S.; Wang, J.; et al. Analysis of treatment pathways for three chronic diseases using OMOP CDM. J. Med. Syst. 2018, 42, 260.
6. Choi, S.A.; Kim, H.; Kim, S.; Yoo, S.; Yi, S.; Jeon, Y.; Hwang, H.; Kim, K.J. Analysis of antiseizure drug-related adverse reactions from the electronic health record using the common data model. Epilepsia 2020, 61, 610–616.
7. Steven, V.V.A.; Singer, M.; Fink, M.Y. A call to standardize preanalytic data elements for biospecimens. Physiol. Behav. 2019, 176, 139–148.
8. Grizzle, W.E.; Bledsoe, M.J.; Al Diffalha, S.; Otali, D.; Sexton, K.C. The utilization of biospecimens: Impact of the choice of biobanking model. Biopreserv. Biobank. 2019, 17, 230–242.
9. An, A.R.; Kim, K.M.; Park, H.S.; Jang, K.Y.; Moon, W.S.; Kang, M.J.; Lee, Y.C.; Kim, J.H.; Chae, H.J.; Chung, M.J. Association between expression of 8-OHdG and cigarette smoking in non-small cell lung cancer. J. Pathol. Transl. Med. 2019, 53, 217–224.
10. Byun, J.K.; Choi, Y.K.; Kang, Y.N.; Jang, B.K.; Kang, K.J.; Jeon, Y.H.; Lee, H.W.; Jeon, J.H.; Koo, S.H.; Jeong, W.I.; et al. Retinoic acid-related orphan receptor alpha reprograms glucose metabolism in glutamine-deficient hepatoma cells. Hepatology 2015, 61, 953–964.
11. Yun, D.; Jang, M.J.; An, J.N.; Lee, J.P.; Kim, D.K.; Chin, H.J.; Kim, Y.S.; Lee, D.S.; Han, S.S. Effect of steroids and relevant cytokine analysis in acute tubulointerstitial nephritis. BMC Nephrol. 2019, 20, 88.
12. Park, J.; Shin, Y.; Kim, T.H.; Kim, D.H.; Lee, A. Plasma metabolites as possible biomarkers for diagnosis of breast cancer. PLoS ONE 2019, 14, e0225129.
13. Cho, S.Y.; Hong, E.J.; Nam, J.M.; Han, B.; Chu, C.; Park, O. Opening of the National Biobank of Korea as the infrastructure of future biomedical science in Korea. Osong Public Health Res. Perspect. 2012, 3, 177–184.
14. Park, O.; Cho, S.Y.; Shin, S.Y.; Park, J.S.; Kim, J.W.; Han, B.G. A strategic plan for the second phase (2013–2015) of the Korea biobank project. Osong Public Health Res. Perspect. 2013, 4, 107–116.
15. Fitzhenry, F.; Resnic, F.S.; Robbins, S.L.; Denton, J.; Nookala, L.; Meeker, D.; Ohno-Machado, L.; Matheny, M.E. Creating a common data model for comparative effectiveness with the observational medical outcomes partnership. Appl. Clin. Inform. 2015, 6, 536–547.
16. Cambon-Thomsen, A.; Rial-Sebbag, E.; Knoppers, B.M. Trends in ethical and legal frameworks for the use of human biobanks. Eur. Respir. J. 2007, 30, 373–382.
17. Norlin, L.; Fransson, M.N.; Eriksson, M.; Merino-Martinez, R.; Anderberg, M.; Kurtovic, S.; Litton, J.E. A minimum data set for sharing biobank samples, information, and data: MIABIS. Biopreserv. Biobank. 2012, 10, 343–348.
18. Merino-Martinez, R.; Norlin, L.; van Enckevort, D.; Anton, G.; Schuffenhauer, S.; Silander, K.; Mook, L.; Holub, P.; Bild, R.; Swertz, M.; et al. Toward global biobank integration by implementation of the minimum information about biobank data sharing (MIABIS 2.0 Core). Biopreserv. Biobank. 2016, 14, 298–306.
19. Isabelle, M.; Teodorovic, I.; Morente, M.M.; Jaminé, D.; Passioukov, A.; Lejeune, S.; Therasse, P.; Dinjens, W.N.; Oosterhuis, J.W.; Lam, K.H.; et al. TuBaFrost 5: Multifunctional central database application for a European tumor bank. Eur. J. Cancer 2006, 42, 3103–3109.
20. David, G.A.; Matthew, P.A.; Olaf, A.; Steven, C.; Carolyn, H.; Patrick, R.H.; Melissa, M.; Ikue, N.; Jane, P.; Cynthia, S.; et al. Chapter 3—Autism BrainNet: A network of postmortem brain banks established to facilitate autism research. In Handbook of Clinical Neurology; Huitinga, L., Webster, M.J., Eds.; Elsevier: Amsterdam, The Netherlands, 2018; Volume 150, pp. 31–39.
21. Patil, S.; Majumdar, B.; Awan, K.H.; Sarode, G.S.; Sarode, S.C.; Gadbail, A.R.; Gondivkar, S. Cancer oriented biobanks: A comprehensive review. Oncol. Rev. 2018, 12, 357.
22. Yang, Y.; Liu, Y.M.; Wei, M.Y.; Wu, Y.F.; Gao, J.H.; Liu, L.; Zhou, W.P.; Wang, H.Y.; Wu, M.C. The liver tissue bank and clinical database in China. Front. Med. China 2010, 4, 443–447.
23. Trouillon, T.; Dance, C.R.; Gaussier, E.; Welbl, J.; Riedel, S.; Bouchard, G. Knowledge graph completion via complex tensor factorization. J. Mach. Learn. Res. 2017, 18, 1–38.
24. Xia, S.; Zhang, Z.; Li, W.; Wang, G.; Giem, E.; Chen, Z. GBNRS: A novel rough set algorithm for fast adaptive attribute reduction in classification. IEEE Trans. Knowl. Data Eng. 2020, 1, 1.
25. Xia, S.; Zheng, Y.; Wang, G.; He, P.; Li, H.; Chen, Z. Random space division sampling for label-noisy classification or imbalanced classification. IEEE Trans. Cybern. 2021, 51, 1–14.

Figure 1. KBN clinical CDM unified modeling language.

Figure 2. KBN clinical information management system.

Figure 3. Technical specifications.

Figure 4. KBN Dashboard screenshot: (a) body layout, (b) registration status, (c) registration status with respect to each hospital-specific biobank, (d) registration status by age, (e) registration status by sex, and (f) registration status by type of disease.

Figure 5. Clinical information search screenshot (a) first page, (b) search button, (c) row data, (d) Excel download, (e) sliding menu for filtering, (f) apply button, (g) clinical data viewer, (h) table list, (i) basic patient information, and (j) clinical data.

Figure 6. Registered participants by hospital. SU: Seoul National University Hospital, KS: Gyeongsang National University Hospital, BS: Pusan National University Hospital, KM: Keimyung University Dongsan Medical Center, CB: Chungbuk National University Hospital, CN: Chungnam National University Hospital, JN: Chonnam National University Hwasun Hospital, IJ: Inje University Paik Hospital, AS: Asan Medical Center, JB: Jeonbuk National University Hospital, KB: Kyungpook National University Hospital, AJ: Ajou University Hospital, KW: Kangwon National University Hospital, SC: Soonchunhyang University Bucheon Hospital, KR: Korea University Guro Hospital, WK: Wonkwang University Hospital, JJ: Jeju National University Hospital.

Figure 7. Specimen count.

Table 1. KBN clinical common data model table.

Table	Columns	Description
BASICINFO	6	Donor ID, bank name, sex, birth, job
REGISTRATION	9	Age at registration, disease date, disease diagnosis
SPECIMEN	7	Specimen collection date, specimen type, number of specimens
STAGE	8	AJCC_T, AJCC_N, AJCC_M, AJCC_stage, AJCC pre-op, AJCC post-op
BDMEASURE	19	Height, weight, BMI, blood pressure, pregnancy status, weight loss, ECOG, waist circumference, pulse rate, body temperature, respiration rate, head circumference, UBV_Ar (umbilical cord blood arterial), UBV_Ve (umbilical cord blood vessels)
DRINSMOK	16	Drinking/smoking experience, duration, quit duration, frequency, amount
HISTORY	10	History disease, drug history
FAMHISTORY	7	Family history of the same disease, family relations, family history of allergic reactions
CLINEXAM	10	Laboratory examination name, unit, result of numeric type, result of character type, date
NOTE	10	Pathology record, surgical record, radiology report, functional test record, record date, record free text
EXAM	8	Examination name, results, date
OPERATION_DETAIL	6	Operation record detail, operation detail results, operation date
PATHOLOGY	6	Pathology record detail, pathology detail results, pathology date
TREATMENT	7	Treatment name, treatment detail, treatment date
SYMPTOM	6	Symptom name, symptom results, symptom onset date
SURVEY	8	Survey name, survey date, survey result
FOLLOWUP	20	Renal replacement therapy, recurrence, metastasis, death, cause of death
VOCA	5	KBN name, KBN code, standard code, standard name, reference

Table 2. Basic demographics.

Variable	Category	Total N (%)
Sex	Male	34,531 (52.52%)
	Female	31,223 (47.48%)
Age	Less than 10	2953 (4.49%)
	10 to 19	1603 (2.44%)
	20 to 29	2976 (4.53%)
	30 to 39	5866 (8.92%)
	40 to 49	7411 (11.27%)
	50 to 59	12,010 (18.27%)
	60 to 69	15,563 (23.67%)
	70 years or older	17,352 (26.39%)
	Unknown	20 (0.03%)
Disease	Ischemic heart disease	5473 (8.32%)
	Stomach cancer	3223 (4.90%)
	Blood cancer	3155 (4.80%)
	Lung cancer	3061 (4.66%)
	Colon cancer	2611 (3.97%)
	Breast cancer	2511 (3.82%)
	Pregnancy, childbirth, postpartum care	2389 (3.63%)
	Thyroid cancer	2220 (3.38%)
	Liver cancer	2129 (3.24%)
	Healthy individuals	1758 (2.67%)
	Prostate cancer	1548 (2.35%)
	Glomerular disease/renal failure	1541 (2.34%)
	Musculoskeletal disorders	1538 (2.34%)
	Diabetes	1165 (1.77%)
	Kidney transplant status	1096 (1.67%)
	Hepatitis	907 (1.38%)
	Kidney cancer	901 (1.37%)
	In situ breast carcinoma, benign breast tumor, breast disorder	648 (0.99%)
	Prostatic hyperplasia	570 (0.87%)
	Cirrhosis of the liver	539 (0.82%)
	Inflammatory growth disease	528 (0.80%)
	Asthma	455 (0.69%)
	Cerebral infarction	300 (0.46%)
	Chronic obstructive pulmonary disease	272 (0.41%)
	Premature birth	161 (0.24%)
	Head and neck cancer	104 (0.16%)
	Other diseases *	24,951 (37.95%)
Total		65,754

* Top-5 diseases in other diseases: bladder cancer (n = 1042), other cerebrovascular diseases (n = 620), ovarian cancer (n = 540), pancreatic cancer (n = 494), and benign cancer of other and ill-defined parts of the digestive system (n = 487).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
