The Korea Cancer Big Data Platform (K-CBP) for Cancer Research
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Sources
2.2. De-Identification of Records
2.3. Creation of Clinical Datasets and Cancer Registries
2.4. Data Validation and Monitoring
2.5. Data Merging
2.6. Construction of the K-CBP
2.7. Legal and Regulatory Processes
3. Results
3.1. Collection of Data
3.2. Construction of Clinical Datasets and Clinical Cancer Registries
3.3. Data Merging
3.4. Construction of the K-CBP
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Conflicts of Interest
References






| Category | Subcategory | Number of Subjects | Description of Features |
|---|---|---|---|
| Number of patients | | 515,780 | |
| Medical records | Pathology records | 925,599 | Specimen type, method of examination, clinical and pathological diagnosis |
| Medical records | Order sheet | 11,703,931 | Orders related to treatment and discharge |
| Tumor bank | Blood sample | 32,760 | Information on pathologic stages; normal and tumor tissues, sample status, and location information |
| Tumor bank | Tissue sample | 17,813 | |
| Genomics | NGS test | 280 | Mutations detected on panel-based NGS tests |
| Clinical cancer registries at the NCC Hospital | Cancer – core features | 144,944 | Patient information, diagnosis, images, normal and tumor tissues, surgery, pathological and clinical stages, chemotherapy, radiation therapy, recurrence and metastasis, death |
| Clinical cancer registries at the NCC Hospital | Prostate cancer | 5,167 | |
| Clinical cancer registries at the NCC Hospital | Lung cancer | 24,504 | |
| Clinical cancer registries at the NCC Hospital | Pancreatobiliary cancer | 9,966 | |
| Clinical cancer registries at the NCC Hospital | Kidney cancer | 2,886 | |
| Clinical cancer registries at the NCC Hospital | Ovarian cancer | 4,240 | |
| Clinical cancer registries at the NCC Hospital | Colorectal cancer | 17,328 | |
| Clinical cancer registries at the NCC Hospital | Liver cancer | 12,932 | |
| Clinical cancer registries at the NCC Hospital | Breast cancer | 18,287 | |
| Clinical cancer registries at the NCC Hospital | Gastric cancer | 15,056 | |
| Clinical cancer registries at the NCC Hospital | Thyroid cancer | 10,404 | |
| Cancer statistical registry data from the National Cancer Control initiatives | Korea Central Cancer Registry | 2,745,050 | Nationwide data on the diagnosis and treatment of cancer and survival of patients |
| Cancer statistical registry data from the National Cancer Control initiatives | National Cancer Screening Program | 90,197,402 | Data obtained from nationwide screening for stomach, liver, colorectal, breast, and uterine cervix cancers |
| Cancer statistical registry data from the National Cancer Control initiatives | Financial Aid Program for Cancer Patients | 543,325 | Data relevant to financial aid for low-income cancer patients |
| Cancer statistical registry data from the National Cancer Control initiatives | Hospice and Palliative Care | 56,433 | Information on performance status (ECOG), admission to and discharge from hospice institutions, and the use of hospice care |
| Clinical cancer registries from external sources | Prostate cancer | 7,934 | Complications, surgery |
| Clinical cancer registries from external sources | Lung cancer | 3,496 | Results of biopsy, gene mutation, surgery |
| Clinical cancer registries from external sources | Pancreatic cancer | 538 | Tumor, physical examination findings, surgery |

| Cancer Type | Structured Data | Unstructured Data | Manual Input Data | Total |
|---|---|---|---|---|
| Prostate cancer | 165 | 66 | 13 | 244 |
| Lung cancer | 146 | 85 | 4 | 235 |
| Pancreatobiliary cancer | 319 | 34 | 54 | 407 |
| Kidney cancer | 369 | 70 | 41 | 480 |
| Ovarian cancer | 428 | 59 | 32 | 519 |
| Colorectal cancer | 230 | 51 | 84 | 365 |
| Liver cancer | 216 | 84 | 50 | 350 |
| Breast cancer | 228 | 85 | 27 | 340 |
| Gastric cancer | 175 | 141 | 14 | 330 |
| Thyroid cancer | 156 | 244 | 4 | 404 |

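
As a quick consistency check on the counts in the table above, each total should equal the sum of the structured, unstructured, and manual input counts for that cancer type. The sketch below hard-codes the rows of the table; the names `ITEM_COUNTS` and `find_total_mismatches` are illustrative and do not come from the K-CBP itself.

```python
# Rows copied from the table above:
# (structured, unstructured, manual input, stated total)
ITEM_COUNTS = {
    "Prostate cancer": (165, 66, 13, 244),
    "Lung cancer": (146, 85, 4, 235),
    "Pancreatobiliary cancer": (319, 34, 54, 407),
    "Kidney cancer": (369, 70, 41, 480),
    "Ovarian cancer": (428, 59, 32, 519),
    "Colorectal cancer": (230, 51, 84, 365),
    "Liver cancer": (216, 84, 50, 350),
    "Breast cancer": (228, 85, 27, 340),
    "Gastric cancer": (175, 141, 14, 330),
    "Thyroid cancer": (156, 244, 4, 404),
}


def find_total_mismatches(rows):
    """Return the cancer types whose stated total differs from the sum of its parts."""
    return [
        cancer_type
        for cancer_type, (structured, unstructured, manual, total) in rows.items()
        if structured + unstructured + manual != total
    ]


if __name__ == "__main__":
    # Prints [] because every row in the table is internally consistent.
    print(find_total_mismatches(ITEM_COUNTS))
```

An empty result means no row needs correction; any cancer type returned would point to a transcription error in that row.
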
| Term | Definition | 
|---|---|
| Alternative patient key | A primary key that replaces a direct identifier with a random 8-digit number | 
| De-identification | Elimination of direct identifiers and quasi-identifiers so that individuals cannot be identified | 
| Clinical cancer registry | Outcome data, such as diagnosis, treatment, and surgery, selected from cancer clinical data and refined into a dataset that can be used meaningfully | 
| National cancer control initiative data | Cancer-related data collected under a nationally led project | 
| External clinical cancer registry | Cancer-related clinical data from multiple institutions, including diagnosis, treatment, or surgery, selected and refined into an outcome dataset | 
| Structured data | Data that can be represented by a specific number or word and whose format is standardized | 
| Unstructured data | Free-text format data | 
| Manual input data | Data that cannot be imported automatically by computer and must be entered manually | 
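
The "alternative patient key" defined above can be illustrated with a minimal sketch. It assumes, hypothetically, that direct identifiers arrive as strings, that the mapping is held in memory, and that keys carry no leading zeros. The function name `assign_alternative_keys` and the use of Python's `secrets` module are illustrative choices, not the K-CBP implementation.

```python
import secrets


def assign_alternative_keys(direct_identifiers):
    """Map each distinct direct identifier to a unique random 8-digit key.

    Assumption: keys are drawn from 10000000..99999999, so every key has
    exactly 8 digits and no leading zero.
    """
    mapping = {}
    used_keys = set()
    for identifier in direct_identifiers:
        if identifier in mapping:
            continue  # a repeated identifier keeps the key it already has
        while True:
            key = 10_000_000 + secrets.randbelow(90_000_000)
            if key not in used_keys:
                break  # redraw on the rare collision so keys stay unique
        used_keys.add(key)
        mapping[identifier] = str(key)
    return mapping


if __name__ == "__main__":
    # Hypothetical identifiers; real direct identifiers would never be printed.
    keys = assign_alternative_keys(["patient-A", "patient-B", "patient-A"])
    print(len(keys))  # 2 distinct patients
```

Because the key is random rather than derived from the identifier, it cannot be reversed without the mapping table, so in practice that table would need to be stored separately and protected.
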
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Cha, H.S.; Jung, J.M.; Shin, S.Y.; Jang, Y.M.; Park, P.; Lee, J.W.; Chung, S.H.; Choi, K.S. The Korea Cancer Big Data Platform (K-CBP) for Cancer Research. Int. J. Environ. Res. Public Health 2019, 16, 2290. https://doi.org/10.3390/ijerph16132290