Medical Application of Big Data: Between Systematic Review and Randomized Controlled Trials

Featured Application: Systematic review and meta-analysis deserves to be considered the gold standard of evidence-based medicine in the era of big data. Abstract: In terms of medical health, we are currently living in the era of data science, which has brought tremendous change. Big data related to healthcare includes medical data, genome data


Introduction
Most researchers aim to conduct randomized controlled trials (RCTs) in their fields of interest. RCTs represent the core of clinical research, and they are the type of study that can best reveal causal relationships in disease [1]. However, RCTs require substantial research funds, and even if such funds are available, it is difficult to proceed if the corresponding drug and placebo are not provided. These conditions make it very difficult for young and less-established researchers to conduct RCTs. One way to obtain research funds for an RCT is to conduct a pilot RCT that yields highly promising results. RCTs sit at the highest level of evidence-based medicine (EBM), and systematic reviews (SRs) can provide the most solid evidence when they target RCTs [2,3]. The problem is that researchers have to accept that they cannot always conduct the RCTs they wish to conduct. There are also research topics that cannot be explored using RCTs. An alternative for such cases is an observational study: by designing and analyzing a retrospective cohort using big data, such as the data of the Insurance Corporation discussed in this review, a high level of evidence can be achieved, although it is still not equivalent to that of RCTs. Clinicians are advised to follow treatment guidelines, which are established based on evidence. Therefore, a retrospective study using big data is a very important starting point for research topics for which RCTs cannot be conducted or for when a new RCT is being devised. The present review covers the definitions and types of big data in the medical health area, the current application status of big data in the medical health area, features of the National Health Claimed Data for medical research, real applications of the Korean National Health Claimed Data for medical research, and EBM using SR.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 2 of 11

Definition and Types of Big Data in Medical Health Area
Big data can be understood in two ways [4]: first, as a technology, meaning the analytic skills needed to handle data so large that classic analytic tools could not previously process them; second, as the huge data themselves. Big data can be categorized into classical (structured) and non-classical (unstructured) data. Non-classical data are non-numerical data without a fixed structure, including pictures, sound, and words. The main purposes of using big data are to make policies for public health, medical institutions, business, and research. However, big data also carries the risk of being used for purposes other than the intended one.
Healthcare big data includes medical data, genomic data, and life-log data generated by people during their lives. It is scattered across a very wide range and exists on a huge scale (Figure 1), so it is essential to build an integrated platform that considers precise measurement, transmission, storage, and security methods for the proper use of such data [5,6]. Healthcare complexity arises from the range of health conditions and their co-morbidities, varied treatments and outcomes, and intricate study designs, analytical methods, and data interpretation approaches in healthcare data management [7]. As a result, domain knowledge can play a dominant role in both data analysis and interpretation of results [8]. Researchers have proposed many definitions of medical big data, and some have categorized medical big data, in comparison with traditional clinical data, according to who owns the data [9]. Medical big data is often difficult to access, and most researchers in the medical field are hesitant to share their data because of the risk of misuse by other parties [10]. In addition, medical big data is relatively structured because it adheres to protocols regarding the collection of an individual's medical information [11].
Lee and Yoon [12] summarized the differences between medical big data analysis and traditional statistical analysis. The main difference is that traditional statistical analysis focuses on hypothesis testing, while medical big data analysis focuses on hypothesis generation. In addition, traditional statistical analysis is conducted to interpret causal relationships, while medical big data analysis focuses on correlations between variables or the identification of specific patterns [12]. In fact, the power of big data lies in identifying correlations, not necessarily in establishing the significance or meaning of these correlations [13].
The potential value of medical big data has been demonstrated in: (1) predictive modelling for risk and resource use; (2) population management; (3) drug and medical device safety surveillance; (4) disease and treatment heterogeneity; (5) precision medicine and clinical decision support; (6) quality of care and performance measurement; (7) public health; and (8) research applications [14]. The analysis of healthcare big data using artificial intelligence is expected to facilitate the identification of disease-specific patterns of interest as well as the prevention, management, and treatment of diseases.

Current Application Status of Big Data in Medical Health Area
Big data is increasingly recognized for its potential benefits in public health. In 2011, the United Nations (UN) raised the issues of the 'NCD Crisis' and 'GOAL 25 by 25' [15]. Noncommunicable diseases (NCDs) are chronic diseases including cancer, cardiovascular disease, diabetes, and dementia. NCDs are receiving more attention in public health than communicable diseases (CDs) because of their high prevalence and mortality, which are rising sharply as societies age. The UN has declared that all efforts must be made to reduce health inequality in NCDs. Many research groups have started conducting studies using big data to investigate the global burden of NCDs and inequality in NCDs. Moreover, the term 'health care crisis' has recently appeared [16]. Korea's average growth rate in individual medical expenses is high among Organization for Economic Co-operation and Development (OECD) countries [17]. Concepts similar to those investigated in evidence-based medicine are now being combined with research using big data.
Aside from its important role in public health, big data is also widely used in the health care industry, where the focus is on profit-making. Medical industrialization is evolving because the delivery of medical care is rapidly changing from an analogue style to a digital style. Medical industrialization using big data has opened a new era that includes the development of diagnostic strategies and of specific targeted agents.
Among medical big data sources, the one we have encountered most recently is that of the World Health Organization (WHO). Throughout the recent COVID-19 pandemic, the WHO's big data platform has provided real-time information on the status of the COVID-19 outbreak and the death rate in each country. The WHO's World Health Data Hub is a comprehensive digital platform for global health data. It provides end-to-end solutions to collect, store, analyze, and share data [18]. In addition to COVID-19, the status of various diseases, including NCDs, is reported annually by the WHO.
In the utilization of big data, the most crucial issue is personal de-identification, which is closely related to ethical concerns. In the US, the Health Insurance Portability and Accountability Act (HIPAA) was enacted in 1996 [19]. It mandates that personal information be de-identified, a task mainly performed by health care clearinghouses. To manage big data transparently, a government-led management system is necessary.

Korean National Health Claimed Data for Medical Research
Korea has a National Health Insurance Database system. This data system allows for a full survey of the entire population of Korea. There are two main National Health Insurance Databases: the National Health Insurance Service (NHIS) and the Health Insurance Review and Assessment Service (HIRA) (Figure 2). The main merits of these databases include their large statistical power, which allows even small statistical differences to be detected. They also have low levels of statistical error and high levels of reproducibility. However, they are claims data, which means that they were not originally created for research, so they require advanced processing skills. The biggest advantage of Korea's medical big data is that, in the case of insurance claim data, it covers 99% of national data. Besides accessing, applying for, and using the data of each of nine institutions in the health and medical field individually, a project is under way to open the data of these nine institutions to researchers so that they can be combined at the individual level and used for public-purpose research. Therefore, individual clinical records can be tracked more specifically [20].

Advantages and Disadvantages of Big Data Research Using Industrial Data
In Korea, there are various types of big data. The most commonly encountered are Health Insurance Review and Assessment Service data and Health Insurance Corporation data. There are also data from the National Health and Nutrition Examination Survey [21], the National Statistical Office, and the Korea Centers for Disease Control and Prevention. The biggest advantage of these sources is that statistical significance is easy to verify because far more data can be obtained than from hospitals. However, this can also be a major drawback: in big data research, even a slight change in methodology often changes the direction of the results.
In Korea, the National Health Insurance Service operates a system that reduces part of the deductible for patients with rare diseases, cancer, and other severe and intractable diseases that involve high medical expenses. Rare and intractable diseases [22] are defined by specific V codes, which makes it very convenient to classify diseases. The incidence and prevalence of these diseases can be obtained simply by organizing the disease codes (Table 1). For example, organ transplant patients can be found quickly using specific codes such as kidney transplant (V005), liver transplant (V013), pancreas transplant (V014), and heart transplant (V015), without having to look up each organ-specific disease manually by ICD code. Similarly, dementia is spread over various ICD codes (F00.1-F01.3), but the overall status can be examined quickly using the specific code V810.
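As a rough illustration of code-based case finding, the sketch below filters a toy list of claim records by the transplant and dementia V codes quoted in this section. The record layout (`patient_id`, `v_code`) and the sample rows are hypothetical; actual claims tables use a different schema and require the processing described in this review.

```python
# Hypothetical claim records; field names and rows are illustrative only.
claims = [
    {"patient_id": "A", "v_code": "V005"},
    {"patient_id": "B", "v_code": "V013"},
    {"patient_id": "A", "v_code": "V005"},  # repeated claim for the same patient
    {"patient_id": "C", "v_code": "V810"},
    {"patient_id": "D", "v_code": None},    # claim without a special V code
]

# V codes quoted in the text for organ transplant recipients.
TRANSPLANT_CODES = {
    "V005": "kidney transplant",
    "V013": "liver transplant",
    "V014": "pancreas transplant",
    "V015": "heart transplant",
}


def patients_with_codes(records, codes):
    """Return distinct patient IDs having at least one claim with any of the codes."""
    return {r["patient_id"] for r in records if r["v_code"] in codes}


transplant_patients = patients_with_codes(claims, TRANSPLANT_CODES)
dementia_patients = patients_with_codes(claims, {"V810"})
print(sorted(transplant_patients), sorted(dementia_patients))
```

Counting distinct patients rather than claims matters here, because one patient usually generates many claims under the same code.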
The big data research we conducted mainly uses health insurance corporation data. The problem is that these health insurance data themselves are not created for research purposes, so they need to be processed before being used for such purposes.
For diseases not covered by these V codes, an operational definition is required, and it must be validated. In most cases, a retrospective cohort is created, and many studies [23,24] compare specific clinical indicators between subjects who were exposed to a risk factor and those who were not. Even if the operational definition is sound, the results of a study can still be substantially influenced by methodological details, such as the definition of exposure and the length of the incubation period allowed after exposure.
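To make the sensitivity to such details concrete, the following sketch counts outcome events in a toy cohort under different lag windows after the index date. The cohort rows, the function name `count_events`, and the rule of dropping patients whose outcome falls inside the window are all assumptions chosen for illustration; they are one possible convention, not the procedure of any cited study.

```python
from datetime import date, timedelta

# Hypothetical cohort rows: (patient_id, index_date, first_outcome_date or None).
cohort = [
    ("P1", date(2015, 1, 10), date(2015, 2, 1)),    # outcome 22 days after index
    ("P2", date(2015, 3, 5), date(2016, 6, 30)),    # outcome long after index
    ("P3", date(2015, 7, 20), None),                # no outcome observed
    ("P4", date(2015, 9, 1), date(2015, 12, 15)),   # outcome 105 days after index
]


def count_events(rows, lag_days):
    """Return (patients analysed, outcome events) for a given lag window.

    Patients whose outcome occurs fewer than `lag_days` after the index date
    are excluded, since such early outcomes may predate the exposure effect.
    """
    lag = timedelta(days=lag_days)
    analysed, events = 0, 0
    for _pid, index_date, outcome_date in rows:
        if outcome_date is not None and outcome_date - index_date < lag:
            continue  # outcome too close to the index date: excluded
        analysed += 1
        if outcome_date is not None:
            events += 1
    return analysed, events


for lag_days in (0, 30, 180):
    print(lag_days, count_events(cohort, lag_days))
```

Even in this four-patient example, the number of analysed patients and events changes with the lag window, which is why the window should be fixed in advance and justified from previous studies.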

Big Data Research Design
Big data analytics is largely based on traditional statistical analysis and machine learning (ML). The boundaries between statistical inference and ML are debatable, but while some methods fall into one or the other, many are used in both [25]. Statistics focuses on making inferences to prove a specific hypothesis, while ML is about finding generalizable patterns of prediction [26].
Big data with more variables (features) than observations leads to high dimensionality: there are more variables to control while the observed data remain relatively limited, so the space between observations becomes increasingly empty and the performance of the model eventually decreases [7,27]. In other words, a major concern in big data analysis is to weigh the loss in model performance caused by the curse of dimensionality against the actual benefit of additional data. In general, too many variables cause the model to overfit and generalize poorly, so the curse of dimensionality is addressed by reducing the dimensionality [28] or selecting variables [29].
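The "empty space" argument can be illustrated with simple counting. In the sketch below, each variable is split into a fixed number of bins, and the function returns the largest fraction of the resulting cells that a given number of observations could possibly occupy. The function name and the choice of 1000 observations with 10 bins per variable are assumptions made only for this example.

```python
def max_occupied_fraction(n_obs, bins_per_variable, n_variables):
    """Upper bound on the fraction of grid cells that can contain an observation."""
    cells = bins_per_variable ** n_variables  # cells grow exponentially with variables
    return min(1.0, n_obs / cells)


# With 1000 observations and 10 bins per variable, coverage collapses quickly.
for n_variables in (1, 3, 6, 10):
    print(n_variables, max_occupied_fraction(1000, 10, n_variables))
```

With six variables, at most one cell in a thousand can contain any observation, so a model has almost no data from which to learn in most of the variable space. Dimensionality reduction or variable selection shrinks the exponent.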
The most important factor in big data research is applying the right methodology. So how do we apply the right methodology? The answer lies in SR. We conduct SR to learn what past studies on a topic have found, use big data to understand current trends, and apply RCTs to study the preventive drugs or treatments that will be needed in the future. Conducting SR allows one to consider the design of all studies on the subject and to know the strengths and weaknesses of each. Moreover, thorough qualitative evaluation of previous studies, whether observational studies or RCTs, can teach one how to overcome their shortcomings. Fortunately, disease codes are standardized and commonly used worldwide, and many papers using insurance data have been published worldwide, which is of great help in establishing research methodologies.
Here we suggest some important questions and answers (Q/As):
Q: What is the first step in performing research using medical health claims data?
A: Choose the database you want to use. Incidence and mortality rates of specific diseases, complication rates, and similar measures can be studied through access to an individual database. However, if you plan a retrospective cohort study with a long follow-up period, it is convenient to use a platform that integrates multiple databases.
Q: Are there any restrictions on individual database access? Or are there any restrictions on access to the platform?
A: In Korea, access is allowed for non-commercial research in accordance with the Health and Medical Technology Promotion Act. This applies to both individual database access and platforms. In the case of foreign countries, access for research purposes is determined after protocol review.
Q: Why is SR necessary for research using medical claims data?
A: Due to the nature of observational research, it is not easy to unify research methodology, and there is a risk that results may vary depending on the methodology. Researchers can determine the optimal methodology through SR and avoid possible bias.
Q: What should a researcher keep in mind when researching medical big data?
A: Since methodology is important in observational studies, the accuracy of the definition of the disease, the accuracy of the definition of occurrence, and, above all, the setting of the control group and index date are the most important.
Q: How could we use SR in designing an observational study using medical big data specifically?
A: In a study of liver donors using big data platforms, including data from the Korea Centers for Disease Control and Prevention, the Health Insurance Review and Assessment Service, the National Statistical Office, and health insurance claims, SR was performed (Figure 3) [30]. Figure 3a shows that neither the odds ratio (OR) nor the hazard ratio (HR) differed significantly between living liver donors and healthy controls. However, the methodological designs were inconsistent, which represented a high risk of bias. Hence, the studies were scrupulously reviewed using ROBINS-I with criteria judged specifically for this study (Table 2), and the qualitative analysis was completed (Figure 3b). By excluding items from studies with a high risk of bias and adopting items from studies with a low risk of bias, a stable methodology for the selection and exclusion of subjects and the setting and matching of control groups could be designed (Figure 4) [31].
During quantitative analysis, almost all of the included studies had a large population size and a long follow-up period. Outcomes of this kind, compared with controls, are most appropriately analyzed using HRs and ORs. The random-effects model published by DerSimonian and Laird [32] was used to determine the pooled overall incidence or mortality ratios with 95% confidence intervals for outcomes. Statistical heterogeneity was evaluated by Cochran's Q test and the I² statistic. Meta-regression analysis was conducted for each moderator [33].
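For readers unfamiliar with these quantities, the sketch below shows how a DerSimonian-Laird pooled estimate, Cochran's Q, and I² can be computed from per-study effects and variances. It is a minimal illustration of the standard formulas, not the analysis code of the review; the function name, the use of log-scale effects (for example log HRs), and the toy inputs are assumptions.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study effects (e.g. log HRs) with the DerSimonian-Laird random-effects model."""
    k = len(effects)
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # I^2 in percent
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical log hazard ratios and their variances from three studies.
result = dersimonian_laird([0.10, 0.35, -0.05], [0.02, 0.03, 0.04])
print({key: result[key] for key in ("pooled", "q", "tau2", "i2")})
print("pooled HR:", math.exp(result["pooled"]))
```

The pooled value and its interval are on the log scale and are exponentiated to report an HR or OR. When Q does not exceed its degrees of freedom, the between-study variance is set to zero and the result coincides with the fixed-effect estimate.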
Many researchers do not spend much time on qualitative evaluation. Qualitative evaluation of RCTs [34] can be completed relatively easily, but qualitative evaluation of observational studies is not easy. A qualitative evaluation can only be performed properly when methodological criteria are established for each item [35,36]. We believe that an appropriate scientific methodology can only be created by following such a qualitative evaluation. Figure 3b shows the risk of bias for each of the seven evaluation items. The difficulty is that indicators must be developed that can grade the risk for each of the seven items. Indicators developed in this way allow for detailed evaluation of observational studies. To expand on this: for the item on bias due to confounding, those who implement the SR must create their own standards. In this case, we based the rating on whether the control group was matched and whether the HR was adjusted: if matching was performed and the HR was adjusted, the risk was rated as low, whereas if matching was performed but the HR was not adjusted, it was rated as moderate. In this way, an optimal methodology can be established in the process of qualitative evaluation of all items (Table 2). Through this qualitative evaluation, we could determine how to set up the control group, how to set the exposure period, how to set the index date, what exclusion criteria to apply to the target group, and what to analyze (for example, RR or HR). Moreover, we could also understand how to achieve balance by matching or weighting and, finally, whether to adjust for covariates after matching.
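Writing such a criterion down as an explicit rule helps reviewers apply it consistently across studies. The sketch below encodes only the two cases stated above for the confounding item; the function name is hypothetical, and any other combination is deliberately returned as unspecified rather than guessed.

```python
def confounding_bias_rating(matched: bool, adjusted: bool) -> str:
    """Rate risk of bias due to confounding using the two rules stated in the text."""
    if matched and adjusted:
        return "low"       # matching performed and HR adjusted
    if matched and not adjusted:
        return "moderate"  # matching performed, HR not adjusted
    # The text gives no rule for unmatched designs, so none is invented here.
    return "not specified"


# Hypothetical studies described by (matched, adjusted).
studies = {"Study 1": (True, True), "Study 2": (True, False), "Study 3": (False, True)}
ratings = {name: confounding_bias_rating(*flags) for name, flags in studies.items()}
print(ratings)
```

A complete rating scheme would need a row of this kind for each of the seven evaluation items.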

Discussion
For big data to be applied and generalized in clinical practice, it must be subjected to the scientific evaluation known as EBM. EBM is a medical methodology that integrates appropriate scientific evidence with the experience of doctors in clinical decision-making to provide patients with the best care possible. Therefore, healthcare big data can be said to be fully used when new findings obtained from the healthcare big data integration platforms discussed above are published in professional journals or as clinical trial results, eventually approved by regulatory agencies, and recognized as EBM. Another method for EBM is meta-analysis [37-40], which integrates the scientific knowledge accumulated so far on a subject. Meta-analysis is a combined methodology that quantitatively synthesizes research findings within the framework of an SR [34,36]. Systematic reviews and meta-analyses should adhere to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [41] for RCTs and the Meta-analysis of Observational Studies in Epidemiology (MOOSE) guidelines [42] for observational studies; transparent and accurate reporting and widespread use of robust reporting guidelines improve the reliability and value of the published health research literature. New scientific knowledge is recognized as scientific fact through peer-reviewed publication in relevant professional journals. Integrating these individual studies through meta-analysis is a highly efficient way to demonstrate medical effectiveness, and it is considered the highest level of EBM. When searching for evidence-based information, one should select the highest level of evidence possible for clinical implications and recommendations. Systematic reviews and meta-analyses are considered the gold standard for medical decision-making because they are known to contain the best available evidence to answer health research questions [43-45].
Even after a qualitative and quantitative evaluation of previous studies has been completed through SR, it is still very difficult to create and analyze a retrospective cohort using big data. First, access is restricted, so even after IRB approval, it may take a long time to obtain access rights. Time is therefore the first obstacle. Second, even if access is granted, it is not easy for clinicians to analyze the data themselves. Simply importing and analyzing such data takes a lot of time, which is difficult for researchers who spend much of their time on clinical work. Lastly, by the time a manuscript has been completed, reviewed, and returned for revision, data access rights are lost again, which makes it difficult to reanalyze the data. So how can these obstacles be overcome? The answer lies in collaborative research. Rather than trying to access and analyze the data directly, clinical researchers can collaborate with non-clinical researchers who have extensive experience in big data research. Even in this case, however, it is very important that the clinical researcher play a significant role in determining the methodology by implementing SR.

Conclusions
There are several types of big data related to medical care. Among them, the data that can be used for research are mainly insurance claim data, all of which can be accessed through an individual database or a converged platform. SR can be of great help to researchers planning observational studies, whether they use insurance claim data or databases of national institutions individually or use medical big data platforms. Through SR, researchers can determine the optimal methodology and derive