Big Data Analytics in Government: Improving Decision Making for R&D Investment in Korean SMEs

To expand the field of governmental applications of Big Data analytics, this study presents a case of data-driven decision-making using information on research and development (R&D) projects in Korea. The Korean government has continuously expanded the proportion of its R&D investment in small and medium-size enterprises to improve the commercialization performance of national R&D projects. However, the government has struggled with the so-called “Korea R&D Paradox”, which refers to how performance has lagged despite the high level of investment in R&D. Using data from 48,309 national R&D projects carried out by enterprises from 2013 to 2017, we perform a cluster analysis and decision tree analysis to derive the determinants of their commercialization performance. This study provides government entities with insights into how they might adjust their approach to Big Data analytics to improve the efficiency of R&D investment in small- and medium-sized enterprises.


Introduction
The concept of Big Data analytics (BDA) pertains to accumulating, combining, analyzing, and using large-scale data for various purposes and of various types. BDA enables organizations in both the private sector and, increasingly, the public sector to make better decisions (i.e., more quickly and efficiently) based on evidence and insights [1][2][3]. Indeed, Big Data applications in government are no longer unusual. Many countries have come to regard Big Data as a growth engine for the future as well as a solution to existing economic and social problems. Over the past decade, governments globally have announced comprehensive strategies for using Big Data at the national level. They first focused on the construction of infrastructure to open access to data and promote its utilization. Thereafter, they supported legal and institutional improvements to empower the private sector to use public data and create added value (indirect role) as well as used Big Data for policymaking (direct role) [4].
Indeed, building on the constructed infrastructure, most governments endeavor to expand their own use of Big Data to formulate policies based on concrete data rather than depending on mere experience or intuition. The use of Big Data has thus far been limited because of the lack of actual data available to the government to implement such data-driven policies. In particular, the use of Big Data has been scarce because of the limitations of the infrastructure required to (i) accumulate and generate reliable data, which is essential for utilization; and (ii) convert the accumulated and generated data into a form that can actually be used in practice. However, an infrastructure that can

Government and Big Data
Evidence-based policy, which refers to establishing policies grounded on objective and scientific research and ensuring they are designed and implemented based on concrete data, has existed since ancient times. In Ancient Greece, Aristotle argued that diverse sources of knowledge should be included to set rules or develop regulations; Aristotle's concept of diverse knowledge has been interpreted to include scientific knowledge [5]. In the medical field in the early 1990s, the phrase "evidence-based" was formulated to refer to medical practices based on evidentiary data [6] and the phrase has since entered into generalized use. It is only in recent years, however, that the emphasis on evidence-based practices has entered the field of government [7,8].
Governmental institutions have traditionally selectively generated and managed the information the government needs, using data for institutional maintenance and reinforcing organizational capabilities rather than making them publicly available. Gradually, however, governments have been encouraged to change this monopolistic approach to information management. Owing to rapid changes in sociocultural environments and behaviors caused by globalization and the increasing complexity and diversity of society as well as the development of ICT, there have been significant changes in the environment in which governmental policies are implemented [9]. It is increasingly argued that the government should eschew opinion-based policies and the selective application of evidence, driven by ideological perspectives, prejudices, and conjecture, and instead make policy decisions based on citable evidence, with ample access to data [10,11]. Research has found that through evidence-based policymaking, governments can gain trust in a changing environment [12], justify policy decisions, make policy decisions more quickly, resolve conflicts in the process of formulating and implementing policies, and improve the quality of policies [13][14][15].
The key factors to evidence-based policymaking are securing the objectivity of the materials or data used [16] and conducting scientific analysis [17]. Therefore, to reach the stage where evidence-based policies can be established, it is first necessary to collect high-quality data that enable the suitable analysis of the issue in question, select scientific methods to analyze this accumulated data, and apply the analytical results to the process of designing policies. However, in only a few limited fields, such as healthcare, security, and public safety, and environmental monitoring and response measures, have governments been able to secure sufficient data with proven objectivity, conduct scientific analysis, and apply these findings to formulate policies. One of the main reasons for this slow progress has been the lack of accumulated data. However, data-based practices are now expected to become applicable to a wider range of fields thanks to improvements in data collection, integration, and analysis techniques [18].
Studies have analyzed the application of BDA to governmental practices in the healthcare, security, and public safety sectors. In the healthcare sector, governments use Big Data to find the strongest scientific basis for suppressing increases in medical expenses. One of the top priorities of governments has been building an infrastructure that links the Big Data from various organizations through the construction of databases that connect the individual patients of public administrative and medical organizations by developing a network of data on existing medical services [1,19]. Using such databases, studies have analyzed the optimal treatments and cost reductions based on predictions of high-cost patients, readmitted patients, and occurrences of complications and medical incidents; other studies have focused on applying these data and achieving service optimization through personalized medical services, clinical decision support systems, and mobile devices [20,21].
In the public safety sector, governments identify crime trends by analyzing the times, areas, and types of crime incidents within criminal records. These data are used to establish public safety policies such as dispatching more police officers to certain crime-prone areas. Based on the analyzed information, an application has been developed to improve citizens' safety; this application notifies citizens in areas in which crime is expected to occur to reduce crime rates. Studies of these issues are referred to as security informatics, an area of expertise continuously advancing through the integration of technical, organizational, and policy-based approaches [22][23][24].

R&D Policy of the Korean Government: the Korea R&D Paradox
With the advent of the knowledge-based society in the 21st century, science and technology have emerged as new growth engines for strengthening national competitiveness, outstripping the importance of other factors of production such as capital and labor. As a result, countries globally are continuously expanding investment in R&D to secure these growth engines. The Korean government has also increased its R&D investment in pursuit of economic development through science and technology. As of 2017, R&D investment in South Korea amounted to 78.8 trillion KRW, the fifth largest in the world and the largest globally in proportion to GDP. Of this total, the government's R&D expenditure was KRW 19.4 trillion, nearly 5.5 times greater than the 3.5 trillion KRW spent in 2000 [25]. In 2017, government-funded research institutes received 7.9 trillion KRW, academia 4.4 trillion KRW, SMEs 4.1 trillion KRW, large firms 0.4 trillion KRW, and other actors, including public research institutes, 2.6 trillion KRW. In particular, R&D investment in SMEs has been steadily increasing, rising from 2,854 billion KRW in 2013 to 4,119 billion KRW in 2017 (The proportion of R&D support for SMEs was calculated based on the purchasing power parity index in 2010; the equivalent index of South Korea was 56.8%, significantly higher than the percentages in the United States (11.4%), France (24.8%), and the United Kingdom (25.2%) [26]); however, investment in large firms has been decreasing, falling from 861 billion KRW to just 419 billion KRW in 2017 [25]. The government expects to improve commercialization performance and economic growth by implementing R&D support for SMEs, which account for 99% of all enterprises in Korea.
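The budget figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative, and treating 861 billion KRW as the 2013 value for large firms is an assumption, since the text states the base year only for SMEs.

```python
# 2017 government R&D expenditure by recipient, in trillion KRW (figures from the text).
recipients = {
    "government-funded research institutes": 7.9,
    "academia": 4.4,
    "SMEs": 4.1,
    "large firms": 0.4,
    "other actors": 2.6,
}
total = sum(recipients.values())  # should reproduce the quoted 19.4 trillion KRW

# Share of the government budget received by each type of actor.
shares = {name: amount / total for name, amount in recipients.items()}


def pct_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


# Support in billion KRW; the 2013 base year for large firms is an assumption.
sme_change = pct_change(2854, 4119)
large_firm_change = pct_change(861, 419)
ratio_2017_to_2000 = 19.4 / 3.5  # the "nearly 5.5 times" comparison

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"SME support change: {sme_change:+.1f}%")
print(f"Large-firm support change: {large_firm_change:+.1f}%")
print(f"2017 vs 2000 government R&D spending: {ratio_2017_to_2000:.2f}x")
```

On these figures, SMEs received roughly a fifth of the 2017 government R&D budget, while support for large firms roughly halved over the period.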
However, despite this proactive support, the level of commercialization performance achieved as an outcome of government-sponsored R&D projects has continued to be low. According to the government's plan announced in 2014 to promote innovation among SMEs, the success rate of commercialization attributed to government R&D projects for SMEs has been only around 50% [27]. Recently, the Presidential Advisory Council on Science and Technology formulated and approved a national R&D innovation plan including a provision to double R&D investment for SMEs. The plan sets quantitative targets to support SMEs, requiring government agencies and public institutions, which have annual R&D budgets of 30 billion KRW, to invest a certain percentage of their R&D funding in SMEs [28]. The problem is that performance has been analyzed in a fragmented manner based only on R&D investment and the number of commercialized projects, and there is no systematic analysis of which projects involving SMEs have achieved successful commercialization outcomes thanks to R&D investment and whether such commercialization has generated actual sales.
The Korean government established the National Science and Technology Information Service (NTIS) in 2006 to share and jointly utilize information on national R&D projects, which had previously been managed by individual departments. However, the government's utilization of NTIS data has been limited to presenting R&D expenditure by actor, research phase, and region using basic statistical analysis, or to releasing counts of achievements such as papers, patents, and technology transfers, including commercialization performance. Although the government has collected sufficient data on national R&D projects, it has been unable to effectively apply data analytics to formulate data-driven R&D policies.

Analysis Procedure
This study aims to derive the optimal solution for enhancing the efficiency of governmental R&D investment in SMEs. As most of the data on R&D projects carried out in 2018 have not yet been entered into the NTIS, we extracted data on 48,309 national R&D projects conducted by SMEs from 2013 to 2017. Python was used for data preprocessing and analysis.
We next employed cluster analysis to group the data and thus examine the determinants of commercialization performance. Cluster analysis is suitable for the exploration of the large amounts of R&D project data made available by the Korean government. In addition, it can classify these R&D project data to show their characteristics. Using the results of the cluster analysis, we can then understand the structure of project data in high-performing projects when commercialization performance outcomes are not revealed by using indicators such as average investment.
We then clustered the data into groups using the self-organizing map (SOM) algorithm [29]. Linear methods such as principal component analysis can efficiently summarize large-scale multidimensional data using a small number of components. However, some information is lost in the linear reduction, and these methods are unsuitable for analyzing non-linear targets [30,31]. To avoid these problems, we used the SOM algorithm, which can process large-scale data quickly and has been reported to outperform hierarchical cluster analysis methods [32]. Table 1 shows the 13 input variables for the cluster analysis. These variables were based on the project information available from the NTIS in 2017, which the Korean government uses for the investigation, analysis, and evaluation of national R&D programs [25]. Based on the clustering results, we used four indicators of commercialization performance in the NTIS, namely the average number of commercialized projects, commercialization period, sales from commercialized projects, and the number of jobs created by commercialized projects, to compare and analyze each cluster (Table 2).
Finally, we conducted a decision tree analysis using the classification and regression tree (CART) algorithm to identify the determinants of commercialization performance for the projects in the clusters [33]. The input variables for the decision tree analysis included not only the variables in Table 1, but also the categorical variables that could not be used in the cluster analysis. Table 3 shows the added variables and their descriptions.

Table 1 (partial). Input variables for the cluster analysis.
Variable | Description
[variable name missing] | Actor performing under a contract or jointly performing some of the R&D project managed by enterprises; classified into enterprise, university, government-funded research institute, foreign research institute, and other
Continuation | R&D projects classified into new or continued projects. The latter refers to projects whose project period has expired, but that have been confirmed to continue

Table 2. Indicators of commercialization performance.
Variable | Description
Number of commercialized projects | Number of commercialized projects
Commercialization period | Difference between the year of commercialization and year of the project start
Sales | Sales from commercialized projects
Job creation | Number of jobs created by commercialized projects

Table 3. Input variables for the decision tree analysis.
Variable | Description
Name of department | Name of the administrative department that manages all aspects of the planning, evaluation, and management of R&D projects
Research field | Research field of R&D projects; classified into nature, life, and artificial, following the national standard classifications of science and technology
Application field | Application field of R&D projects; classified into industry and the public sector

Cluster Analysis: The SOM Algorithm
The SOM algorithm, proposed and developed by Kohonen [29,34], is an unsupervised neural network used to visualize and analyze high-dimensional data in the form of maps arranged in easy-to-understand low dimensional neurons. It consists of two layers of artificial neural networks; one is the input layer that receives input vectors and the other is the competitive layer comprising a two-dimensional grid. In this layer, vectors are clustered at one point according to the characteristics of the input vector. The input layer has the same number of neurons as the number of input variables, and the competitive layer has the same number of neurons as the number of clusters predetermined by the user. The data in the input layer are arranged in the competitive layer through learning, which is called a map. The sorted data is displayed as a grid on the map. Data with similar patterns are located close together on the map, while data with different patterns are located far away from each other. This allows us to easily visually assess not only similarities in the clusters but also similarities between the clusters. To determine the optimal number of clusters, we compared the silhouette coefficient with the number of clusters and conducted the analysis based on the number of clusters with the highest coefficient.
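The two-layer structure described above can be illustrated with a small, self-contained sketch. This is not the implementation used in the study, whose code and SOM library are not given; it is a minimal online SOM in plain Python under stated assumptions: a Gaussian neighbourhood, linearly decaying learning rate and radius, and hypothetical parameter names (`lr0`, `sigma0`, `epochs`). The toy points are made up and stand in for the standardized project variables.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)))


def train_som(data, rows=3, cols=3, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM and return {grid position: weight vector}."""
    rng = random.Random(seed)
    # Competitive layer: one weight vector per grid neuron, started at random samples.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)               # learning rate decays towards zero
            sigma = sigma0 * (1.0 - progress) + 1e-3  # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Neurons near the winner on the grid are pulled towards x as well.
                grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
                for i, xi in enumerate(x):
                    w[i] += lr * influence * (xi - w[i])
            step += 1
    return weights


# Toy input: two made-up groups of 2-D points (NOT the NTIS project data).
toy_rng = random.Random(1)
toy_data = [(toy_rng.gauss(0.0, 0.3), toy_rng.gauss(0.0, 0.3)) for _ in range(20)]
toy_data += [(toy_rng.gauss(5.0, 0.3), toy_rng.gauss(5.0, 0.3)) for _ in range(20)]

som = train_som(toy_data)
assignments = [best_matching_unit(som, x) for x in toy_data]
for pos in sorted(som):
    print(pos, assignments.count(pos))  # number of points mapped to each grid cell
```

Each grid position plays the role of one cluster label (for a 3 × 3 map, C00 through C22). Because the neighbourhood update couples adjacent neurons, nearby cells tend to end up with similar weight vectors, which is what makes the map readable.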

Decision Tree Analysis
Decision tree analysis organizes decision rules into a tree structure to perform classification and prediction. It is a data mining technique that searches large amounts of data for unexpected or valuable structures. Once the major input variables in a large dataset have been found, decision tree analysis is useful for analyzing the interactions between the individual factors and determining how these interactions affect the target. In addition, since the analysis process is expressed through a tree structure, it is easy to interpret.
The CART algorithm can be applied regardless of the scale of the objective or input variable. Moreover, the decision tree can be easily interpreted by dividing it through binary splits rather than multiple splits. Another advantage of this approach is that the process of analysis is expressed in the form of trees, which simplifies the interpretation and requires no assumptions of linearity or normality in the variables. This enables the use of both continuous and categorical variables.
Depending on the type of objective variable, the CART algorithm builds a classification tree for a categorical objective variable and a regression tree for a continuous one. In cases where the objective variable is categorical, such as in this study, the Gini index or entropy is used to measure impurity. Optimal splitting is conducted by selecting the split that minimizes the chosen impurity measure [35]. Furthermore, the CART algorithm is robust to outliers and is a non-parametric method that does not require assumptions about the distribution. Since the first split occurs on the variable with the strongest explanatory power, it is an effective method for identifying important variables. Hence, this study used the CART algorithm to derive a predictive model for the creation of qualitative commercialization performance, which was then verified using 10-fold cross-validation. To evaluate the performance of the predicted outcomes, we used the receiver operating characteristic curve to calculate the area under the curve (AUC) [36].
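The impurity-based splitting rule can be made concrete with a short sketch. It is a simplified illustration rather than the study's model: it scores only one-value-versus-rest binary splits on a single categorical variable, whereas a full CART implementation also considers subsets of categories and thresholds on continuous variables. The rows, the field names `practical_use` and `phase`, and the labels are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_categorical_split(rows, labels, feature):
    """Score every 'feature == value' binary split; return (value, weighted Gini) of the best."""
    n = len(rows)
    best = None
    for value in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] == value]
        right = [y for row, y in zip(rows, labels) if row[feature] != value]
        if not left or not right:
            continue  # a split must send at least one row to each side
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or weighted < best[1]:
            best = (value, weighted)
    return best


# Hypothetical toy rows (NOT NTIS data): is a project in the target cluster or not?
rows = [
    {"practical_use": "yes", "phase": "applied"},
    {"practical_use": "yes", "phase": "applied"},
    {"practical_use": "yes", "phase": "basic"},
    {"practical_use": "no", "phase": "applied"},
    {"practical_use": "no", "phase": "basic"},
    {"practical_use": "no", "phase": "basic"},
]
labels = ["in", "in", "in", "out", "out", "out"]

root_impurity = gini(labels)
split_on_use = best_categorical_split(rows, labels, "practical_use")
split_on_phase = best_categorical_split(rows, labels, "phase")
print(root_impurity, split_on_use, split_on_phase)
```

In this toy case the split on `practical_use` yields two pure child nodes, so it would be chosen first; this mirrors the statement that the first split falls on the variable with the strongest explanatory power.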

Clustering Results
We used the SOM algorithm to cluster the 48,309 national R&D projects conducted by SMEs from 2013 to 2017. First, we compared the silhouette coefficients for each number of clusters to select the optimal number of clusters. Upon comparing the values from 2 × 2 up to 10 × 10, we found that the 3 × 3 clusters had the highest silhouette coefficient value (0.4523) and therefore we conducted clustering using 3 × 3 clusters (Figure 1).
Figure 2 presents the results of the cluster analysis using 3 × 3 clusters, showing the distribution of observations across each cluster. Cluster 21 (C21) was the largest, with 15,006 projects, followed by C20 and C02, while C10 was found to be the smallest cluster. As the clustering results included no outlier clusters, we analyzed all the clusters to calculate the average governmental investment per R&D project, as shown in Figure 3, and the average number of projects in which R&D successfully led to commercialization. In the case of successfully commercialized projects, we examined the average number of commercialized projects, average commercialization period, average sales from commercialized projects, and average number of jobs created by commercialized projects (Figure 4).
First, the clusters in which R&D projects led to the highest commercialization performance were C10 and C00. The average time required in C10 and C00 to yield commercialization performance was relatively short, at 0.37 and 0.16 years, respectively. However, while the projects in these clusters reached commercialization in the short term, they were found to have performed more poorly than those in other clusters in terms of qualitative performance, such as sales and job creation. This finding indicates that rapid commercialization in more R&D projects does not necessarily lead to qualitative performance. In particular, although C10 received the largest amount of governmental investment, it was observed to have poor commercialization performance.
Conversely, the cluster with the longest commercialization period, C01, was found to be among the three worst performing clusters in terms of generated sales and job creation. This finding shows that a longer average commercialization period also does not necessarily lead to strong qualitative performance. For C21, another cluster that had a longer commercialization period, it took more than one year to achieve commercialization. However, C21 was among the three best performing clusters in terms of sales as well as exhibiting relatively strong job creation performance. Most of the projects in C20, which performed well in terms of both measures of qualitative performance (sales and job creation) were found to have reached commercialization within six months of the completion of the R&D projects. While the cluster with the highest revenue, C22, appears to have generated high revenue due to the large number of commercialized projects, it also performed relatively well in terms of job creation while also requiring only a short time to reach commercialization (under six months). Considering these findings, we conclude that the commercialization period does not appear to be a determining factor for the qualitative performance of commercialization.
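The grid-size selection at the start of this section rests on the silhouette coefficient, which compares each point's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b) as s = (b − a) / max(a, b). The sketch below is a plain-Python illustration on four made-up points, not the study's code; the function name `mean_silhouette` is hypothetical, and Euclidean distance is an assumption.

```python
import math


def mean_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within the point's own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up example: the same four points under a sensible and a poor labelling.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
poor = mean_silhouette(points, [0, 1, 0, 1])
print(f"good labelling: {good:.3f}, poor labelling: {poor:.3f}")
```

Selecting a map size then amounts to computing this score for the labels produced by each candidate grid and keeping the grid with the highest value, which is how the 3 × 3 configuration was chosen above.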

Determinants of Commercialization Performance in Each Cluster
Based on the results of the cluster analysis, we conducted the decision tree analysis to identify the specific factors that led to the qualitative performance in C20 and C22. We also examined which factors led to the qualitative performance in C21 as opposed to another cluster that had similar times to commercialization, C01, which required a longer period to reach commercialization than C20 and C22. Table 4 reports the measured AUC values. Values closer to 1 indicate higher accuracy of the predictive model; an AUC value of 1 indicates perfect accuracy, while values lower than 1 but greater than or equal to 0.9 may be interpreted as indicating high accuracy. Since all the AUC values measured for each cluster exceed 0.9, the predictive models for each cluster derived in the decision tree analysis can be regarded as being reliable.
The results of the decision tree analysis for each cluster are as follows. In the case of C20, which yielded the strongest qualitative performance in terms of job creation, projects designated for "practical use", characteristics equal to "other development", and a technology life cycle equal to "other" had a 0.9814 probability of being in C20. Next, projects designated for "practical use", characteristics equal to "other development", a technology life cycle equal to "emerging", and a phase equal to "applied research" had a 0.9600 probability of being in C20. Projects designated for "practical use", characteristics equal to "other development", a technology life cycle equal to "growth", "maturity", or "decline", and a phase equal to "applied research" had a 0.8312 probability of being in C20 (Figure 5).
In the case of cluster C22, which demonstrated the strongest qualitative performance in terms of sales, new projects "not designated for practical use" with characteristics equal to "idea development" or "other development" had a 0.9994 probability of being in C22 (Figure 6).
Among the projects in C21, which had a longer period to commercialization than C20 and C22 but yielded strong qualitative performance in terms of sales and job creation, new projects "not designated for practical use" with characteristics equal to "product or process development" had a 0.9993 probability of being in C21 (Figure 7). Among the projects in C01, which had a longer period to commercialization, as in the case of C21, but performed poorly in terms of sales and job creation, new projects designated for practical use with characteristics not equal to "other" had a 0.9801 probability of being in C01 (Figure 8).
Of the projects that required a longer-than-average commercialization period, those not designated for practical use were found to perform better. Projects not designated for practical use with characteristics equal to "product or process development" were found to take longer until commercialization but performed better in terms of sales and job creation when they were commercialized successfully. Therefore, projects not designated for practical use with characteristics equal to "product or process development" appeared to require sufficient time rather than rapid commercialization.
C22 showed a large number of projects for non-practical use. In addition, C21, which generated strong qualitative performance, had many projects for non-practical use. However, C01 had a low number of commercialized projects and did not create high qualitative performance, and projects for practical use belonged to C01. Projects for practical use are those in which firms participate to commercialize technology to generate economic and social value from sales and job creation. However, such projects for practical use failed to achieve the expected levels of commercialization, indicating a mismatch in government R&D policies.
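The AUC values used in this section to judge the per-cluster models have a simple interpretation: the probability that a randomly chosen project inside the cluster receives a higher predicted score than a randomly chosen project outside it. The sketch below computes the AUC directly from that definition; it is an illustration on made-up scores, not the evaluation code of the study, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = project belongs to the cluster, score = predicted leaf probability.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.98, 0.96, 0.40, 0.83, 0.10, 0.05]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of examples, which is acceptable for an illustration; for tens of thousands of projects a rank-based or library implementation would be the practical choice.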

Conclusions and Implications
This paper presented a new case of a government's application of BDA. Based on data on national R&D projects in Korea, we conducted cluster and decision tree analyses to identify the determinants of commercialization performance. These analyses showed a low success rate of commercialization for national R&D projects. Among successfully commercialized projects, many were not for practical use, indicating a mismatch in government R&D policies. In addition, many projects were commercialized but failed to create sales or jobs; this shows a lack of social and economic value creation, which is the primary goal of governmental R&D investment in SMEs, and thus a failure to realize a return on investment.
The findings of this study suggest the following policy implications. First, given that the amount of governmental investment was not found to be a determinant of commercialization performance, policymakers must be selective and focused when they design R&D policies for SMEs, as the expansion of inputs does not necessarily lead to an increase in outputs. It seems that a linear viewpoint, in which increasing R&D investment simply leads to national economic growth, has prevailed in the policymaking arena. However, the results of the cluster analysis show that under the current structure of R&D support, large investment projects do not lead to qualitative commercialization performance. Moreover, although the proportion of projects with less investment is large, such projects do not lead to qualitative performance, either. As such, to enhance the effects of R&D support, it is necessary to first elaborate on how to select supported targets and determine the optimal investment for them.
Second, policymakers must conduct integrated reviews of projects designated for practical use. Whether a project was designated for practical use emerged as a determinant of commercialization performance. Interestingly, however, projects not designated for practical use, but aimed at the development of products or processes, have higher commercialization performance than projects that are designated for practical use and have R&D characteristics that may directly lead to commercialization performance, such as prototypes or product/process development. Policymakers focus on the practical application and commercialization of technologies when providing R&D support for SMEs as well as expanding the proportion of projects designated for practical use; however, the findings of this study show that the effectiveness of such efforts is low. As such, it is necessary to fully review the achievability of objectives and the possibility of realizing performance when selecting projects for practical use rather than first expanding the proportion of projects for practical use.
Finally, policymakers must review the R&D information collected by the NTIS when establishing data-driven R&D policies. It is difficult to interpret those R&D characteristics analyzed as determinants of commercialization performance when they are categorized as "others", making it hard to apply them when formulating policy. Indeed, it is difficult to identify their exact intent because of a lack of standardization. It is thus necessary to ensure collected items can be converted into analyzable data to help policymakers apply the data derived from national R&D projects in practice.
Hence, this study makes a significant contribution to the literature by expanding the field of governments' application of BDA and presenting a case of policymaking based on data. In addition, it shows that the government should be concerned about what data can be made available in the future to make policy decisions. Future research is, however, necessary to more closely examine the factors identified in this analysis as determinants of commercialization performance.