System Framework for Cardiovascular Disease Prediction Based on Big Data Technology

Amid growing concern over the changing climate, environment, and health care, the interconnection between cardiovascular disease and a variety of environmental factors, alongside rapid industrialization, has been the focus of recent research. It is necessary to research risk factor extraction techniques that consider individual external factors and predict diseases and conditions. Therefore, we designed a framework to collect and store various domains of data on the causes of cardiovascular disease, and constructed a big data integrated database. A variety of open source databases were integrated and migrated onto distributed storage devices. The integrated database was composed of clinical data on cardiovascular diseases, national health and nutrition examination surveys, statistical geographic information, population and housing censuses, meteorological administration data, and Health Insurance Review and Assessment Service data. The framework was composed of data, speed, analysis, and service layers, all stored on distributed storage devices. Finally, we proposed a framework for a cardiovascular disease prediction system based on lambda architecture to solve the problems associated with the real-time analyses of big data. This system can be used to help predict and diagnose illnesses, such as cardiovascular diseases.


Introduction
As climate conditions change and concern over health care grows, health care-related issues such as cardiovascular diseases have exhibited close correlations with external environmental factors. Globally, the prevalence and risk of cardiovascular disease have increased markedly, accompanying rapid industrialization, and are now a major problem of aging societies. Several studies have identified a relationship between fine dust (PM2.5) exposure and cardiac infarction based on correlations between environmental pollution and cardiovascular disease [1][2][3], as well as increases in the hospitalization rates of patients with cardiovascular disorders according to temperature and carbon dioxide concentration [4]. Moreover, environmental pollution by nitrogen dioxide and sulfur dioxide increases mortality [5]. Accordingly, it is necessary to create efficient disease prediction and risk estimation techniques to prevent cardiovascular diseases, which are correlated with a variety of external environmental factors.
Existing cardiovascular disease prediction systems consider lifestyle factors such as smoking, drinking, diet, exercise, and stress [6]; however, according to recent research, many additional factors affect cardiovascular disease, such as climate change, health conditions, social and economic factors, and atmospheric environment information. Yet, no integrated database (DB) covering these diverse environmental factors exists that can be used to extract the risk factors occurring in different areas. Moreover, no optimal prediction system exists that considers such a variety of factors.
Because recent research has identified environmental factors that harm health, such as abnormal climate and atmospheric pollution, it is necessary to integrate this information into quantitative evaluations and diagnoses of health to provide a systematic health care policy. In particular, it is increasingly important for health and environmental policies to consider these factors to minimize risks. In 2013, the UK Department of Health announced the Personalized Health and Care 2020 framework, which reinforced control over medical treatment and welfare information for patients, and constructed the Health & Social Care Information Center as an independent organization that collects, stores, connects, and analyzes distributed social security data [7]. As another example, Roski et al. used a personalized medical care clinical decision support system and anticipated service optimization that reflected patient data. This enabled the practical application of health big data for population health analyses and prevention [8].
Numerous studies have examined the correlations between health conditions and cardiovascular disorders, researching cardiac disorder diagnosis and prediction systems by using artificial neural networks, data mining [9,10], and association rules (i.e., emerging patterns) to identify significant patterns among medical treatment data [11]. Furthermore, a variety of studies have examined integrated DBs of large volumes of data, approaching the method from an ontology perspective [12] and using open source-based research [13].
Studies have also evaluated changes in environmental pollutant concentrations and the degree of health damage according to national health insurance data and climate changes; examined the use of danger and risk maps in environmental health regions by environmental offices; and evaluated frameworks for processing big data appropriate to each domain [14]. It has been found that such data processing can be implemented by using the MapReduce model in Hadoop, yielding excellent scalability [15,16].
Lambda architecture is a big data technique that supports real-time analyses, addressing the limitation that a large volume of data cannot otherwise be analyzed in real time. To address this limitation, a method can be employed that blends views computed in advance in a batch layer with data processed in real time; the blended data are then generated and stored. To achieve this, data are formed into a batch view on a cycle by the batch layer, and the same data are formed into a real-time view via real-time data processing. These two views are then blended and analyzed, enabling analyses that reflect real-time data [17,18]. Supporting this, Amazon Web Services, which processes big data, published a white paper on integrating batch processing and real-time processing into a single workflow using lambda architecture [19].
For big data and streaming data analysis, the Apache Hadoop software library and Microsoft Azure provide a variety of solutions and comprehensive analysis techniques [20,21]. Moreover, real-time analyses have been performed by using key-value analysis [22], streaming analysis using Apache Spark, and network analysis using the Open Network Operating System controller in real time [23,24]. YARN, which is used in batch processing, is the resource management platform of Hadoop that influences the energy efficiency of a cluster and the utility of applications [25,26]. Furthermore, NoSQL (Not Only SQL) databases in Hadoop can be used in the service layer, and a new index generation and storage technique for the Hadoop ecosystem has been developed that achieves better performance in the Hadoop environment [27][28][29]. In addition, performance development modeling has been carried out using multiple solution techniques based on lambda architecture, as well as models managing social and crowding emergencies [30,31].
Currently, the collection and storage of large amounts of heterogeneous data from different domains involves difficulties not easily handled by simple DBs; therefore, in this study, we aimed to create an effective processing and analytical technique based on big data. Rather than extracting risk factors from clinical data alone, we designed a prototype disease prediction system driven by a complex set of factors from a variety of domains, including environmental, health, clinical, population, and climate conditions. We then integrated the DBs of these various domains and developed a prototype prediction system based on complex factors from patients with cardiovascular disorders, solving the associated problems with lambda architecture. Specifically, the contributions of this study are threefold: (1) it collects data on various factors that can affect cardiovascular disease; (2) it designs and implements integrated DBs based on big data; and (3) it offers a prototype design of an analysis system that can predict cardiovascular disease.

Methods
Figure 1 presents an overview of the proposed prediction system. The data layer is composed of a data integration engine that enables the migration of a variety of data into distributed storage devices. The data integration engine synthetically stores various properties and data with diverse structures, and then performs preprocessing to make the data suitable for analysis and reformation in the analysis layer. The speed layer is composed of a real-time integration engine that preprocesses data generated in real time and delivers the results analyzed through a real-time analysis model to the analysis layer. The analysis layer is composed of a data analysis engine, which analyzes the data collected in the data layer with an analysis model pipeline and then merges the analysis results with those from the speed layer. It then uses these results as the input value for the prediction model pipeline and delivers the analysis results to the service layer through the prediction model pipeline. The service layer provides users with a hybrid web or hybrid mobile app by using the analysis results from the analysis layer. All the data in each layer are stored on distributed storage devices.

Data Integration Engine
The data layer is composed of the data integration engine, which migrates DB or file data, such as the national health and nutrition survey provided by the Korea Centers for Disease Control and Prevention and data from the HIRA (Health Insurance Review and Assessment Service), onto distributed storage devices. The data integration engine is implemented as a "map-side only job" by using the MapReduce framework of Hadoop; it uses the computing power of the nodes comprising the Hadoop cluster in parallel.
Algorithm 1 presents the map-side only job implemented in the data integration engine. It accepts the IDs and records of the various data sources as input values; preprocessing and integration are performed on each column according to the ID of the record, and the results are then written out to the distributed storage devices. Figure 2 shows the data integration engine, in which each DB is handled by an optimized map-side only job. Each map-side only job extracts the data of the columns composing the records of each DB, and these are stored on the distributed storage devices with the ID of each DB after preprocessing, such as normalization.

Algorithm 1. Map-side only job of the data integration engine.

6:  for all column c ∈ record r do
7:      c′ = Preprocessing(id, c)
8:      c″ = Integration(id, c′)
9:      r′ = pair(id, c″)
10:     EMIT(pair(id, c″), r′)

Figure 2. Data integration engine.
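The map-side only job above can be sketched in Python in the style of a Hadoop Streaming mapper. The `preprocess` and `integrate` functions below are hypothetical stand-ins for the Preprocessing and Integration steps of Algorithm 1, whose internal logic is not specified in this paper; the key layout is likewise an assumption for illustration.

```python
# Hypothetical stand-ins for the Preprocessing() and Integration() steps of
# Algorithm 1; their actual logic is not specified in the paper.
def preprocess(db_id, column):
    return column.strip().lower()            # e.g., simple normalization

def integrate(db_id, column):
    return "%s:%s" % (db_id, column)         # attach the source-DB ID

def map_side_only_job(db_id, records):
    """Emit one 'key<TAB>value' line per record; no reduce phase follows."""
    lines = []
    for record in records:
        cols = [integrate(db_id, preprocess(db_id, c)) for c in record]
        key = "%s|%s" % (db_id, cols[0])     # DB ID paired with the record's ID column
        lines.append("%s\t%s" % (key, " ".join(cols[1:])))
    return lines
```

With Hadoop Streaming, such a mapper would run with the number of reduce tasks set to zero, so each node writes its output directly to the distributed storage.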

Real-Time Data Integration Engine
The real-time data integration engine, composing the speed layer, collects data in real time by using an API (Application Program Interface), such as the SGIS (Statistical Geographic Information Service) API of Statistics Korea, which provides data in the form of a query, or a public data API. The collected data feed the streaming data analysis pipeline based on Apache Spark. As shown in Figure 3, the real-time data integration engine collects data in real time and verifies redundancies against data already collected and analyzed. Then, to ensure that only newly generated data are sent with the Spark streaming job to each DB, it is composed of an API manager, a data distributor, and the individual Spark streaming jobs.
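The redundancy check described above can be sketched in plain Python as follows; in the actual system this logic would run inside the Spark streaming pipeline, and the record keys and the in-memory set of seen keys are illustrative assumptions.

```python
# Minimal sketch of the redundancy check the real-time data integration
# engine performs before forwarding records to the Spark streaming jobs.
def new_records(batch, seen_keys):
    """Return only records whose key has not been collected before."""
    fresh = []
    for key, payload in batch:
        if key not in seen_keys:
            seen_keys.add(key)       # remember the key for later batches
            fresh.append((key, payload))
    return fresh
```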

API Manager
The API manager in the first data streaming step of the real-time data integration engine is composed of APIs optimized to source all real-time data. Each API is implemented according to the requirements of the organization providing the API, and some APIs provide data based on a query. For these, the query generator generates a query according to the API. In addition, the query generator can dynamically generate a query according to user requirements in the service layer.
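A minimal sketch of such a query generator is shown below. The base URL, the `serviceKey` parameter, and the parameter names are invented for illustration; they are not the real SGIS or public-data API parameters.

```python
from urllib.parse import urlencode

# Hypothetical query generator for the API manager: builds a request URL
# for a query-based API from user- or layer-supplied parameters.
def build_query(base_url, api_key, params):
    query = dict(params)
    query["serviceKey"] = api_key            # illustrative auth parameter
    return base_url + "?" + urlencode(sorted(query.items()))
```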

Data Distributor
The data distributor classifies the data collected from each API in the API manager and delivers the data to the data integration engine and the Spark streaming engine. The data collected through each API can arrive together with redundant data already collected, depending on each API's conditions; these are separated by the data classifier of the data distributor. The data distributor modifies the data collected through the data classifier or classifies new data, while the Spark job initiator executes the Spark streaming job for analysis and the DB selector selects the suitable DB for the data and stores them in that DB.
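The distributor's routing decision can be sketched as follows. The routing table, the source-API names, and the split into "update" versus "stream" buckets are assumptions made for illustration only.

```python
# Illustrative routing table: which store the DB selector picks per source API.
DB_ROUTES = {"sgis": "geo_db", "weather": "climate_db"}

def distribute(batch, seen_keys):
    """Classify records as already-seen (update) or new (sent to a Spark job)."""
    routed = {"update": [], "stream": []}
    for api, key, payload in batch:
        target = DB_ROUTES.get(api, "misc_db")          # DB selector
        if key in seen_keys:
            routed["update"].append((target, key, payload))
        else:
            seen_keys.add(key)
            routed["stream"].append((target, key, payload))  # goes to the Spark job
    return routed
```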

Data Analysis Engine
The data analysis engine, composing the analysis layer, consists of an optimized analysis pipeline that identifies correlations between factors and cardiovascular disorders, together with a pipeline for the prediction model based on the analysis results. As shown in Figure 4, the data analysis engine is composed of the analytics pipeline, which refines the prediction model by using optimized analysis models, and the prediction pipeline, which generates a knowledge DB for prediction recommendations.
The analytics pipeline, which is composed of a variety of models optimized for analysis, sends each analysis result to the score manager. The score manager then delivers the analysis results to the feedback generator, which manages the analysis results and analysis model improvement, and to the prediction model optimizer, which further optimizes the prediction model. The prediction pipeline generates prediction results for the prediction manager from a variety of models optimized by the prediction model optimizer of the analytics pipeline, and includes the knowledge DB generator, which generates the knowledge DB.
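A toy sketch of this flow is given below: several analysis models score the same input, the score manager aggregates the results, and models scoring below a threshold are flagged for the feedback generator. The models, features, and threshold are invented for illustration and are not the paper's actual analysis models.

```python
def run_pipeline(models, features, threshold=0.5):
    """Score `features` with every model; flag low scorers for feedback."""
    scores = {name: model(features) for name, model in models.items()}
    feedback = [name for name, s in scores.items() if s < threshold]
    return scores, feedback
```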


Service Engine
The service engine, composing the service layer, performs queries based on the analysis results according to user requirements in a variety of ways. It is composed of a hybrid web/hybrid mobile app that maximizes service through a suitable visualization of the queried analysis results. As shown in Figure 5, the service engine is composed of a user interface manager, which delivers the user requirements to the visualization manager; the visualization manager, which passes the results of a query of the user requirements against the knowledge DB to the service manager; and the hybrid web/app service manager, which enables user service on a web or mobile app.

Proposed Prediction System Based on Lambda Architecture
The prediction system introduced in this study emphasizes the importance of the real-time analysis of big data, and it is based on lambda architecture. Lambda architecture has a speed layer that can analyze data generated in real time, addressing the practical problem that general big data analysis pipelines require too long an analysis time to match the speed at which data are generated. It merges these results with the results analyzed in a batch layer, and then provides the combined analysis results. Figure 6 shows an overview of the proposed prediction system based on lambda architecture for cardiovascular disorder prediction.
In the proposed prediction system, the speed layer of the lambda architecture manages the real-time data integration engine in the layer of the same name. This engine delivers the data collected in real time to the data integration engine of the data layer. At the same time, it delivers results analyzed in real time through the analysis pipeline to the data analysis engine of the analysis layer. The analysis layer performs the analysis for the entire DB, as performed in the batch layer of lambda architecture, via the analysis model pipeline, and merges in the analysis results delivered from the speed layer. Then, it delivers the results to the prediction model pipeline.
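The merge step at the heart of lambda architecture can be sketched as follows: a precomputed batch view is combined with a real-time view so that query results also reflect data that arrived after the last batch run. Counting events per key is a stand-in for the actual analysis models.

```python
def merge_views(batch_view, realtime_view):
    """Combine the batch view with the real-time view into one serving view."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged
```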
As the data analysis engine merges the analysis results delivered from the speed layer, there should be no results redundant with the analysis results extracted from the analysis model pipeline. To prevent redundancies, the real-time data integration engine of the speed layer applies a time stamp, which can be traced to ensure that these real-time data are indeed recent data that have not already been analyzed in the analysis layer.
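This time-stamp rule can be sketched as a simple routing function: a record is handled by the speed layer only if it is newer than the last completed batch run, and otherwise it is assumed to be covered by the analysis layer. The cutoff handling below is an assumption for illustration.

```python
def route_by_timestamp(records, last_batch_ts):
    """Split records into speed-layer work and data the batch run already covers."""
    speed, batch_covered = [], []
    for ts, payload in records:
        (speed if ts > last_batch_ts else batch_covered).append(payload)
    return speed, batch_covered
```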

Experiment Evaluation
Table 1 shows the example data used in the proposed system. These sample data consist of various file and data types, and preprocessing various types of files can be expensive. The proposed system employs the real-time data integration engine of the speed layer, which processes real-time data, and the data integration engine of the batch layer, which processes already collected file data. The format of the data processed by each engine differs, but the task is the same. To measure the performance of the proposed system, we evaluated the performance of MapReduce in the batch layer, which has the highest data preprocessing cost.

Because the MapReduce jobs of the data integration engine are configured as map-side only jobs, it is possible to perform parallel processing on a block-by-block basis. This eliminates reduce jobs, thereby preventing performance degradation during data collection. Figure 8 shows a simple experiment comparing a job designed as a map-side only job with the same job designed as a full MapReduce job. The difference is that the map-side only job has no reduce code, whereas the comparison job passes the mapper's intermediate results, keyed with a null value, through a dummy reducer that performs no function. The experiments were performed in stand-alone mode to identify differences in performance depending on the type of MapReduce job. As shown in Figure 8, as the number of records increases, the performance of the map-side only job improves by up to 200% compared with the MapReduce job. Because this experiment was performed on a single node, the difference in performance can be expected to increase substantially as the number of nodes increases.
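The two job designs compared above can be illustrated with the toy simulation below (not the paper's benchmark code): the same mapper output is either emitted directly, or pushed through a shuffle/sort and a dummy reducer that performs no function. Upper-casing stands in for the real preprocessing; both designs produce identical output, but the second pays for the shuffle/sort.

```python
def mapper(records):
    return [(None, r.upper()) for r in records]      # null key, as in the experiment

def map_side_only(records):
    return [value for _, value in mapper(records)]   # mapper output goes straight out

def map_with_dummy_reduce(records):
    # Simulated shuffle/sort overhead before a pass-through (dummy) reduce.
    pairs = sorted(mapper(records), key=lambda kv: str(kv[0]))
    return [value for _, value in pairs]
```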

Conclusions and Future Work
We collected and stored data from a variety of domains that could be factors for cardiovascular disorders and designed a framework to construct a big data integrated DB. The framework was composed of data, speed, analysis, and service layers. The data generated in each layer were stored on distributed storage devices, including the national health and nutrition survey provided by the Korea Centers for Disease Control and Prevention and DBs or files containing data provided by the Health Insurance Review and Assessment Service. The data integration engine was implemented as a map-side only job by using the MapReduce framework in Hadoop, thereby adopting the computing power of the nodes comprising the Hadoop cluster in parallel. The real-time data integration engine, composing the speed layer, collects real-time data by using APIs, such as the SGIS API of the National Statistical Office or public data APIs. The streaming data analysis pipeline is built on the collected data by using Apache Spark. The data analysis engine, composing the analysis layer, was designed as a pipeline for the prediction model based on the optimized analysis pipeline, which can identify correlations between factors and cardiovascular disorders and analyze the results. The service engine, composing the service layer, which can answer queries for a variety of user requirements based on the analyzed results, is composed of a hybrid web/hybrid mobile app to maximize service through suitable visualizations of the query results. The proposed prediction system, which emphasizes the importance of the real-time analysis of big data, is based on lambda architecture. Regarding the problem whereby big data analysis pipelines cannot typically match real-time speeds, the lambda architecture speed layer enables data analysis in real time; it then merges these with the results analyzed in the batch layer to provide improved results. Accordingly, we designed a framework for a cardiovascular disorder prediction system based on lambda architecture. In the future, this system can be used to help predict and optimize diagnoses and treatments for serious illnesses such as cardiovascular disorders and can also be applied to a variety of other diseases. Based on this study, a variety of health data from many people can be analyzed, including clinical, genomic, and lifestyle data; however, techniques must be developed that can provide the most suitable personal health solutions. By using integrated medical big data platforms, it may be possible to address the challenge of combatting diseases and to identify the prediction, progression, and prognosis of diseases through disease correlations, drug side effects, and genome research.

Conclusions and Future Work
We collected and stored data from a variety of domains that could be risk factors for cardiovascular disorders and designed a framework to construct a big data integrated DB. The framework was composed of data, speed, analysis, and service layers. The data generated in each layer were stored on distributed storage devices, including the national health and nutrition survey provided by the Korea Centers for Disease Control and Prevention and the DBs or files containing data provided by the Health Insurance Review and Assessment Service. The data integration engine was implemented as a map-side only job using the MapReduce framework in Hadoop, thereby exploiting the computing power of the nodes comprising the Hadoop cluster in parallel. The real-time data integration engine in the speed layer collects real-time data through APIs, such as the SGIS API of the National Statistical Office or public data APIs, and the collected data are processed in a streaming analysis pipeline built with Apache Spark. The data analysis engine in the analysis layer was designed as a pipeline for the prediction model based on the optimized analysis pipeline, which can identify correlations between external factors and cardiovascular disorders and analyze the results. The service engine in the service layer, which can answer queries for a variety of user requirements based on the analyzed results, is implemented as a hybrid web/mobile app to provide suitable visualizations of the query results. The proposed prediction system, which emphasizes the importance of real-time big data analysis, is based on lambda architecture. To address the problem that big data analysis pipelines typically cannot keep pace with real-time data, the speed layer of the lambda architecture analyzes data in real time; its results are then merged with those of the batch layer to provide improved results. Accordingly, we designed a framework for a cardiovascular disorder prediction system based on lambda architecture.
In the future, this system can be used to help predict and optimize the diagnosis and treatment of serious illnesses such as cardiovascular disorders and can also be applied to a variety of other diseases. Building on this study, a variety of health data from many people can be analyzed, including clinical, genomic, and lifestyle data; however, techniques must still be developed that can provide the most suitable personal health solutions. With integrated medical big data platforms, it may become possible to address the challenge of combatting diseases and to support the prediction, progression, and prognosis of diseases through research on disease correlations, drug side effects, and genomics.
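As a minimal illustration of the lambda-architecture merge described above, the sketch below combines a precomputed batch view with a speed-layer view of records that arrived after the last batch run; all function and field names are hypothetical and not taken from the system's actual implementation.

```python
# Minimal sketch of a lambda-architecture serving-layer merge.
# The batch view holds aggregates computed over historical data; the
# speed view holds aggregates for records that arrived after the last
# batch run. All names here are illustrative, not from the real system.

def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Merge batch and speed-layer counts into one serving view."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Example: counts per (region, date) key, echoing the location/date
# keys used in the integrated DB.
batch_view = {"Seoul\t2017-01-10": 42, "Busan\t2017-01-10": 17}
speed_view = {"Seoul\t2017-01-10": 3, "Daegu\t2017-01-10": 5}

serving_view = merge_views(batch_view, speed_view)
```

Merging at query time, rather than waiting for the next batch run, is what lets the speed layer compensate for the latency of the batch pipeline.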

Figure 1. Overview of the proposed prediction system.

Figure 6. Overview of the proposed prediction system based on lambda architecture. HDFS: Hadoop Distributed File System.

Figure 7 shows an example of climate data preprocessing: (a) is an XML file of raw data; (b) is a table explaining the meaning of the XML file; and (c) is an example output from the data integration engine. In (b), Grid x and Grid y are parsed from the information contained in the <header> tag. During preprocessing, the coordinate values are converted into a location name, and the date is replaced by a value combining the presentation time with the observation time. The date and location information are converted into a key through the integration process and output together with the remaining record data as the value. The output follows the MapReduce key/value format: the key and value are separated by a tab character, and the fields within the value are separated by spaces.
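The preprocessing steps described above can be sketched as follows. The XML tag names, the coordinate-to-location lookup, and the record fields are assumptions for illustration, since the actual Korea Meteorological Administration schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical coordinate-to-location lookup; the real engine would map
# KMA grid coordinates to administrative area names.
GRID_TO_LOCATION = {("60", "127"): "Seoul"}

# Illustrative raw input, standing in for the XML of Figure 7a.
SAMPLE_XML = """
<response>
  <header><gridx>60</gridx><gridy>127</gridy></header>
  <body>
    <item><baseDate>20170110</baseDate><baseTime>0600</baseTime>
          <category>T1H</category><obsrValue>-2.1</obsrValue></item>
  </body>
</response>
"""

def preprocess(xml_text: str) -> list:
    """Emit MapReduce-style lines: key and value separated by a tab,
    where the key joins location and observation date/time and the
    value carries the remaining record fields separated by spaces."""
    root = ET.fromstring(xml_text)
    gx = root.findtext("header/gridx")
    gy = root.findtext("header/gridy")
    # Convert grid coordinates into a location name.
    location = GRID_TO_LOCATION.get((gx, gy), f"{gx},{gy}")
    lines = []
    for item in root.iter("item"):
        date = item.findtext("baseDate")
        time = item.findtext("baseTime")
        key = f"{location}_{date}{time}"        # location + date/time key
        value = " ".join([item.findtext("category"),
                          item.findtext("obsrValue")])
        lines.append(f"{key}\t{value}")         # tab separates key and value
    return lines
```

Because each input record can be keyed and emitted independently, this kind of transformation needs no reduce phase, which is why a map-side only job suffices for the integration engine.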

Figure 7. An example of climate data preprocessing in the batch layer. (a) A sample of XML; (b) the meaning of the XML file; (c) an output file.

Symmetry 2017, 9, 11

Figure 8. Comparison of the execution time for MapReduce jobs and map-side only jobs according to the number of data records.

Table 1. Sample data of the proposed system.